Appeared in: Proceedings of International Conference on Artificial Neural Networks, ICANN-95, F. Fogelman-Soulié and P. Gallinari (eds.), EC2 et Cie, Paris, 1995, pp. 3-7.
Contextual Relations of Words
in Grimm Tales,
Analyzed by Self-Organizing Map
Timo Honkela, Ville Pulkki and Teuvo Kohonen
Helsinki University of Technology
Neural Networks Research Centre
Rakentajanaukio 2 C, FIN-02150, Espoo, FINLAND
tel: +358 451 3276, fax: +358 451 3277
Semantic roles of words in natural languages are reflected by the contexts in which they occur. These roles can explicitly be visualized by the Self-Organizing Map (SOM). In the experiments reported in this work the source data consisted of the raw text of Grimm fairy tales without any prior syntactic or semantic categorization of the words. The algorithm was able to create diagrams that seem to comply reasonably well with the traditional syntactical categorizations and human intuition about the semantics of the words.
1 Processing Natural Language with Self-Organizing Map
It has been shown earlier that the Self-Organizing Map (SOM) can be applied to the visualization of contextual roles of words, i.e., similarities in their usage in short contexts formed of adjacent words. This paper demonstrates that such relations or roles are also statistically reflected in unrestricted, even quaint natural expressions. The source material chosen for this experiment consisted of 200 Grimm tales (English translation).
In most practical applications of the SOM, the input to the map algorithm is derived from some measurements, usually after preprocessing. In such cases, the input vectors are supposed to have metric relations. Interpretation of languages, by contrast, must be based on the processing of sequences of discrete symbols. If the words were encoded numerically, the ordered sets formed of them could be compared with each other as well as with reference expressions. However, as no numerical value of the code should imply any order among the words themselves, it is necessary to use uncorrelated vectors for encoding. The simplest way to introduce uncorrelated codes is to assign a unit vector to each word. When all different words in the input material are listed, a code vector can be defined to have as many components as there are words in the list. This method, however, is only practicable in very small experiments. If the vocabulary is large, as in the present experiments, the words may instead be encoded by quasi-orthogonal random vectors of a much smaller dimensionality.
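The two encodings discussed above can be compared in a minimal sketch. The function names, the vocabulary size of 200, the code dimensionality of 90, and the use of normalized Gaussian vectors are all illustrative assumptions, not details given in this section.

```python
import numpy as np

def one_hot_codes(vocab):
    """Unit-vector code: one component per distinct word (exactly orthogonal)."""
    return dict(zip(vocab, np.eye(len(vocab))))

def random_codes(vocab, dim, seed=0):
    """Quasi-orthogonal code: random unit-length vectors of dimension `dim`.

    Drawing from a Gaussian and normalizing is an assumption made for
    this sketch; any roughly isotropic random distribution would do.
    """
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((len(vocab), dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return dict(zip(vocab, vecs))

def max_abs_cosine(codes):
    """Largest |cosine| between two different code vectors (0 = orthogonal)."""
    m = np.stack(list(codes.values()))
    gram = m @ m.T                 # all vectors have unit length
    np.fill_diagonal(gram, 0.0)    # ignore each vector paired with itself
    return float(np.abs(gram).max())

# Hypothetical vocabulary, for illustration only.
vocab = [f"word{i}" for i in range(200)]
unit = one_hot_codes(vocab)        # 200 components per word
rand = random_codes(vocab, dim=90) # 90 components per word

print(max_abs_cosine(unit))        # exactly orthogonal codes
print(max_abs_cosine(rand))        # small but nonzero correlations
```

The unit-vector code needs one component per vocabulary word, whereas the random code keeps the dimensionality fixed at the cost of small residual correlations between code vectors.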
To create a map of discrete symbols that occur within the sentences, each symbol must be presented in the due context. The context may consist of the immediate surroundings of the word in the text.
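As a rough illustration of presenting a word in its context, the sketch below describes each word by the averaged codes of its immediate neighbours. Taking exactly one preceding and one following word, averaging over all occurrences, and concatenating the two halves are assumptions made for this example; the function name `context_vectors` and the toy sentence are hypothetical.

```python
import numpy as np
from collections import defaultdict

def context_vectors(tokens, codes):
    """Average, per word, the concatenated codes of its two neighbours.

    Assumption for this sketch: the context is the single preceding
    and the single following word; first and last tokens are skipped
    because they lack one of the two neighbours.
    """
    dim = len(next(iter(codes.values())))
    sums = defaultdict(lambda: np.zeros(2 * dim))
    counts = defaultdict(int)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        sums[word] += np.concatenate([codes[prev], codes[nxt]])
        counts[word] += 1
    return {w: sums[w] / counts[w] for w in sums}

# Toy text and random word codes, for illustration only.
tokens = "the king saw the queen and the queen saw the king".split()
rng = np.random.default_rng(0)
codes = {w: rng.standard_normal(8) for w in sorted(set(tokens))}

ctx = context_vectors(tokens, codes)
print(ctx["queen"].shape)  # twice the code dimensionality
```

Vectors of this kind could then serve as inputs to a map algorithm, so that words used in similar surroundings receive similar input vectors.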
Applications of self-organizing maps to natural language processing have been described in several earlier works.