A. Montoyo, A. Suarez, G. Rigau and M. Palomar
Volume 23, 2005
Journal of Artificial Intelligence Research 23 (2005) 299-330. Submitted 07/04; published 03/05.

Combining Knowledge- and Corpus-based Word-Sense-Disambiguation Methods

Andres Montoyo, montoyo@dlsi.ua.es
Dept. of Software and Computing Systems, University of Alicante, Spain

Armando Suárez, armando@dlsi.ua.es
Dept. of Software and Computing Systems, University of Alicante, Spain

German Rigau, rigau@si.ehu.es
IXA Research Group, Computer Science Department, Basque Country University, Donostia

Manuel Palomar, mpalomar@dlsi.ua.es
Dept. of Software and Computing Systems, University of Alicante, Spain

Abstract

In this paper we concentrate on the resolution of the lexical ambiguity that arises when a given word has several different meanings. This specific task is commonly referred to as word sense disambiguation (WSD). The task of WSD consists of assigning the correct sense to words using an electronic dictionary as the source of word definitions. We present two WSD methods based on two main methodological approaches in this research area: a knowledge-based method and a corpus-based method. Our hypothesis is that word-sense disambiguation requires several knowledge sources in order to solve the semantic ambiguity of the words. These sources can be of different kinds (for example, syntagmatic, paradigmatic or statistical information). Our approach combines various sources of knowledge, through combinations of the two WSD methods mentioned above. Mainly, the paper concentrates on how to combine these methods and sources of information in order to achieve good results in the disambiguation. Finally, this paper presents a comprehensive study and experimental work on evaluation of the methods and their combinations.

1. Introduction

Knowledge technologies aim to provide meaning to the petabytes of information content that our multilingual societies will generate in the near future. Specifically, a wide range of advanced techniques are required to progressively automate the knowledge lifecycle.
These include analyzing, and then automatically representing and managing, high-level meanings from large collections of content data. However, to be able to build the next generation of intelligent open-domain knowledge application systems, we need to deal with concepts rather than words.

© 2005 AI Access Foundation. All rights reserved.

1.1 Dealing with Word Senses

In natural language processing (NLP), word sense disambiguation (WSD) is defined as the task of assigning the appropriate meaning (sense) to a given word in a text or discourse. As an example, consider the following three sentences:

1. Many cruise missiles have fallen on Baghdad.
2. Music sales will fall by up to 15% this year.
3. U.S. officials expected Basra to fall early.

Any system that tries to determine the meanings of the three sentences will need to represent somehow three different senses for the verb fall. In the first sentence, the missiles have been launched on Baghdad. In the second sentence, sales will decrease, and in the third the city will surrender early. WordNet 2.0 (Miller, 1995; Fellbaum, 1998; see footnote 1) contains thirty-two different senses for the verb fall as well as twelve different senses for the noun fall. Note also that the first and third sentences belong to the same military domain, but use the verb fall with two different meanings.

Thus, a WSD system must be able to assign the correct sense of a given word (in these examples, fall) depending on the context in which the word occurs. In the example sentences, these are, respectively, senses 1, 2 and 9, as listed below.

• 1. fall: "descend in free fall under the influence of gravity" ("The branch fell from the tree"; "The unfortunate hiker fell into a crevasse").
• 2. descend, fall, go down, come down: "move downward but not necessarily all the way" ("The temperature is going down"; "The barometer is falling"; "Real estate prices are coming down").
• 9. fall: "be captured" ("The cities fell to the enemy").

Providing innovative technology to solve this problem will be one of the main challenges in language engineering to access advanced knowledge technology systems.

1.2 Word-Sense Disambiguation

Word sense ambiguity is a central problem for many established Human Language Technology applications (e.g., machine translation, information extraction, question answering, information retrieval, text classification, and text summarization) (Ide & Véronis, 1998). This is also the case for associated subtasks (e.g., reference resolution, acquisition of subcategorization patterns, parsing, and, obviously, semantic interpretation). For this reason, many international research groups are working on WSD, using a wide range of approaches. However, to date, no large-scale, broad-coverage, accurate WSD system has been built (Snyder & Palmer, 2004). With current state-of-the-art accuracy in the range 60-70%, WSD is one of the most important open problems in NLP.

Footnote 1: http://www.cogsci.princeton.edu/~wn/

Even though most WSD techniques are usually presented as stand-alone techniques, it is our belief, following McRoy (1992), that full-fledged lexical ambiguity resolution will require integrating several information sources and techniques.

In this paper, we present two complementary WSD methods based on two different methodological approaches, a knowledge-based method and a corpus-based method, as well as several methods that combine both into hybrid approaches.

The knowledge-based method disambiguates nouns by matching context with information from a prescribed knowledge source. WordNet is used because it combines the characteristics of both a dictionary and a structured semantic network, providing definitions for the different senses of the English words and defining groups of synonymous words by means of synsets, which represent distinct lexical concepts. WordNet also organizes words into a conceptual structure by representing a number of semantic relationships (hyponymy, hypernymy, meronymy, etc.) among synsets.

The corpus-based method implements a supervised machine-learning (ML) algorithm that learns from annotated sense examples. The corpus-based system usually represents linguistic information for the context of each sentence (e.g., usage of an ambiguous word) in the form of feature vectors. These features may be of a distinct nature: word collocations, part-of-speech labels, keywords, topic and domain information, grammatical relationships, etc.

Based on these two approaches, the main objectives of the work presented in this paper are:

• To study the performance of different mechanisms of combining information sources by using knowledge-based and corpus-based WSD methods together.
• To show that a knowledge-based method can help a corpus-based method to better perform the disambiguation process, and vice versa.
• To show that the combination of both approaches outperforms each of the methods taken individually, demonstrating that the two approaches can play complementary roles.
• Finally, to show that both approaches can be applied in several languages. In particular, we will perform several experiments in Spanish and English.

In the following section a summary of the background of word sense disambiguation is presented. Sections 2.1 and 2.2 describe the knowledge-based and corpus-based systems used in this work. Section 3 describes two WSD methods: the specification marks method and the maximum entropy-based method. Section 4 presents an evaluation of our results using different system combinations. Finally, some conclusions are presented, along with a brief discussion of work in progress.

2. Some Background on WSD

Since the 1950s, many approaches have been proposed for assigning senses to words in context, although early attempts only served as models for toy systems. Currently, there are two main methodological approaches in this area: knowledge-based and corpus-based methods. Knowledge-based methods use external knowledge resources, which define explicit
sense distinctions for assigning the correct sense of a word in context. Corpus-based methods use machine-learning techniques to induce models of word usages from large collections of text examples. Both knowledge-based and corpus-based methods present different benefits and drawbacks.

2.1 Knowledge-based WSD

Work on WSD reached a turning point in the 1980s and 1990s when large-scale lexical resources such as dictionaries, thesauri, and corpora became widely available. The work done earlier on WSD was theoretically interesting but practical only in extremely limited domains. Since Lesk (1986), many researchers have used machine-readable dictionaries (MRDs) as a structured source of lexical knowledge to deal with WSD. These approaches, by exploiting the knowledge contained in the dictionaries, mainly seek to avoid the need for large amounts of training material. Agirre and Martinez (2001b) distinguish ten different types of information that can be useful for WSD. Most of them can be located in MRDs, and include part of speech, semantic word associations, syntactic cues, selectional preferences, and frequency of senses, among others.

In general, WSD techniques using pre-existing structured lexical knowledge resources differ in:

• the lexical resource used (monolingual and/or bilingual MRDs, thesauri, lexical knowledge base, etc.);
• the information contained in this resource, exploited by the method; and
• the property used to relate words and senses.

Lesk (1986) proposes a method for guessing the correct word sense by counting word overlaps between dictionary definitions of the words in the context of the ambiguous word. Cowie et al. (1992) use the simulated annealing technique to overcome the combinatorial explosion of the Lesk method. Wilks et al.
(1993) use co-occurrence data extracted from an MRD to construct word-context vectors, and thus word-sense vectors, to perform a large set of experiments to test relatedness functions between words and vector-similarity functions.

Other approaches measure the relatedness between words, taking as a reference a structured semantic net. Thus, Sussna (1993) employs the notion of conceptual distance between network nodes in order to improve precision during document indexing. Agirre and Rigau (1996) present a method for the resolution of the lexical ambiguity of nouns using the WordNet noun taxonomy and the notion of conceptual density. Rigau et al. (1997) combine a set of knowledge-based algorithms to accurately disambiguate definitions of MRDs. Mihalcea and Moldovan (1999) suggest a method that attempts to disambiguate all the nouns, verbs, adverbs, and adjectives in a given text by referring to the senses provided by WordNet. Magnini et al. (2002) explore the role of domain information in WSD using WordNet domains (Magnini & Strapparava, 2000); in this case, the underlying hypothesis is that information provided by domain labels offers a natural way to establish semantic relations among word senses, which can be profitably used during the disambiguation process.

Although knowledge-based systems have been proven to be ready-to-use and scalable tools for all-words WSD because they do not require sense-annotated data (Montoyo et al.,
2001), in general, supervised, corpus-based algorithms have obtained better precision than knowledge-based ones.

2.2 Corpus-based WSD

In the last fifteen years, empirical and statistical approaches have had a significantly increased impact on NLP. Of increasing interest are algorithms and techniques that come from the machine-learning (ML) community, since these have been applied to a large variety of NLP tasks with remarkable success. The reader can find an excellent introduction to ML, and its relation to NLP, in the articles by Mitchell (1997), Manning and Schütze (1999), and Cardie and Mooney (1999), respectively. The types of NLP problems initially addressed by statistical and machine-learning techniques are those of language-ambiguity resolution, in which the correct interpretation should be selected from among a set of alternatives in a particular context (e.g., word-choice selection in speech recognition or machine translation, part-of-speech tagging, word-sense disambiguation, co-reference resolution, etc.). These techniques are particularly adequate for NLP because such problems can be regarded as classification problems, which have been studied extensively in the ML community. Regarding automatic WSD, one of the most successful approaches in the last ten years is supervised learning from examples, in which statistical or ML classification models are induced from semantically annotated corpora. Generally, supervised systems have obtained better results than unsupervised ones, a conclusion that is based on experimental work and international competitions (see footnote 2). This approach uses semantically annotated corpora to train machine-learning (ML) algorithms to decide which word sense to choose in which contexts. The words in such annotated corpora are tagged manually using semantic classes taken from a particular lexical semantic resource (most commonly WordNet).
Many standard ML techniques have been tried, including Bayesian learning (Bruce & Wiebe, 1994), Maximum Entropy (Suárez & Palomar, 2002a), exemplar-based learning (Ng, 1997; Hoste et al., 2002), decision lists (Yarowsky, 1994; Agirre & Martinez, 2001a), neural networks (Towell & Voorhees, 1998), and, recently, margin-based classifiers like boosting (Escudero et al., 2000) and support vector machines (Cabezas et al., 2001).

Corpus-based methods are called "supervised" when they learn from previously sense-annotated data, and therefore they usually require a large amount of human intervention to annotate the training data (Ng, 1997). Although several attempts have been made to address it (e.g., Leacock et al., 1998; Mihalcea & Moldovan, 1999; Cuadros et al., 2004), the knowledge acquisition bottleneck (too many languages, too many words, too many senses, too many examples per sense) is still an open problem that poses serious challenges to the supervised learning approach for WSD.

Footnote 2: http://www.senseval.org

3. WSD Methods

In this section we present two WSD methods based, respectively, on the two main methodological approaches outlined above: a specification marks method (SM) (Montoyo & Palomar, 2001) as a knowledge-based method, and a maximum entropy-based method (ME) (Suárez & Palomar, 2002b) as a corpus-based method. The selected methods can be seen
as representatives of both methodological approaches. The specification marks method is inspired by the conceptual density method (Agirre & Rigau, 1996), and the maximum entropy method has also been used in other WSD systems (Dang et al., 2002).

3.1 Specification Marks Method

The underlying hypothesis of this knowledge-based method is that the higher the similarity between two words, the larger the amount of information shared by two of their concepts. In this case, the information commonly shared by several concepts is indicated by the most specific concept that subsumes them in the taxonomy.

The input for this WSD module is a group of nouns W = {w1, w2, ..., wn} in a context. Each word wi is sought in WordNet, each having an associated set of possible senses Si = {Si1, Si2, ..., Sin}, and each sense having a set of concepts in the IS-A taxonomy (hypernymy/hyponymy relations). First, this method obtains the concept common to all the senses of the words that form the context. This concept is marked as the initial specification mark (ISM). If this initial specification mark does not resolve the ambiguity of the word, we then descend through the WordNet hierarchy, from one level to another, assigning new specification marks. For each specification mark, the number of concepts contained within the subhierarchy is then counted. The sense that corresponds to the specification mark with the highest number of words is the one chosen as the sense of the word within the given context. Figure 1 illustrates graphically how the word plant, having four different senses, is disambiguated in a context that also has the words tree, perennial, and leaf. It can be seen that the initial specification mark does not resolve the lexical ambiguity, since the word plant appears in two subhierarchies with different senses.
The specification mark identified by {plant2, flora2}, however, contains the highest number of words (three) from the context and will therefore be the one chosen, resolving the word plant to sense two. The words tree and perennial are also disambiguated, choosing sense one for both. The word leaf does not appear in the subhierarchy of the specification mark {plant2, flora2}, and therefore this word has not been disambiguated. Such words are beyond the scope of the disambiguation algorithm. They will be left aside to be processed by a complementary set of heuristics (see section 3.1.2).

3.1.1 Disambiguation Algorithm

In this section, we formally describe the SM algorithm, which consists of the following five steps:

Step 1: All nouns are extracted from a given context. These nouns constitute the input context, Context = {w1, w2, ..., wn}. For example, Context = {plant, tree, perennial, leaf}.

Step 2: For each noun wi in the context, all its possible senses Si = {Si1, Si2, ..., Sin} are obtained from WordNet. For each sense Sij, the hypernym chain is obtained and stored in order into stacks. For example, Table 1 shows all the hypernym synsets for each sense of the word plant.

Step 3: To each sense appearing in the stacks, the method associates the list of subsumed senses
from the context (see Figure 2, which illustrates the list of subsumed senses for plant1 and plant2).

entity1  [Initial Specification Mark (ISM)]
  object1
    natural object1 > plant part1 > plant organ1 > leaf1
    substance1 > material1 > paper1 > sheet2 > leaf2
    part4 > section4 > leaf3
    artifact1 > structure1 > building complex1 > plant1
  life form1
    plant2, flora2 (*)
      perennial1
      vascular plant1 > woody plant1 > tree1
    person1 > entertainer1 > performer1 > actor1 > plant4

Figure 1: Specification Marks (entity1 is the initial specification mark; lower levels are labeled SM in the original figure)

plant1              plant2        plant3                    plant4
building complex1   life form1    contrivance3              actor1
structure1          entity1       scheme1                   performer1
artifact1                         plan of action1           entertainer1
object1                           plan1                     person1
entity1                           idea1                     life form1
                                  content5                  entity1
                                  cognition1
                                  psychological feature1

Table 1: Hypernym synsets of plant

Step 4: Beginning from the initial specification marks (the top synsets), the program descends recursively through the hierarchy, from one level to another, assigning to each specification mark the number of context words subsumed.

Figure 3 shows the word counts for plant1 through plant4 located within the specification marks entity1, life form1, and {plant2, flora2}. For the entity1 specification mark, senses 1, 2, and 4 have the same maximal word count (4). Therefore, it is not possible to disambiguate the word plant using the entity1 specification mark, and it is necessary to go down one level of the hyponym hierarchy by changing the specification mark. Choosing the specification mark life form1, senses 2 and 4 of plant have the same maximal word count (3). Finally, it is possible to disambiguate the word plant with sense 2 using the {plant2, flora2} specification mark, because this sense has the highest word density (in this case, 3).

For PLANT1:
  plant1              plant1
  building complex1   plant1
  structure1          plant1
  artifact1           plant1
  object1             plant1, leaf1, leaf2, leaf3
  entity1             plant1, plant2, plant4, tree1, perennial1, leaf1, leaf2, leaf3
For PLANT2:
  plant2              plant2, tree1, perennial1
  life form1          plant2, plant4, tree1, perennial1
  entity1             plant1, plant2, plant4, tree1, perennial1, leaf1, leaf2, leaf3

Figure 2: Data Structure for Senses of the Word Plant

For PLANT located within the specification mark entity1:
  For PLANT1: 4 (plant, tree, perennial, leaf)
  For PLANT2: 4 (plant, tree, perennial, leaf)
  For PLANT3: 1 (plant)
  For PLANT4: 4 (plant, tree, perennial, leaf)
...
located within the specification mark life form1:
  For PLANT1: 1 (plant)
  For PLANT2: 3 (plant, tree, perennial)
  For PLANT3: 1 (plant)
  For PLANT4: 3 (plant, tree, perennial)
located within the specification mark plant2, flora2:
  For PLANT1: 1 (plant)
  For PLANT2: 3 (plant, tree, perennial)
  For PLANT3: 1 (plant)
  For PLANT4: 1 (plant)

Figure 3: Word Counts for Four Senses of the Word Plant

Step 5: In this step, the method selects the word sense(s) having the greatest number of words counted in Step 4. If there is only one such sense, then that is the one chosen. If there is more than one, we repeat Step 4, moving down each level within the taxonomy until a single sense is obtained or the program reaches a leaf specification mark. Figure 3 shows the word counts for each sense of plant (1 through 4) located within the specification marks entity1, life form1, and {plant2, flora2}. If the word cannot be disambiguated in this way, then it will be necessary to continue the disambiguation process by applying a complementary set of heuristics.

3.1.2 Heuristics

The specification marks method is combined with a set of five knowledge-based heuristics: hypernym/hyponym, definition, gloss hypernym/hyponym, common specification mark, and domain heuristics.
A short description of each of these methods is provided below.
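
Before turning to the individual heuristics, the descent described in Section 3.1.1 (Steps 2 to 5) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The hypernym chains are hard-coded toy data read off Figure 1 and Table 1 rather than retrieved from WordNet, the names CHAINS, words_under and disambiguate are hypothetical, and each candidate sense is scored with the mark found at the current depth of its own chain, so intermediate counts may differ from those reported in Figure 3.

```python
# Toy hypernym chains, root first (assumption: transcribed from Figure 1 / Table 1).
CHAINS = {
    "plant": {
        "plant1": ["entity1", "object1", "artifact1", "structure1",
                   "building complex1", "plant1"],
        "plant2": ["entity1", "life form1", "plant2"],
        "plant3": ["psychological feature1", "cognition1", "content5", "idea1",
                   "plan1", "plan of action1", "scheme1", "contrivance3", "plant3"],
        "plant4": ["entity1", "life form1", "person1", "entertainer1",
                   "performer1", "actor1", "plant4"],
    },
    "tree": {
        "tree1": ["entity1", "life form1", "plant2", "vascular plant1",
                  "woody plant1", "tree1"],
    },
    "perennial": {
        "perennial1": ["entity1", "life form1", "plant2", "perennial1"],
    },
    "leaf": {
        "leaf1": ["entity1", "object1", "natural object1", "plant part1",
                  "plant organ1", "leaf1"],
        "leaf2": ["entity1", "object1", "substance1", "material1", "paper1",
                  "sheet2", "leaf2"],
        "leaf3": ["entity1", "object1", "part4", "section4", "leaf3"],
    },
}


def words_under(mark):
    """Context words with at least one sense subsumed by the given mark."""
    return {
        word
        for word, senses in CHAINS.items()
        for chain in senses.values()
        if mark in chain
    }


def disambiguate(target):
    """Descend level by level, keeping the senses with the highest word count."""
    candidates = dict(CHAINS[target])
    max_depth = max(len(chain) for chain in candidates.values())
    for depth in range(max_depth):
        counts = {
            sense: len(words_under(chain[min(depth, len(chain) - 1)]))
            for sense, chain in candidates.items()
        }
        best = max(counts.values())
        candidates = {s: c for s, c in candidates.items() if counts[s] == best}
        if len(candidates) == 1:
            break
    return sorted(candidates)  # more than one sense means "still ambiguous"


result_plant = disambiguate("plant")
result_leaf = disambiguate("leaf")
```

On this toy data the descent keeps only plant2 for plant, while all three senses of leaf survive, which is the situation the heuristics below are meant to handle.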

3.1.3 Hypernym/Hyponym Heuristic

This heuristic solves the ambiguity of those words that are not explicitly related in WordNet (e.g., leaf is not directly related to plant, but rather follows a hypernym chain plus a PART-OF relation). All the hypernyms/hyponyms of the ambiguous word are checked, looking for synsets that have compounds that match some word from the context. Each synset in the hypernym/hyponym chain is weighted in accordance with its depth within the subhierarchy. The sense having the greatest weight is then chosen. Figure 4 shows that leaf1, being a hyponym of plant organ1, is disambiguated because plant is contained within the context of leaf; it obtains the greatest weight:

$\mathrm{weight}(\mathit{leaf}_1) = \sum_{i=1}^{\mathit{depth}} \frac{\mathit{level}_i}{\mathit{total\ levels}} = \frac{4}{6} + \frac{5}{6} = 1.5$

Context: plant, tree, leaf, perennial
Word not disambiguated: leaf
Senses: leaf1, leaf2, leaf3
For leaf1:
  => entity, something                  Level 1
    => object, physical object          Level 2
      => natural object                 Level 3
        => plant part                   Level 4
          => plant organ                Level 5
            => leaf1, leafage, foliage  Level 6

Figure 4: Application of the Hypernym Heuristic

3.1.4 Definition Heuristic

In this case, all the glosses from the synsets of an ambiguous word are checked, looking for those that contain words from the context. Each match increases the synset count by one. The sense having the greatest count is then chosen. Figure 5 shows an example of this heuristic. The sense sister1 is chosen, because it has the greatest weight.

Context: person, sister, musician
Words not disambiguated: sister, musician
Senses: sister1, sister2, sister3, sister4
For sister1: Weight = 2
  1. sister, sis - (a female person who has the same parents as another person; "my sister married a musician")
For sister3: Weight = 1
  3. sister - (a female person who is a fellow member (of a sorority or labor union or other group); "none of her sisters would betray her")

Figure 5: Application of the Definition Heuristic
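
The counting behind Figure 5 can be sketched as follows. This is an illustrative Python sketch, not the original system: the function name definition_scores is hypothetical, only the two glosses printed in Figure 5 are used, and two details are assumptions chosen so that the figure's weights are reproduced, namely that the target word itself is not counted and that each context word counts at most once per gloss.

```python
import re


def definition_scores(target, context, glosses):
    """Count, per sense, the distinct context words that occur in its gloss."""
    # Assumption: the ambiguous word itself is not a useful cue, so drop it.
    cues = {word.lower() for word in context} - {target.lower()}
    scores = {}
    for sense, gloss in glosses.items():
        tokens = set(re.findall(r"[a-z]+", gloss.lower()))
        scores[sense] = len(cues & tokens)
    return scores


# Glosses as printed in Figure 5 (sister2 and sister4 are not shown there).
GLOSSES = {
    "sister1": 'a female person who has the same parents as another person; '
               '"my sister married a musician"',
    "sister3": 'a female person who is a fellow member (of a sorority or labor '
               'union or other group); "none of her sisters would betray her"',
}

scores = definition_scores("sister", ["person", "sister", "musician"], GLOSSES)
best_sense = max(scores, key=scores.get)
```

With these inputs sister1 matches person and musician, sister3 matches only person, and sister1 is selected, in line with the weights shown in Figure 5.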

3.1.5 Gloss Hypernym/Hyponym Heuristic

This method extends the previously defined hypernym/hyponym heuristic by using glosses of the hypernym/hyponym synsets of the ambiguous word. To disambiguate a given word, all the glosses of the hypernym/hyponym synsets are checked, looking for words occurring in the context. Coincidences are counted. As before, the synset having the greatest count is chosen. Figure 6 shows an example of this heuristic. The sense plane1 is chosen, because it has the greatest weight.

Context: plane, air
Words not disambiguated: plane
Senses: plane1, plane2, plane3, plane4, plane5
For plane1: Weight = 1
  airplane, aeroplane, plane - (an aircraft that has a fixed wing and is powered by propellers or jets; "the flight was delayed due to trouble with the airplane")
  => aircraft - (a vehicle that can fly)
    => craft - (a vehicle designed for navigation in or on water or air or through outer space)
      => vehicle - (a conveyance that transports people or objects)
        => conveyance, transport - (something that serves as a means of transportation)
          => instrumentality, instrumentation - (an artifact (or system of artifacts) that is instrumental in accomplishing some end)
            => artifact, artefact - (a man-made object)
              => object, physical object - (a physical (tangible and visible) entity; "it was full of rackets, balls and other objects")
                => entity, something - (anything having existence (living or nonliving))

Figure 6: Application of the Gloss Hypernym Heuristic

3.1.6 Common Specification Mark Heuristic

In most cases, the senses of the words to be disambiguated are very close to each other and differ only in subtle nuances. The Common Specification Mark heuristic reduces the ambiguity of a word without trying to provide a full disambiguation. Thus, we select the specification mark that is common to all senses of the context words, reporting all senses instead of choosing a single sense from among them.
To illustrate this heuristic, consider Figure 7. In this example, the word month is not able to discriminate completely among the four senses of the word year. However, in this case, the presence of the word month can help to select two possible senses of the word year when selecting {time period, period} as a common specification mark. This specification mark represents the most specific common synset of a particular set of words. Therefore, this heuristic selects the sense month1 and the senses year1 and year2, instead of attempting to choose a single sense or leaving them completely ambiguous.

Context: year, month
Words not disambiguated: year
Senses: year1, year2, year3, year4
For year1:  => abstraction => measure, quantity => time period, period => year1, twelvemonth
For year2:  => abstraction => measure, quantity => time period, period => year2
For month1: => abstraction => measure, quantity => time period, period => month1

Figure 7: Example of Common Specification Mark Heuristic

3.1.7 Domain WSD Heuristic

This heuristic uses a derived resource, "relevant domains" (Montoyo et al., 2003), which is obtained by combining both the WordNet glosses and WordNet Domains (Magnini & Strapparava, 2000; see http://wndomains.itc.it/). WordNet Domains establishes a semantic relation between word senses by grouping them into the same semantic domain (Sports, Medicine, etc.). The word bank, for example, has ten senses in WordNet 2.0, but three of them, "bank1", "bank3" and "bank6", are grouped into the same domain label, Economy, whereas "bank2" and "bank7" are grouped into the domain labels Geography and Geology. These domain labels are selected from a set of 165 hierarchically organized labels. In that way, a domain connects words that belong to different subhierarchies and parts of speech.

"Relevant domains" is a lexicon derived from the WordNet glosses using WordNet Domains. In fact, we use WordNet as a corpus categorized with domain labels. For each English word appearing in the glosses of WordNet, we obtain a list of its most representative domain labels. The relevance is obtained by weighting each possible label with the "Association Ratio" formula (AR), where w is a word and D is a domain:

$AR(w|D) = P(w|D) \cdot \log \frac{P(w|D)}{P(w)}$    (1)

This list can also be considered as a weighted vector (or point in a multidimensional space). Using such word vectors of "Relevant domains", we can derive new vectors to represent sets of words, for instance contexts or glosses. We can then compare the similarity between a given context and each of the possible senses of a polysemous word, using for instance the cosine function.
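
The two ingredients just described, the Association Ratio of equation (1) and the cosine comparison between a context vector and sense vectors, can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the natural logarithm is an assumption (the base is not stated here), and the domain weights below are invented toy values, not figures from the paper.

```python
import math


def association_ratio(p_w_given_d, p_w):
    """AR(w|D) = P(w|D) * log(P(w|D) / P(w)); natural log is an assumption."""
    return p_w_given_d * math.log(p_w_given_d / p_w)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(domain, 0.0) for domain, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: domain label -> accumulated AR weight.
context_vector = {"Biology": 0.9, "Ecology": 0.3}
sense_vectors = {
    "sense_a": {"Biology": 0.8, "Ecology": 0.5},
    "sense_b": {"Sport": 1.0},
}

similarities = {
    sense: cosine(context_vector, vector)
    for sense, vector in sense_vectors.items()
}
chosen = max(similarities, key=similarities.get)
```

A word that is no more likely inside a domain than overall gets an AR of zero, and the sense whose domain profile points in the same direction as the context wins regardless of vector length, which is why the cosine is a convenient choice for comparing vectors built from different numbers of words.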
Figure 8 shows an example for disambiguating the word genotype in the following text: "There are a number of ways in which the chromosome structure can change, which will detrimentally change the genotype and phenotype of the organism." First, the glosses of the word to be disambiguated and the context are POS-tagged and analyzed morphologically. Second, we build the context vector (CV), which combines in one structure the most relevant and representative domains related to the words from the text to be disambiguated. Third, in the same way, we build the sense vectors (SV), which group the most relevant and representative domains of the gloss that is associated with each one of the word senses. In this example, the senses are genotype1 (a group of organisms sharing a specific genetic constitution) and genotype2 (the particular alleles at specified loci present in an organism). Finally, in order to select the appropriate sense, we compare all sense vectors with the context vector, and we select the sense closest to the context vector. In this example, we show the sense vector for sense genotype1, and we select the genotype1 sense because its cosine is higher.

CV (context vector), domain and AR value:
  Biology      0.00102837
  Ecology      0.00402855
  Botany       3.20408e-06
  Zoology      1.77959e-05
  Anatomy      1.29592e-05
  Physiology   0.00022653
  Chemistry    0.000179857
  Geology      1.66327e-05
  Meteorology  0.00371308
  ...
SV1 (sense vector for genotype1), domain and AR value:
  Ecology       0.084778
  Biology       0.047627
  Bowling       0.019687
  Archaeology   0.016451
  Sociology     0.014251
  Alimentation  0.006510
  Linguistics   0.005297
  ...
Selected sense: genotype1 = 0.00804111; genotype2 = 0.00340548

Figure 8: Example of Domain WSD Heuristic

Defining this heuristic as "knowledge-based" or "corpus-based" can be seen as controversial, because it uses WordNet glosses (and WordNet Domains) as a corpus to derive the "relevant domains"; that is, it applies corpus techniques to WordNet. However, WordNet Domains was constructed semi-automatically (prescribed) following the hierarchy of WordNet.

3.1.8 Evaluation of Specification Marks Method

Obviously, we can also use different strategies to combine a set of knowledge-based heuristics. For instance, all the heuristics described in the previous section can be applied in order, passing to the next heuristic only the remaining ambiguity that previous heuristics were not able to solve.

In order to evaluate the performance of the knowledge-based heuristics previously defined, we used the SemCor collection (Miller et al., 1993), in which all content words are annotated with the most appropriate WordNet sense. In this case, we used a window of fifteen nouns (seven context nouns before and after the target noun).

The results obtained for the specification marks method using the heuristics when applied one by one are shown in Table 2.
This table shows the results for polysemous nouns only, and for polysemous and monosemous nouns combined.
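
The cascade strategy just described, and the majority-voting alternative that is evaluated alongside it in this section, can be contrasted in a short Python sketch. Everything here is illustrative: the function names, the convention that a heuristic takes a list of candidate senses and returns the subset it still accepts, and the four toy heuristics are assumptions, not the interface of the original system.

```python
from collections import Counter


def cascade(candidates, heuristics):
    """Apply heuristics in order; each one sees only the remaining ambiguity."""
    remaining = list(candidates)
    for heuristic in heuristics:
        if len(remaining) <= 1:
            break
        narrowed = heuristic(remaining)
        if narrowed:  # an empty answer means the heuristic abstains
            remaining = narrowed
    return remaining


def majority_vote(candidates, heuristics):
    """Each heuristic works independently and votes if it picks a single sense."""
    votes = Counter()
    for heuristic in heuristics:
        picked = heuristic(list(candidates))
        if len(picked) == 1:
            votes[picked[0]] += 1
    if not votes:
        return None
    # Ties are broken arbitrarily (first sense to reach the top count).
    return votes.most_common(1)[0][0]


# Hypothetical toy heuristics over three candidate senses.
toy_heuristics = [
    lambda senses: [s for s in senses if s != "s3"],  # rules out s3 only
    lambda senses: [s for s in senses if s == "s2"],  # prefers s2
    lambda senses: ["s1"] if "s1" in senses else [],  # prefers s1
    lambda senses: ["s2"] if "s2" in senses else [],  # prefers s2
]

cascade_result = cascade(["s1", "s2", "s3"], toy_heuristics)
vote_result = majority_vote(["s1", "s2", "s3"], toy_heuristics)
```

In the cascade, an early heuristic can settle the decision before later ones are consulted, whereas voting lets every heuristic see the full set of senses; the sketch makes no claim about which behaves better in practice.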

Heuristics                       Precision   Recall   Coverage
Polysemic and monosemic nouns    0.553       0.522    0.943
Only polysemic nouns             0.377       0.311    0.943

Table 2: Results Using Heuristics Applied in Order on SemCor

The results obtained for the heuristics applied independently are shown in Table 3. As shown, all the heuristics perform differently, providing different precision/recall figures.

                        Precision              Recall                 Coverage
Heuristics              Mono+Poly  Polysemic   Mono+Poly  Polysemic   Mono+Poly  Polysemic
Spec. Mark Method       0.383      0.300       0.341      0.292       0.975      0.948
Hypernym                0.563      0.420       0.447      0.313       0.795      0.745
Definition              0.480      0.300       0.363      0.209       0.758      0.699
Hyponym                 0.556      0.393       0.436      0.285       0.784      0.726
Gloss hypernym          0.555      0.412       0.450      0.316       0.811      0.764
Gloss hyponym           0.617      0.481       0.494      0.358       0.798      0.745
Common specification    0.565      0.423       0.443      0.310       0.784      0.732
Domain WSD              0.585      0.453       0.483      0.330       0.894      0.832

Table 3: Results Using Heuristics Applied Independently

Another possibility is to combine all the heuristics using a majority voting schema (Rigau et al., 1997). In this simple schema, each heuristic provides a vote, and the method selects the synset that obtains the most votes. The results shown in Table 4 illustrate that when the heuristics work independently, the method achieves a 39.1% recall for polysemous nouns (with full coverage), which represents an improvement of 8 percentage points over the method in which heuristics are applied in order (one by one).

                    Precision              Recall
                    Mono+Poly  Polysemic   Mono+Poly  Polysemic
Voting heuristics   0.567      0.436       0.546      0.391

Table 4: Results using majority voting on SemCor

We also show in Table 5 the results of our domain heuristic when applied to the English all-words task from Senseval-2. As the table suggests, the polysemy reduction caused by domain clustering can profitably help WSD.
Since domains are coarser than synsets, word domaindisambiguation (WDD) (Magnini & Strapparava, 2000) can obtain better results than WSD.Our goal is to perform a preliminary domain disambiguation in order to provide an informedsearchâspace reduction. 3.1.9 Comparison with Knowledge-based Methods In this section we compare three diï¬erent knowledge-based methods: conceptual density(Agirre & Rigau, 1996), a variant of the conceptual density algorithm (Fern´ andez-Amor´ os et al., 2001); the Lesk method (Lesk, 1986) ; and the speciï¬cation marks method. 311
Level WSD    Precision   Recall
Sense        0.44        0.32
Domain       0.54        0.43

Table 5: Results of Use of Domain WSD Heuristic

Table 6 shows recall results for the three methods when applied to the entire SemCor collection. Our best result achieved 39.1% recall. This is an important improvement with respect to other methods, but the results are still far below the most frequent sense heuristic. Obviously, none of the knowledge-based techniques and heuristics presented above are sufficient, in isolation, to perform accurate WSD. However, we have empirically demonstrated that a simple combination of knowledge-based heuristics can lead to improvements in the WSD process.

WSD Method                     Recall
SM and Voting Heuristics       0.391
UNED Method                    0.313
SM with Cascade Heuristics     0.311
Lesk                           0.274
Conceptual Density             0.220

Table 6: Recall results using three different knowledge-based WSD methods

3.2 Maximum Entropy-based Method

Maximum Entropy modeling provides a framework for integrating information for classification from many heterogeneous information sources (Manning & Schütze, 1999; Berger et al., 1996). ME probability models have been successfully applied to some NLP tasks, such as POS tagging or sentence-boundary detection (Ratnaparkhi, 1998).

The WSD method used in this work is based on conditional ME models. It has been implemented using a supervised learning method that consists of building word-sense classifiers using a semantically annotated corpus. A classifier obtained by means of an ME technique consists of a set of parameters or coefficients which are estimated using an optimization procedure. Each coefficient is associated with one feature observed in the training data.
The goal is to obtain the probability distribution that maximizes the entropy, that is, maximum ignorance is assumed and nothing apart from the training data is considered. One advantage of using the ME framework is that even knowledge-poor features may be applied accurately; the ME framework thus allows a virtually unrestricted ability to represent problem-specific knowledge in the form of features (Ratnaparkhi, 1998).

Let us assume a set of contexts X and a set of classes C. The function cl : X → C chooses the class c with the highest conditional probability in the context x: cl(x) = arg max_c p(c|x). Each feature is calculated by a function that is associated with a specific class c', and it takes the form of equation (2), where cp(x) represents some observable characteristic in the context (footnote 4). The conditional probability p(c|x) is defined by equation (3), where α_i is the parameter or weight of the feature i, K is the number of features defined, and Z(x) is a normalization factor that ensures that the sum of all conditional probabilities for this context is equal to 1.

\[ f(x, c) = \begin{cases} 1 & \text{if } c' = c \text{ and } cp(x) = \text{true} \\ 0 & \text{otherwise} \end{cases} \qquad (2) \]

\[ p(c \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)} \qquad (3) \]

The learning module produces the classifiers for each word using a corpus that is syntactically and semantically annotated. The module processes the learning corpus in order to define the functions that will apprise the linguistic features of each context.

For example, consider that we want to build a classifier for the noun interest using the POS label of the previous word as a feature and we also have the following three examples from the training corpus:

    ... the widespread interest_1 in the ...
    ... the best interest_5 of both ...
    ... persons expressing interest_1 in the ...

The learning module performs a sequential processing of this corpus, looking for the pairs <POS-label, sense>. Then, the following pairs are used to define three functions (each context has a vector composed of three features).

    <adjective,1> <adjective,5> <verb,1>

We can define another type of feature by merging the POS occurrences by sense:

    <adjective,verb,1> <adjective,5>

This form of defining the pairs means a reduction of feature space because all information (of some kind of linguistic data, e.g., POS label at position -1) about a sense is contained in just one feature. Obviously, the form of the feature function 2 must be adapted to Equation 4. Thus,

\[ W(c') = \{\text{data of sense } c'\} \]

\[ f_{(c', i)}(x, c) = \begin{cases} 1 & \text{if } c' = c \text{ and } CP(x) \in W(c') \\ 0 & \text{otherwise} \end{cases} \qquad (4) \]

4. The ME approach is not limited to binary features, but the optimization procedure used for the estimation of the parameters, the Generalized Iterative Scaling procedure, uses this kind of features.
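To make Equations 2 and 3 concrete, the following minimal sketch evaluates p(c|x) for the interest example with its three non-collapsed <POS-label, sense> features. The weights in `alphas` are invented for illustration and are not estimated values; the names `make_feature`, `p` and `classify` are our own and do not come from the system described here.

```python
from math import prod

def make_feature(target_class, predicate):
    """Binary feature in the style of Equation 2: 1 if c == target_class and predicate(x)."""
    return lambda x, c: 1 if c == target_class and predicate(x) else 0

# The three <POS-label, sense> pairs of the 'interest' example.
features = [
    make_feature("1", lambda x: x["prev_pos"] == "adjective"),
    make_feature("5", lambda x: x["prev_pos"] == "adjective"),
    make_feature("1", lambda x: x["prev_pos"] == "verb"),
]
alphas = [2.0, 1.0, 3.0]  # illustrative weights only (assumption), one per feature
classes = ["1", "5"]

def p(c, x):
    """Equation 3: p(c|x) = (1/Z(x)) * prod_i alpha_i ** f_i(x, c)."""
    def score(k):
        return prod(a ** f(x, k) for a, f in zip(alphas, features))
    z = sum(score(k) for k in classes)  # normalization factor Z(x)
    return score(c) / z

def classify(x):
    """cl(x) = arg max_c p(c|x)."""
    return max(classes, key=lambda k: p(k, x))

context = {"prev_pos": "adjective"}
print({c: p(c, context) for c in classes}, classify(context))
```

With these toy weights, a preceding adjective gives sense 1 a probability of 2/3 and sense 5 a probability of 1/3; in the real system the weights would come from the Generalized Iterative Scaling procedure mentioned in footnote 4.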
We will refer to the feature function expressed by Equation 4 as "collapsed features". The previous Equation 2 we call "non-collapsed features". These two feature definitions are complementary and can be used together in the learning phase.

Due to the nature of the disambiguation task, the number of times that a feature generated by the first type of function ("non-collapsed") is activated is very low, and the feature vectors have a large number of null values. The new function drastically reduces the number of features, with a minimal degradation in the evaluation results. In this way, more and new features can be incorporated into the learning process, compensating for the loss of accuracy.

Therefore, the classification module carries out the disambiguation of new contexts using the previously stored classification functions. When ME does not have enough information about a specific context, several senses may achieve the same maximum probability and thus the classification cannot be done properly. In these cases, the most frequent sense in the corpus is assigned. However, this heuristic is only necessary for a minimum number of contexts or when the set of linguistic attributes processed is very small.

3.2.1 Description of Features

The set of features defined for the training of the system is described in Figure 9 and is based on the features described by Ng and Lee (1996) and Escudero et al. (2000). These features represent words, collocations, and POS tags in the local context. Both "collapsed" and "non-collapsed" functions are used.

• 0: word form of the target word
• s: words at positions ±1, ±2, ±3
• p: POS-tags of words at positions ±1, ±2, ±3
• b: lemmas of collocations at positions (−2,−1), (−1,+1), (+1,+2)
• c: collocations at positions (−2,−1), (−1,+1), (+1,+2)
• km: lemmas of nouns at any position in context, occurring at least m% times with a sense
• r: grammatical relation to the ambiguous word
• d: the word that the ambiguous word depends on
• m: multi-word if identified by the parser
• L: lemmas of content-words at positions ±1, ±2, ±3 ("collapsed" definition)
• W: content-words at positions ±1, ±2, ±3 ("collapsed" definition)
• S, B, C, P, D and M: "collapsed" versions (see Equation 4)

Figure 9: Features Used for the Training of the System

Actually, each item in Figure 9 groups several sets of features. The majority of them depend on the nearest words (e.g., s comprises all possible features defined by the words occurring in each sample at positions w−3, w−2, w−1, w+1, w+2, w+3 related to the ambiguous word). Types denoted with capital letters are based on the "collapsed" function form; that is, these features simply recognize an attribute belonging to the training data.

Keyword features (km) are inspired by the work of Ng and Lee (1996). Noun filtering is done using frequency information for nouns co-occurring with a particular sense. For example, let us suppose m = 10 for a set of 100 examples of interest_4: if the noun bank is found 10 times or more at any position, then a feature is defined.

Moreover, new features have also been defined using other grammatical properties: relationship features (r) that refer to the grammatical relationship of the ambiguous word (subject, object, complement, ...) and dependency features (d and D) that extract the word related to the ambiguous one through the dependency parse tree.

3.2.2 Evaluation of the Maximum Entropy Method

In this subsection we present the results of our evaluation over the training and testing data of the Senseval-2 Spanish lexical-sample task. This corpus was parsed using the Conexor Functional Dependency Grammar parser for Spanish (Tapanainen & Järvinen, 1997).

The classifiers were built from the training data and evaluated over the test data. Table 7 shows which combination of groups of features works better for every POS and which works better for all words together.

              Accuracy   Feature selection
Nouns         0.683      LWSBCk5
Verbs         0.595      sk5
Adjectives    0.783      LWsBCp
ALL           0.671      0LWSBCk5

Table 7: Baseline: Accuracy Results of Applying ME on Senseval-2 Spanish Data

This work entails an exhaustive search looking for the most accurate combination of features. The values presented here are merely informative and indicate the maximum accuracy that the system can achieve with a particular set of features.

3.3 Improving ME accuracy

Our main goal is to find a method that will automatically obtain the best feature selection (Veenstra et al., 2000; Mihalcea, 2002; Suárez & Palomar, 2002b) from the training data. We performed a 3-fold cross-validation process. Data is divided into 3 folds; then, 3 tests are done, each one with 2 folds as training data and the remaining one as testing data. The final result is the average accuracy. We decided on just three tests because of the small size of the training data.
Then, we tested several combinations of features over the training data of the Senseval-2 Spanish lexical-sample and analyzed the results obtained for each word.

In order to perform the 3-fold cross-validation process on each word, some preprocessing of the corpus was done. For each word, all senses were uniformly distributed into the three folds (each fold contains one-third of the examples of each sense). Those senses that had fewer than three examples in the original corpus file were rejected and not processed.

Table 8 shows the best results obtained using three-fold cross-validation on the training data. Several feature combinations were tested in order to find the best set for each selected word. The purpose was to obtain the most relevant information for each word from the corpus rather than applying the same combination of features to all of them. Therefore, the information in the column Features lists only the feature selection with the best result.
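The fold construction described above can be sketched as follows. This is a minimal illustration, assuming that examples are (context, sense) pairs and that a round-robin assignment is an acceptable way of spreading each sense uniformly; `three_folds`, `cross_validate` and the `train_and_score` callback are hypothetical names, not part of the system described here.

```python
from collections import defaultdict

def three_folds(examples, k=3):
    """Split (context, sense) pairs into k folds, spreading each sense evenly."""
    by_sense = defaultdict(list)
    for example in examples:
        by_sense[example[1]].append(example)
    folds = [[] for _ in range(k)]
    for items in by_sense.values():
        if len(items) < k:
            continue  # senses with fewer than three examples are rejected
        for i, example in enumerate(items):
            folds[i % k].append(example)  # round-robin assignment (assumption)
    return folds

def cross_validate(examples, train_and_score, k=3):
    """Average the score of k runs, each trained on k-1 folds and tested on the other."""
    folds = three_folds(examples, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy data: six examples of sense 'a', three of 'b', two of 'c' (dropped).
data = [(f"ctx{i}", s) for i, s in enumerate("aaaaaabbbcc")]
print([len(fold) for fold in three_folds(data)])
```

In a feature-selection loop, `cross_validate` would be called once per candidate feature combination and the combination with the highest averaged accuracy kept for that word.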
Word           Features      Accur   MFS      Word           Features      Accur   MFS
autoridad,N    sbcp          0.589   0.503    clavar,V       sbcprdk3      0.561   0.449
bomba,N        0LWSBCk5      0.762   0.707    conducir,V     LWsBCPD       0.534   0.358
canal,N        sbcprdk3      0.579   0.307    copiar,V       0sbcprdk3     0.457   0.338
circuito,N     0LWSBCk5      0.536   0.392    coronar,V      sk5           0.698   0.327
corazón,N      0Sbcpk5       0.781   0.607    explotar,V     0LWSBCk5      0.593   0.318
corona,N       sbcp          0.722   0.489    saltar,V       LWsBC         0.403   0.132
gracia,N       0sk5          0.634   0.295    tocar,V        0sbcprdk3     0.583   0.313
grano,N        0LWSBCr       0.681   0.483    tratar,V       sbcpk5        0.527   0.208
hermano,N      0Sprd         0.731   0.602    usar,V         0Sprd         0.732   0.669
masa,N         LWSBCk5       0.756   0.455    vencer,V       sbcprdk3      0.696   0.618
naturaleza,N   sbcprdk3      0.527   0.424    brillante,A    sbcprdk3      0.756   0.512
operación,N    0LWSBCk5      0.543   0.377    ciego,A        0spdk5        0.812   0.565
órgano,N       0LWSBCPDk5    0.715   0.515    claro,A        0Sprd         0.919   0.854
partido,N      0LWSBCk5      0.839   0.524    local,A        0LWSBCr       0.798   0.750
pasaje,N       sk5           0.685   0.451    natural,A      sbcprdk10     0.471   0.267
programa,N     0LWSBCr       0.587   0.486    popular,A      sbcprdk10     0.865   0.632
tabla,N        sk5           0.663   0.488    simple,A       LWsBCPD       0.776   0.621
actuar,V       sk5           0.514   0.293    verde,A        LWSBCk5       0.601   0.317
apoyar,V       0sbcprdk3     0.730   0.635    vital,A        Sbcp          0.774   0.441
apuntar,V      0LWsBCPDk5    0.661   0.478

Table 8: Three-fold Cross-Validation Results on Senseval-2 Spanish Training Data: Best Averaged Accuracies per Word

Strings in each row represent the entire set of features used when training each classifier. For example, autoridad obtains its best result using nearest words, collocations of two lemmas, collocations of two words, and POS information, that is, s, b, c, and p features, respectively (see Figure 9). The column Accur (for "accuracy") shows the number of correctly classified contexts divided by the total number of contexts (because ME always classifies every context, precision is equal to recall). Column MFS shows the accuracy obtained when the most frequent sense is selected.
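The feature-selection strings in Table 8 can be read mechanically. The short sketch below splits such a string into the feature types of Figure 9, assuming that every type is a single character except keyword features, which are a `k` followed by the threshold m; the function name and the regular expression are our own illustration.

```python
import re

# One token per feature type: 'k' plus its numeric threshold, or a single character.
FEATURE_TOKEN = re.compile(r"[kK]\d+|[0A-Za-z]")

def parse_selection(code):
    """Split a feature-selection string such as 'sbcprdk3' into its feature types."""
    return FEATURE_TOKEN.findall(code)

def collapsed_types(code):
    """Feature types written in capital letters use the 'collapsed' definition."""
    return [t for t in parse_selection(code) if t[0].isupper()]

for code in ("sbcp", "0LWSBCk5", "sbcprdk10"):
    print(code, parse_selection(code), collapsed_types(code))
```

For instance, `0LWSBCk5` splits into the target word form, five collapsed feature types and keywords with m = 5.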
The data summarized in Table 8 reveal that using "collapsed" features in the ME method is useful; both "collapsed" and "non-collapsed" functions are used, even for the same word. For example, the adjective vital obtains the best result with "Sbcp" (the "collapsed" version of words in a window (−3..+3), collocations of two lemmas and two words in a window (−2..+2), and POS labels, in a window (−3..+3) too); we can here infer that single-word information is less important than collocations in order to disambiguate vital correctly.

The target word (feature 0) is useful for nouns, verbs, and adjectives, but many of the words do not use it for their best feature selection. In general, these words do not have a relevant relationship between shape and senses. On the other hand, POS information (p and P features) is selected less often. When comparing lemma features with word features (e.g., L versus W, and B versus C), they are complementary in the majority of cases. Grammatical relationships (r features) and word-word dependencies (d and D features) seem very useful, too, if combined with other types of attributes. Moreover, keywords (km features) are used very often, possibly due to the source and size of contexts of Senseval-2 Spanish lexical-sample data.

Table 9 shows the best feature selections for each part-of-speech and for all words. The data presented in Tables 8 and 9 were used to build four different sets of classifiers in order to compare their accuracy: MEfix uses the overall best feature selection for all words; MEbfs trains each word with its best selection of features (in Table 8); MEbfs.pos uses the best selection per POS for all nouns, verbs and adjectives, respectively (in Table 9); and, finally, vME is a majority voting system that has as input the answers of the preceding systems.

POS           Acc     Features    System
Nouns         0.620   LWSBCk5     MEbfs.pos
Verbs         0.559   sbcprdk3    MEbfs.pos
Adjectives    0.726   0spdk5      MEbfs.pos
ALL           0.615   sbcprdk3    MEfix

Table 9: Three-fold Cross-Validation Results on Senseval-2 Spanish Training Data: Best Averaged Accuracies per POS

Table 10 shows a comparison of the four systems. MEfix has the lowest results. This classifier applies the same set of types of features to all words. However, the best feature selection per word (MEbfs) is not the best, probably because more training examples are necessary. The best choice seems to be to select a fixed set of types of features for each POS (MEbfs.pos).

ALL                  Nouns                Verbs                Adjectives
0.677  MEbfs.pos     0.683  MEbfs.pos     0.583  vME           0.774  vME
0.676  vME           0.678  vME           0.583  MEbfs.pos     0.772  MEbfs.pos
0.667  MEbfs         0.661  MEbfs         0.583  MEfix         0.771  MEbfs
0.658  MEfix         0.646  MEfix         0.580  MEbfs         0.756  MEfix

MEfix: sbcprdk3 for all words
MEbfs: each word with its best feature selection
MEbfs.pos: LWSBCk5 for nouns, sbcprdk3 for verbs, and 0spdk5 for adjectives
vME: majority voting between MEfix, MEbfs.pos, and MEbfs

Table 10: Evaluation of ME Systems
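A majority voting combination such as vME can be sketched in a few lines. The sketch assumes one sense label per system for each context; the tie-breaking rule (fall back to the answer of one designated system) is our own assumption, since the text does not state how ties are resolved.

```python
from collections import Counter

def majority_vote(answers, fallback=0):
    """Return the sense chosen by most systems.

    answers: one sense label per system, e.g. [MEfix, MEbfs.pos, MEbfs].
    fallback: index of the system trusted when no sense wins outright (assumption).
    """
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    tied = [sense for sense, n in counts.items() if n == votes]
    return best if len(tied) == 1 else answers[fallback]

print(majority_vote(["1", "1", "2"]))        # two systems agree on sense 1
print(majority_vote(["1", "2", "3"], 1))     # full disagreement: trust system 1
```

With three voters, a strict majority exists whenever at least two systems agree, so the fallback is only reached when all three answers differ.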
While MEbfs predicts, for each word over the training data, which individually selected features could be the best ones when evaluated on the testing data, MEbfs.pos is an averaged prediction, a selection of features that, over the training data, performed a "good enough" disambiguation of the majority of words belonging to a particular POS. When this averaged prediction is applied to the real testing data, MEbfs.pos performs better than MEbfs.

Another important issue is that MEbfs.pos obtains an accuracy slightly better than the best possible evaluation result achieved with ME (see Table 7); that is, a best-feature-selection per POS strategy from training data guarantees an improvement on ME-based WSD.

In general, verbs are difficult to learn and the accuracy of the method for them is lower than for other POS; in our opinion, more information (knowledge-based, perhaps) is needed to build their classifiers. In this case, the voting system (vME), based on the agreement between the other three systems, does not improve accuracy.

Finally, in Table 11, the results of the ME method are compared with those systems that competed at Senseval-2 in the Spanish lexical-sample task (footnote 5). The results obtained by ME systems are excellent for nouns and adjectives, but not for verbs. However, when comparing ALL POS, the ME systems seem to perform comparably to the best Senseval-2 systems.
ALL                  Nouns                Verbs                Adjectives
0.713  jhu(R)        0.702  jhu(R)        0.643  jhu(R)        0.802  jhu(R)
0.682  jhu           0.683  MEbfs.pos     0.609  jhu           0.774  vME
0.677  MEbfs.pos     0.681  jhu           0.595  css244        0.772  MEbfs.pos
0.676  vME           0.678  vME           0.584  umd-sst       0.772  css244
0.670  css244        0.661  MEbfs         0.583  vME           0.771  MEbfs
0.667  MEbfs         0.652  css244        0.583  MEbfs.pos     0.764  jhu
0.658  MEfix         0.646  MEfix         0.583  MEfix         0.756  MEfix
0.627  umd-sst       0.621  duluth 8      0.580  MEbfs         0.725  duluth 8
0.617  duluth 8      0.612  duluth Z      0.515  duluth 10     0.712  duluth 10
0.610  duluth 10     0.611  duluth 10     0.513  duluth 8      0.706  duluth 7
0.595  duluth Z      0.603  umd-sst       0.511  ua            0.703  umd-sst
0.595  duluth 7      0.592  duluth 6      0.498  duluth 7      0.689  duluth 6
0.582  duluth 6      0.590  duluth 7      0.490  duluth Z      0.689  duluth Z
0.578  duluth X      0.586  duluth X      0.478  duluth X      0.687  ua
0.560  duluth 9      0.557  duluth 9      0.477  duluth 9      0.678  duluth X
0.548  ua            0.514  duluth Y      0.474  duluth 6      0.655  duluth 9
0.524  duluth Y      0.464  ua            0.431  duluth Y      0.637  duluth Y

Table 11: Comparison with the Spanish Senseval-2 systems

5. JHU(R) by Johns Hopkins University; CSS244 by Stanford University; UMD-SST by the University of Maryland; Duluth systems by the University of Minnesota - Duluth; UA by the University of Alicante.
3.4 Comparing Specification Marks Method with Maximum Entropy-based Method

The main goal of this section is to evaluate the Specification Marks Method (Montoyo & Suárez, 2001) and the Maximum Entropy-based Method (in particular, the MEfix system) on a common data set, to allow for direct comparisons. The individual evaluation of each method has been carried out on the noun set (17 nouns) of the Spanish lexical-sample task (Rigau et al., 2001) from Senseval-2 (footnote 6). Table 12 shows precision, recall and coverage of both methods.

      Precision   Recall   Coverage
SM    0.464       0.464    0.941
ME    0.646       0.646    1

Table 12: Comparison of ME and SM for nouns in Senseval-2 Spanish lexical sample

In order to study a possible cooperation between both methods, we count the cases in which the two methods return the correct sense for the same occurrence, those in which only one of the methods provides the correct sense and, finally, those in which neither provides the correct sense. A summary of the obtained results is shown in Table 13. These results clearly show that there is considerable room for improvement when combining both system outcomes. In fact, they also provide a possible upper bound on precision for this technology, which can be set to 0.798, the sum of the "Both OK" and "One OK" percentages (more than 15 percentage points higher than the current best system). Table 14 presents a complementary view: wins, ties and losses between ME and SM when each context is examined. Although ME performs better than SM, there are 122 cases (15%) solved only by the SM method.

           Contexts   Percentage
Both OK    240        0.300
One OK     398        0.498
Zero OK    161        0.202

Table 13: Correct classifications of ME and SM for nouns in Senseval-2 Spanish lexical sample

           Wins   Ties   Losses
ME - SM    267    240    122

Table 14: Wins, ties and losses of ME and SM systems for nouns in Senseval-2 Spanish lexical sample

6. For both Spanish and English Senseval-2 corpora, when applying the Specification Marks method we used the whole example as the context window for the target noun.
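The upper bound quoted in Section 3.4 follows directly from the counts in Table 13: a perfect combiner would answer correctly whenever at least one of the two methods does. The arithmetic can be checked with a few lines; the variable names are ours, while the counts and the ME precision are those of Tables 13 and 12.

```python
# Counts from Table 13 (nouns, Senseval-2 Spanish lexical sample).
both_ok, one_ok, zero_ok = 240, 398, 161
total = both_ok + one_ok + zero_ok

shares = {
    "Both OK": round(both_ok / total, 3),
    "One OK": round(one_ok / total, 3),
    "Zero OK": round(zero_ok / total, 3),
}

# An oracle that always picks the right method is correct in every
# context where at least one method is correct.
upper_bound = round((both_ok + one_ok) / total, 3)

me_precision = 0.646  # ME row of Table 12
gain = round(upper_bound - me_precision, 3)

print(total, shares, upper_bound, gain)
```

This gives 799 contexts, an upper bound of 0.798, and a margin of 0.152 over ME alone, which matches the "more than 15 percentage points" stated in the text.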
4. Experimental Work

In this section we attempt to confirm our hypothesis that both corpus-based and knowledge-based methods can improve the accuracy of each other. The first subsection shows the results of preprocessing the test data with the maximum entropy method (ME) in order to help the specification marks method (SM). Next, we test the opposite: whether preprocessing the test data with the domain heuristic can help the maximum entropy method to disambiguate more accurately.

The last experiment combines the vME system (the majority voting system) and the SM method. Actually, it relies on simply adding SM as one more heuristic to the voting scheme.

4.1 Integrating a Corpus-based WSD System into a Knowledge-based WSD System

This experiment was designed to study and evaluate whether the integration of a corpus-based system within a knowledge-based one helps improve word-sense disambiguation of nouns.

Therefore, ME can help SM by labelling some nouns of the context of the target word, that is, by reducing the number of possible senses of some nouns of the context. In fact, we reduce the search space of the SM method. This ensures that the sense of the target word will be the one most related to the noun senses labelled by ME.

In this case, we used the noun words from the English lexical-sample task from Senseval-2. ME helps SM by labelling some words from the context of the target word. These words were sense tagged using the SemCor collection as a learning corpus. We performed a three-fold cross-validation for all nouns having 10 or more occurrences. We selected those nouns that were disambiguated by ME with high precision, that is, nouns that had accuracy rates of 90% or more. The classifiers for these nouns were used to disambiguate the testing data. The total number of different noun classifiers (noun) activated for each target word across the testing corpus is shown in Table 15.
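The selection of reliable noun classifiers described above amounts to a simple filter over cross-validation statistics. The sketch below uses the two thresholds given in the text (at least 10 occurrences, at least 90% accuracy); the function name, the data layout and the example nouns and figures are invented for illustration.

```python
def select_reliable_nouns(cv_stats, min_occurrences=10, min_accuracy=0.90):
    """Keep nouns whose ME classifier is trusted to pre-label context words.

    cv_stats: {noun: (occurrences in the learning corpus, cross-validation accuracy)}
    """
    return sorted(
        noun
        for noun, (occurrences, accuracy) in cv_stats.items()
        if occurrences >= min_occurrences and accuracy >= min_accuracy
    )

# Invented statistics, for illustration only.
cv_stats = {
    "water": (120, 0.95),   # frequent and accurate: kept
    "bank": (300, 0.62),    # frequent but unreliable: dropped
    "ozone": (4, 1.00),     # accurate but too rare: dropped
    "oxygen": (15, 0.90),   # exactly at the accuracy threshold: kept
}
print(select_reliable_nouns(cv_stats))
```

Only the nouns returned by this filter would have their senses fixed in the context before SM is applied to the target word.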
Next, SM was applied, using all the heuristics for disambiguating the target words of the testing data, but with the advantage of knowing the senses of some nouns that formed the context of these target words.

Table 15 shows the results of precision and recall when SM is applied with and without first applying ME, that is, with and without fixing the sense of the nouns that form the context. A very small but consistent improvement was obtained through the complete test set (3.56% precision and 3.61% recall). Although the improvement is very low, this experiment empirically demonstrates that a corpus-based method such as maximum entropy can be integrated to help a knowledge-based system such as the specification marks method.

                                   Without fixed senses    With fixed senses
Target words   noun classifiers    Precision   Recall      Precision   Recall
art            63                  0.475       0.475       0.524       0.524
authority      80                  0.137       0.123       0.144       0.135
bar            104                 0.222       0.203       0.232       0.220
bum            37                  0.421       0.216       0.421       0.216
chair          59                  0.206       0.190       0.316       0.301
channel        32                  0.500       0.343       0.521       0.375
child          59                  0.500       0.200       0.518       0.233
church         50                  0.509       0.509       0.540       0.529
circuit        49                  0.356       0.346       0.369       0.360
day            136                 0.038       0.035       0.054       0.049
detention      22                  0.454       0.454       0.476       0.454
dyke           15                  0.933       0.933       0.933       0.933
facility       14                  0.875       0.875       1           1
fatigue        38                  0.236       0.230       0.297       0.282
feeling        48                  0.306       0.300       0.346       0.340
grip           38                  0.184       0.179       0.216       0.205
hearth         29                  0.321       0.310       0.321       0.310
holiday        23                  0.818       0.346       0.833       0.384
lady           40                  0.375       0.136       0.615       0.363
material       58                  0.343       0.338       0.359       0.353
mouth          51                  0.094       0.094       0.132       0.132
nation         25                  0.269       0.269       0.307       0.307
nature         37                  0.263       0.263       0.289       0.289
post           41                  0.312       0.306       0.354       0.346
restraint      31                  0.200       0.193       0.206       0.193
sense          37                  0.260       0.240       0.282       0.260
spade          17                  0.823       0.823       0.941       0.941
stress         37                  0.228       0.216       0.257       0.243
yew            24                  0.480       0.480       0.541       0.520
Total          1294                0.300       0.267       0.336       0.303

Table 15: Precision and Recall Results Using SM to Disambiguate Words, With and Without Fixing of Noun Sense

4.2 Integrating a Knowledge-based WSD system into a Corpus-based WSD system

In this case, we used only the domain heuristic to improve ME because this information can be added directly as domain features. The problem of data sparseness from which a WSD system based on features suffers could be increased by the fine-grained sense distinctions provided by WordNet. On the contrary, the domain information significantly reduces the word polysemy (i.e., the number of categories for a word is generally lower than the number of senses for the word) and the results obtained by this heuristic have better precision than those obtained by the whole SM method, which in turn obtains better recall.

As shown in subsection 3.1.7, the domain heuristic can annotate word senses characterized by their domains. Thus, these domains will be used as an additional type of feature for ME in a context window of ±1, ±2, and ±3 from the target word. In addition, the three most relevant domains were also calculated for each context and incorporated into the training in the form of features.
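The way domain labels can be turned into extra ME features may be sketched as follows. The input format (one domain label or None per token), the feature names, and in particular the use of plain frequency to rank the "most relevant" domains are assumptions of this sketch; the relevant-domain computation of the domain heuristic itself is not reproduced here.

```python
from collections import Counter

def domain_features(domains, target_index, top_n=3):
    """Build domain features for one context.

    domains: domain label of each token (None where no label was assigned);
             hypothetical output of a prior domain-disambiguation step.
    """
    features = {}
    # Domain labels at positions +-1, +-2, +-3 around the target word.
    for offset in (-3, -2, -1, 1, 2, 3):
        i = target_index + offset
        if 0 <= i < len(domains) and domains[i] is not None:
            features[f"dom[{offset:+d}]"] = domains[i]
    # Stand-in for the three most relevant domains of the whole context:
    # here simply the most frequent labels (assumption).
    counts = Counter(
        d for i, d in enumerate(domains) if d is not None and i != target_index
    )
    for rank, (domain, _) in enumerate(counts.most_common(top_n), start=1):
        features[f"topdom{rank}"] = domain
    return features

labels = ["biology", None, "biology", None, "ecology", "sport", "biology"]
print(domain_features(labels, target_index=3))
```

Each returned key/value pair would then be handed to the ME learner in the same way as the word, lemma and POS features of Figure 9.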
This experiment was also carried out on the English lexical-sample task data from Senseval-2, and ME was used to generate two groups of classifiers from the training data. The first group of classifiers used the corpus without domain information; the second used the corpus after it had been domain disambiguated with SM, incorporating the domain label of adjacent nouns and the three most relevant domains of the context. That is, the classifier was provided with a richer set of features (adding the domain features). However, in this case, we did not perform any feature selection.

The test data was disambiguated by ME twice, with and without SM domain labelling, using 0lWsbcpdm (see Figure 9) as the common set of features in order to perform the comparison. The results of the experiment are shown in Table 16.

The table shows that 7 out of 29 nouns obtained worse results when using the domains, whereas 13 obtained better results. In this case, however, we obtained only a very small improvement in precision of 2% (footnote 7).

7. This difference proves to be statistically significant when applying the test of the corrected difference of two proportions (Dietterich, 1998; Snedecor & Cochran, 1989).
Target words   Without domains   With domains   Improvement
art            0.667             0.778          0.111
authority      0.600             0.700          0.100
bar            0.625             0.615          -0.010
bum            0.865             0.919          0.054
chair          0.898             0.898
channel        0.567             0.597          0.030
child          0.661             0.695          0.034
church         0.560             0.600          0.040
circuit        0.408             0.388          -0.020
day            0.676             0.669          -0.007
detention      0.909             0.909
dyke           0.800             0.800
facility       0.429             0.500          0.071
fatigue        0.850             0.850
feeling        0.708             0.688          -0.021
grip           0.540             0.620          0.080
hearth         0.759             0.793          0.034
holiday        1.000             0.957          -0.043
lady           0.900             0.900
material       0.534             0.552          0.017
mouth          0.569             0.588          0.020
nation         0.720             0.720
nature         0.459             0.459
post           0.463             0.512          0.049
restraint      0.516             0.452          -0.065
sense          0.676             0.622          -0.054
spade          0.765             0.882          0.118
stress         0.378             0.378
yew            0.792             0.792
All            0.649             0.669          0.020

Table 16: Precision Results Using ME to Disambiguate Words, With and Without Domains (recall and precision values are equal)

We obtained important conclusions about the relevance of domain information for each word. In general, the larger improvements appear for those words having well-differentiated domains (spade, authority). Conversely, the word stress, with most senses belonging to the FACTOTUM domain, does not improve at all. For example, for spade, art and authority (with an accuracy improvement of 10 percentage points or more), domain data seems to be an important source of knowledge with information that is not captured by other types of features. For those words for which precision decreases by up to 6.5 percentage points, domain information is confusing. Three reasons can be put forward to explain this behavior: there is not a clear domain in the examples or they do not represent the context correctly, domains do not differentiate the senses appropriately, or the number of training examples is too low to perform a valid assessment. A cross-validation testing, if more examples were available, could be appropriate to perform a domain tuning for each word in order to determine which words must use this preprocess and which must not.
Nevertheless, the experiment empirically demonstrates that a knowledge-based method, such as the domain heuristic, can be integrated successfully into a corpus-based system, such as maximum entropy, to obtain a small improvement.

4.3 Combining Results with(in) a Voting System

In previous sections, we have demonstrated that it is possible to integrate two different WSD approaches. In this section we evaluate the performance when combining a knowledge-based system, such as specification marks, and a corpus-based system, such as maximum entropy, in a simple voting schema.
In the two previous experiments we attempted to provide more information by pre-disambiguating the data. Here, we use both methods in parallel and then we combine their classifications in a voting system, both for Senseval-2 Spanish and English lexical-sample tasks.

4.3.1 Senseval-2 Spanish lexical-sample task

vME+SM is an enrichment of vME: we added the SM classifier to the combination of the three ME systems in vME (see Section 3.3). The results on the Spanish lexical-sample task from Senseval-2 are shown in Table 17. Because SM only works with nouns, vME+SM improves accuracy for them only; on nouns it obtains the same score as JHU(R), while its overall score reaches second place.

ALL                  Nouns
0.713  jhu(R)        0.702  jhu(R)
0.684  vME+SM        0.702  vME+SM
0.682  jhu           0.683  MEbfs.pos
0.677  MEbfs.pos     0.681  jhu
0.676  vME           0.678  vME
0.670  css244        0.661  MEbfs
0.667  MEbfs         0.652  css244
0.658  MEfix         0.646  MEfix
0.627  umd-sst       0.621  duluth 8
0.617  duluth 8      0.612  duluth Z
0.610  duluth 10     0.611  duluth 10
0.595  duluth Z      0.603  umd-sst
0.595  duluth 7      0.592  duluth 6
0.582  duluth 6      0.590  duluth 7
0.578  duluth X      0.586  duluth X
0.560  duluth 9      0.557  duluth 9
0.548  ua            0.514  duluth Y
0.524  duluth Y      0.464  ua

Table 17: vME+SM in the Spanish lexical-sample task of Senseval-2

These results show that methods like SM and ME can be combined in order to achieve good disambiguation results. Our results are in line with those of Pedersen (2002), which also presents a comparative evaluation between the systems that participated in the Spanish and English lexical-sample tasks of Senseval-2. That work focuses on pairwise comparisons between systems to assess the degree to which they agree, and on measuring the difficulty of the test instances included in these tasks. If several systems are largely in agreement, then there is little benefit in combining them since they are redundant and they will simply reinforce each other.
However, if some systems disambiguate instances that others do not, then the systems are complementary and it may be possible to combine them to take advantage of the different strengths of each system to improve overall accuracy.

The results for nouns (only applying SM), shown in Table 18, indicate that SM has a low level of agreement with all the other methods. However, the measure of optimal combination is quite high, reaching 89% (1.00 − 0.11) for the pairing of SM and JHU. In fact, all seven of the other methods achieved their highest optimal combination value when paired with the SM method.

System pair for nouns   Both OK (a)   One OK (b)   Zero OK (c)   Kappa (d)
SM and JHU              0.29          0.32         0.11          0.06
SM and Duluth7          0.27          0.34         0.12          0.03
SM and DuluthY          0.25          0.35         0.12          0.01
SM and Duluth8          0.28          0.32         0.13          0.08
SM and Cs224            0.28          0.32         0.13          0.09
SM and Umcp             0.26          0.33         0.14          0.06
SM and Duluth9          0.26          0.31         0.16          0.14

Table 18: Optimal combination between the systems that participated in the Spanish lexical-sample tasks of Senseval-2

a. Percentage of instances where both systems' answers were correct.
b. Percentage of instances where only one answer is correct.
c. Percentage of instances where neither answer is correct.
d. The kappa statistic (Cohen, 1960) is a measure of agreement between multiple systems (or judges) that is scaled by the agreement that would be expected just by chance. A value of 1.00 suggests complete agreement, while 0.00 indicates pure chance agreement.

This combination of circumstances suggests that SM, being a knowledge-based method, is fundamentally different from the other (i.e., corpus-based) methods, and is able to disambiguate a certain set of instances where the other methods fail. In fact, SM is different in that it is the only method that uses the structure of WordNet.

4.3.2 Senseval-2 English lexical-sample task

The same experiment was done on Senseval-2 English lexical-sample task data and the results are shown in Table 19. The details of how the different systems were built can be consulted in Section 3.2.

Again, we can see in Table 19 that BFS per POS is better than per word, mainly for the same reasons explained in Section 3.3.

Nevertheless, the improvement on nouns by using the vME+SM system is not as high as for the Spanish data. The differences between the two corpora significantly affect the precision values that can be obtained.
The English data includes multi-words and takes its sense inventory from WordNet, while the Spanish data is smaller and relies on a dictionary built specifically for the task, with a lower degree of polysemy.

The results of vME+SM are comparable to those of the systems presented at Senseval-2, where the best system (Johns Hopkins University) reported 64.2% precision (68.2%, 58.5% and 73.9% for nouns, verbs and adjectives, respectively).

Comparing these results with those obtained in Section 4.2, we also see that a voting system using the best feature selection for ME together with Specification Marks (vME+SM) and a non-optimized ME with the relevant domain heuristic obtain very similar performance. That is, it seems that we obtain comparable performance either by combining different classifiers resulting from a feature selection process or by using a richer set of features (adding the domain features), the latter with much less computational overhead.

            All     Nouns   Verbs   Adjectives
MEfix       0.601   0.656   0.519   0.696
MEbfs       0.606   0.658   0.519   0.696
MEbfs.pos   0.609   0.664   0.519   0.696
vME+SM      0.619   0.667   0.535   0.707

MEfix: 0mcbWsdrvK3 for all words
MEbfs: each word with its best feature selection
MEbfs.pos: 0Wsrdm for nouns, 0sbcprdmK10 for verbs, and 0mcbWsdrvK3 for adjectives
vME+SM: majority voting between MEfix, MEbfs.pos, MEbfs, and Specification Marks

Table 19: Precision Results Using Best Feature Selection for ME and Specification Marks on Senseval-2 English lexical-sample task data

This analysis of the results from the Senseval-2 English and Spanish lexical-sample tasks demonstrates that knowledge-based and corpus-based WSD systems can cooperate and can be combined to obtain improved WSD systems. The results empirically demonstrate that the combination of both approaches outperforms each of them individually, showing that the two approaches can be considered complementary.

5. Conclusions

The main hypothesis of this work is that WSD requires different kinds of knowledge sources (linguistic information, statistical information, structural information, etc.) and techniques. The aim of this paper was to explore some methods of collaboration between complementary knowledge-based and corpus-based WSD methods. Two complementary methods have been presented: specification marks (SM) and maximum entropy (ME). Individually, both have benefits and drawbacks. We have shown that both methods can collaborate to obtain better results on WSD.

In order to demonstrate our hypothesis, three different schemes for combining both approaches have been presented. We have presented different mechanisms for combining information sources around knowledge-based and corpus-based WSD methods.
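To make the voting scheme concrete, the sketch below shows one way a majority vote over four classifiers (as in the vME+SM row of Table 19) could be implemented in Python, together with a precision measure computed over attempted instances only. The function names, the use of `None` for an abstaining classifier, and the tie-breaking rule (the earliest-listed classifier among the tied answers wins) are assumptions made for illustration; they are not taken from the description of the original system.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the sense proposed by most classifiers.

    answers: one sense label per classifier, in a fixed priority order,
    with None meaning that the classifier abstains (assumed convention).
    Ties are broken in favour of the earliest-listed classifier
    (an assumption of this sketch).
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # no classifier answered
    top = max(votes.values())
    for a in answers:
        if a is not None and votes[a] == top:
            return a


def precision(predicted, gold):
    """Correct answers divided by attempted answers (None = not attempted)."""
    attempted = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not attempted:
        return 0.0
    return sum(p == g for p, g in attempted) / len(attempted)
```

For a given instance, the four answers would be passed in a fixed order, for example `majority_vote([me_fix, me_bfs_pos, me_bfs, sm])`, where the four variable names are hypothetical placeholders for the outputs of the individual classifiers.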
We have also shown that the combination of both approaches outperforms each of the methods individually, demonstrating that the two approaches can be considered complementary. Finally, we have shown that a knowledge-based method can help a corpus-based method to better perform the disambiguation process, and vice versa.

In order to help the specification marks method, ME disambiguates some nouns in the context of the target word. ME selects these nouns by means of a previous analysis of training data in order to identify which ones seem to be disambiguated with high accuracy. This preprocessing step fixes the senses of some nouns, reducing the search space of the knowledge-based method. In turn, ME is helped by SM, which provides domain information for the nouns in the contexts. This information is incorporated into the learning process in the form of features.

By comparing the accuracy of both methods, with and without the contribution of the other, it was demonstrated that such schemes for combining WSD methods are possible and successful.

Finally, we presented a voting system for nouns that included four classifiers, three of them based on ME and one based on SM. This cooperation scheme obtained the best score for nouns when compared with the systems submitted to the Senseval-2 Spanish lexical-sample task, and results comparable to those submitted to the Senseval-2 English lexical-sample task.

We are presently studying possible improvements in the collaboration between these methods, both by extending the information that the two methods provide to each other and by taking advantage of the merits of each one.

Acknowledgments

The authors wish to thank the anonymous reviewers of the Journal of Artificial Intelligence Research and COLING 2002, the 19th International Conference on Computational Linguistics, for helpful comments on earlier drafts of the paper. An earlier paper (Suárez & Palomar, 2002b) about the corpus-based method (subsection 3.2) was presented at COLING 2002.

This research has been partially funded by the Spanish Government under project CICyT number TIC2000-0664-C02-02 and PROFIT number FIT-340100-2004-14, the Valencia Government under project number GV04B-276, and the EU-funded project MEANING (IST-2001-34460).

References

Agirre, E., & Martinez, D. (2001a). Decision lists for English and Basque. In Proceedings of the SENSEVAL-2 Workshop, in conjunction with ACL'2001/EACL'2001, Toulouse, France.
Agirre, E., & Martinez, D. (2001b). Knowledge sources for word sense disambiguation. In Proceedings of the International Conference on Text, Speech and Dialogue (TSD'2001), Železná Ruda, Czech Republic.
Agirre, E., & Rigau, G. (1996). Word Sense Disambiguation using Conceptual Density. In Proceedings of the 16th International Conference on Computational Linguistics (COLING'96), Copenhagen, Denmark.
Berger, A. L., Pietra, S. A. D., & Pietra, V. J. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22 (1), 39–71.
Bruce, R., & Wiebe, J. (1994). Word sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'1994), pp. 139–145, Las Cruces, US.
Cabezas, C., Resnik, P., & Stevens, J. (2001). Supervised Sense Tagging using Support Vector Machines. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse, France.
Cardie, C., & Mooney, R. J. (1999). Guest editors' introduction: Machine learning and natural language. Machine Learning, 34 (1-3), 5–9.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educ. Psychol. Meas., 20, 37–46.
Cowie, J., Guthrie, J., & Guthrie, L. (1992). Lexical disambiguation using simulated annealing. In Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, pp. 359–365, Nantes, France.
Cuadros, M., Atserias, J., Castillo, M., & Rigau, G. (2004). Automatic acquisition of sense examples using ExRetriever. In IBERAMIA Workshop on Lexical Resources and The Web for Word Sense Disambiguation, Puebla, Mexico.
Dang, H. T., yi Chia, C., Palmer, M., & Chiou, F.-D. (2002). Simple features for Chinese word sense disambiguation. In Chen, H.-H., & Lin, C.-Y. (Eds.), Proceedings of the 19th International Conference on Computational Linguistics (COLING'2002).
Dietterich, T. G. (1998). Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation, 10 (7), 1895–1923.
Escudero, G., Màrquez, L., & Rigau, G. (2000). Boosting applied to word sense disambiguation. In Proceedings of the 12th Conference on Machine Learning ECML2000, Barcelona, Spain.
Fellbaum, C. (Ed.). (1998). WordNet. An Electronic Lexical Database. The MIT Press.
Fernández-Amorós, D., Gonzalo, J., & Verdejo, F. (2001). The Role of Conceptual Relations in Word Sense Disambiguation. In Proceedings 6th International Conference on Application of Natural Language to Information Systems (NLDB'2001), pp. 87–98, Madrid, Spain.
Hoste, V., Daelemans, W., Hendrickx, I., & van den Bosch, A. (2002). Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation. In Proceedings of the ACL'2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pp. 95–101, PA, USA.
Ide, N., & Véronis, J. (1998). Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics, 24 (1), 1–40.
Leacock, C., Chodorow, M., & Miller, G. (1998). Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, Special Issue on WSD, 24 (1).
Lesk, M. (1986). Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 1986 SIGDOC Conference, Association for Computing Machinery, pp. 24–26, Toronto, Canada.
Magnini, B., & Strapparava, C. (2000). Experiments in Word Domain Disambiguation for Parallel Texts. In Proceedings of the ACL Workshop on Word Senses and Multilinguality, Hong Kong, China.
Magnini, B., Strapparava, C., Pezzulo, G., & Gliozzo, A. (2002). The Role of Domain Information in Word Sense Disambiguation. Natural Language Engineering, 8 (4), 359–373.
Manning, C., & Schütze, H. (Eds.). (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
McRoy, S. W. (1992). Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18 (1), 1–30.
Mihalcea, R. (2002). Instance based learning with automatic feature selection applied to word sense disambiguation. In Chen, H.-H., & Lin, C.-Y. (Eds.), Proceedings of the 19th International Conference on Computational Linguistics (COLING'2002).
Mihalcea, R., & Moldovan, D. (1999). A Method for word sense disambiguation of unrestricted text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL'99, pp. 152–158, Maryland, USA.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38 (11), 39–41.
Miller, G. A., Leacock, C., Tengi, R., & Bunker, T. (1993). A Semantic Concordance. In Proceedings of ARPA Workshop on Human Language Technology, pp. 303–308, Plainsboro, New Jersey.
Mitchell, T. M. (Ed.). (1997). Machine Learning. McGraw Hill.
Montoyo, A., & Palomar, M. (2001). Specification Marks for Word Sense Disambiguation: New Development. In Gelbukh, A. F. (Ed.), CICLing, Vol. 2004 of Lecture Notes in Computer Science, pp. 182–191. Springer.
Montoyo, A., & Suárez, A. (2001). The University of Alicante word sense disambiguation system. In Preiss & Yarowsky (Preiss & Yarowsky, 2001), pp. 131–134.
Montoyo, A., Palomar, M., & Rigau, G. (2001). WordNet Enrichment with Classification Systems. In Proceedings of WordNet and Other Lexical Resources: Applications, Extensions and Customisations Workshop (NAACL-01), The Second Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 101–106, Carnegie Mellon University, Pittsburgh, PA, USA.
Montoyo, A., Vazquez, S., & Rigau, G. (2003). Método de desambiguación léxica basada en el recurso léxico Dominios Relevantes. Procesamiento del Lenguaje Natural, 31, 141–148.
Ng, H. (1997). Exemplar-Based Word Sense Disambiguation: Some Recent Improvements. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, EMNLP.
Ng, H. T., & Lee, H. B. (1996). Integrating multiple knowledge sources to disambiguate word senses: An exemplar-based approach. In Joshi, A., & Palmer, M. (Eds.), Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, San Francisco. Morgan Kaufmann Publishers.
Pedersen, T. (2002). Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of Senseval-2. In Proceedings of the Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, ACL'2002, Philadelphia, USA.
Preiss, J., & Yarowsky, D. (Eds.). (2001). Proceedings of SENSEVAL-2, Toulouse, France. ACL-SIGLEX.
Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.
Rigau, G., Agirre, E., & Atserias, J. (1997). Combining unsupervised lexical knowledge methods for word sense disambiguation. In Proceedings of the joint 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL'97), Madrid, Spain.
Rigau, G., Taulé, M., Fernández, A., & Gonzalo, J. (2001). Framework and results for the Spanish Senseval. In Preiss & Yarowsky (Preiss & Yarowsky, 2001), pp. 41–44.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical Methods (8th edition). Iowa State University Press, Ames, IA.
Snyder, B., & Palmer, M. (2004). The English all-words task. In Proceedings of the 3rd ACL workshop on the Evaluation of Systems for the Semantic Analysis of Text (SENSEVAL-3), Barcelona, Spain.
Suárez, A., & Palomar, M. (2002a). Feature selection analysis for maximum entropy-based WSD. In Gelbukh, A. F. (Ed.), CICLing, Vol. 2276 of Lecture Notes in Computer Science, pp. 146–155. Springer.
Suárez, A., & Palomar, M. (2002b). A maximum entropy-based word sense disambiguation system. In Chen, H.-H., & Lin, C.-Y. (Eds.), Proceedings of the 19th International Conference on Computational Linguistics (COLING'2002), pp. 960–966.
Sussna, M. (1993). Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the Second International Conference on Information and Knowledge Base Management, CIKM'93, pp. 67–74, Arlington, VA.
Tapanainen, P., & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 64–71.
Towell, G. G., & Voorhees, E. M. (1998). Disambiguating highly ambiguous words. Computational Linguistics, 24 (1), 125–145.
Veenstra, J., den Bosch, A. V., Buchholz, S., Daelemans, W., & Zavrel, J. (2000). Memory-based word sense disambiguation. Computers and the Humanities, Special Issue on SENSEVAL, 34 (1–2), 171–177.
Wilks, Y., Fass, D., Guo, C.-M., McDonald, J., Plate, T., & Slator, B. (1993). Providing machine tractable dictionary tools. In Pustejovsky, J. (Ed.), Semantics and the lexicon, pp. 341–401. Kluwer Academic Publishers.
Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'1994), Las Cruces, NM.