1 Using language models for generic entity extraction Ian H. Witten, Zane Bray, Malika Mahoui, W.J. TeahanComputer Science University of WaikatoHamilton, New Zealand ihw@cs.waikato.ac.nz Abstract This paper describes the use of statisticallanguage modeling techniques, such as are commonly used for text compression, to extractmeaningful, low-level, information about the location of semantic tokens, or "entities," in text.We begin by marking up several different token types in training documents--for example,people's names, dates and time periods, phone numbers, and sums of money. We form alanguage model for each token type and examine how accurately it identifies new tokens. We thenapply a search algorithm to insert token boundaries in a way that maximizes compressionof the entire test document. The technique can be applied to hierarchically-defined tokens, leadingto a kind of "soft parsing" that will, we believe, be able to identify structured items such asreferences and tables in html or plain text, based on nothing more than a few marked-up examplesin training documents. 1. INTRODUCTION Text mining is about looking for patterns in text, and maybe defined as the process of analyzing text to extract information that is useful for particular purposes.Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with.Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information.The motivation for trying to extract information from it is compelling--even if success is only partial. Text mining is possible because you do not have tounderstand text in order to extract useful information from it. Here are four examples. First, if only names could beidentified, links could be inserted automatically to other places that mention the same name--links that are"dynamically evaluated" by calling upon a search engine to bind them at click time. Second, actions can beassociated with different types of data, using either explicit programming or programming-by-demonstrationtechniques. A day/time specification appearing anywhere within one's email could be associated with diary actionssuch as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in thetext could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format,allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth,an agent could monitor incoming newswire stories for company names and collect documents that mentionthem--an automated press clipping service. In all these examples, the key problem is to recognizedifferent types of target fragments, which we will call tokens or "entities". This is really a kind of languagerecognition problem: we have a text made up of different sublanguages (for personal names, company names, dates,table entries, and so on) and seek to determine which parts are expressed in which language. The information extraction research community (of whichwe were, until recently, unaware) has studied these tasks and reported results at annual Message UnderstandingConferences (MUC). For example, "named entities" are defined as proper names and quantities of interest,including personal, organization, and location names, as well as dates, times, percentages, and monetary amounts(Chinchor, 1999). The standard approach to this problem is manual:tokenizers and grammars are hand-designed for the particular data being extracted. Looking at currentcommercial state-of-the-art text mining software, for example, IBM's Intelligent Miner for Text (Tkach, 1997)uses specific recognition modules carefully programmed for the different data types, while Apple's data detectors(Nardi et al., 1998) uses language grammars. The Text Tokenization Tool of Grover et al . (1999) is anotherexample, and a demonstration version is available on the Web. The challenge for machine learning is to use