text mining in a digital library Re: [greenstone-users] Sharing GSDL research

From Karen E. Medina
Date Sun, 26 Sep 2004 20:25:53 -0500
Subject text mining in a digital library Re: [greenstone-users] Sharing GSDL research
In-Reply-To (5-1-1-6-0-20040926185250-00b2e470-cwpanama-net)
(I'm not associated with Greenstone), but the paper sounded interesting.
There is no abstract for the paper, but here is the Introduction and the
full citation in case you wanted to get the article through Interlibrary
Loan. (You had the title of the journal slightly wrong.)

Ian H. Witten, Katherine J. Don, Michael Dewsnip, & Valentin Tablan.
(2004). Text mining in a digital library. International Journal on Digital
Libraries. Volume 4, number 1. August 2004. pp. 56-59. Publisher:
Springer-Verlag Heidelberg. ISSN: 1432-5012 (Paper) 1432-1300 (Online).
DOI: 10.1007/s00799-003-0066-4

1 Introduction
Digital librarians strive to add value to the collections
they create and maintain. One way is through selectivity:
a carefully chosen set of authoritative documents in
a particular topic area is far more useful to those working
in the area than a huge, unfocused collection (like the
Web). Another is by augmenting the collection with high-quality
metadata, which supports activities of searching
and browsing in a uniform and useful way. A third way,
and our topic here, is to enrich the documents by examining
their content, extracting information, and using it to
enhance the ways they can be located and presented.
Text mining is a burgeoning new field that attempts
to glean meaningful information from natural-language
text. It may be loosely characterized as the process
of analyzing text to extract information that is useful
for particular purposes. It most commonly targets text
whose function is the communication of factual information
or opinions, and the motivation for trying to extract
information from such text automatically is compelling
even if success is only partial. "Text mining" (sometimes
called "text data mining" [4]) defies tight definition but
encompasses a wide range of activities: text summarization;
document retrieval; document clustering; text categorization;
language identification; authorship ascription;
identifying phrases, phrase structures, and key phrases;
extracting "entities" such as names, dates, and abbreviations;
locating acronyms and their definitions; filling
predefined templates with extracted information; and
even learning rules from such templates [8].
Techniques of text mining have much to offer digital
libraries and their users. Here we describe the marriage
of a widely used digital library system (Greenstone)
with a development environment for text mining (GATE)
to enrich the library reader's experience. The work is in
progress: one level of integration has been demonstrated
and another is planned. The project has been greatly facilitated
by the fact that both systems are publicly available
under the GNU public license and, in addition, this
means that the benefits gained by leveraging text mining
techniques will accrue to all Greenstone users.

