Text mining: a new frontier for lossless compression

Witten, I. H., Bray, Z., Mahoui, M., Teahan, W. J. (1999) Proc Data Compression Conference, 198-207, IEEE Press, Los Alamitos, CA.

Data mining, a burgeoning new technology, is about looking for patterns in data. Likewise, text mining is about looking for patterns in text. It may be defined as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information. The motivation for trying to extract information from it is compelling, even if success is only partial. This paper aims to promote text compression as a key technology for text mining.