New Zealand Digital Library Project members have developed a range of
practical software packages in the course of their research. Much of this
software is available for download.
Digital libraries and indexing
- Greenstone is the digital
library system that generates most of the pages of this website. It is freely
available under the GNU General Public License, and has been adopted by
numerous other projects. It is used to disseminate information by
humanitarian organisations including Global Help Projects and United
Nations organisations.
- Our website hosts exotic collections, humanitarian collections, and reference collections.
- Other websites mirror these collections, and host many others.
- Greenstone is available for download.
- MG is an enhancement of the Managing Gigabytes full-text retrieval
system that provides flexible stemming methods, weighting terms, term
frequencies, merged indexes, machine-independent indexes, and a port to
MS-DOS.
- PreScript converts PostScript to plain
ASCII or HTML. It detects paragraph boundaries, removes hyphenation, and
interprets many ligatures.
Extracting data and metadata
- Sequitur is a method for
inferring compositional hierarchies from strings by detecting repetition
and factoring it out of the string by forming rules in a grammar. Sequitur
is useful for recognizing lexical structure in strings, and excels at very
long sequences.
- Kea is a program for
automatically extracting keywords and keyphrases from the full text of
documents. Candidate keyphrases are identified using rudimentary lexical
processing, features are computed for each candidate, and machine learning
is used to determine which candidates should be assigned as keyphrases.
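The first two stages of the Kea description above (identifying candidates, then computing features for each) can be sketched as follows. This is a minimal illustration under stated assumptions, not Kea's actual code: the stopword list, the n-gram candidate rule, the two features (TF×IDF and relative first occurrence), and all function names are chosen here for illustration. The machine learning stage is left out; in a full system a classifier trained on documents with known keyphrases would consume these feature values.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Kea's actual list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "from", "are", "by", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_phrases(tokens, max_len=3):
    """Yield (phrase, start_index) for every n-gram of up to max_len words
    that neither starts nor ends with a stopword.
    Punctuation is already stripped, so phrases may span sentence boundaries."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram), i


def candidate_features(document, corpus, max_len=3):
    """Return {phrase: (tfidf, first_occurrence)} for one document.

    tfidf: frequency of the phrase in the document, scaled up when the
           phrase is rare across the corpus.
    first_occurrence: relative position (0 to 1) of the first appearance.
    """
    tokens = tokenize(document)
    counts = Counter()
    first_seen = {}
    for phrase, pos in candidate_phrases(tokens, max_len):
        counts[phrase] += 1
        first_seen.setdefault(phrase, pos)

    # Set of candidate phrases found in each corpus document.
    corpus_phrase_sets = [
        {p for p, _ in candidate_phrases(tokenize(doc), max_len)} for doc in corpus
    ]

    features = {}
    for phrase, count in counts.items():
        doc_freq = sum(phrase in phrases for phrases in corpus_phrase_sets)
        tf = count / len(tokens)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq))  # smoothed IDF
        features[phrase] = (tf * idf, first_seen[phrase] / len(tokens))
    return features


# Hypothetical two-document corpus, invented for this example.
corpus = [
    "Digital libraries store the documents. Digital libraries index the documents.",
    "Compression makes text smaller.",
]
features = candidate_features(corpus[0], corpus)
for phrase, (tfidf, first) in sorted(features.items(), key=lambda kv: -kv[1][0])[:5]:
    print(f"{phrase}: tfidf={tfidf:.3f}, first_occurrence={first:.2f}")
```

Phrases that occur often in one document, rarely elsewhere, and early in the text receive the most distinctive feature values, which is the kind of signal a learned model can use to separate keyphrases from other candidates.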
Text mining
Browsing interfaces
- The 3D Book Visualizer is a suite of programs for creating and interacting with a three-dimensional simulation of a paper-based book.
It supports these interactive features:
  - Spinning the book around
  - Zooming in and out
  - Turning a single page or a wodge of pages
  - Flipping through key pages
  - Switching between handling mode and reading mode
It supports the PDF and DjVu document formats.
- Phind is an interface for browsing
the phrases that occur in a collection. The phrases form an approximation
of the topics covered. They are extracted from the noun phrases occurring
in the text, so nonsense phrases and phrases with very little information
content are excluded. Each phrase is part of a hierarchy, and the user can
browse more specialised topics, or retrieve documents that contain the
phrase, at any point.
- The collage applet
dynamically displays a given set of images. When an image is
clicked, a new browser window opens and the associated URL is displayed.
The applet can be used in two different contexts: either within the Greenstone
Digital Library Software or externally using a directory of images and associated links.
Word segmentation
- Word segmentation is
designed to find word boundaries in languages like Chinese and Japanese,
which are (unlike English) written without spaces or other word delimiters
(except for punctuation marks). It plays a significant role in
applications that use the word as the basic unit, because
machine-readable Chinese text is invariably stored in unsegmented form.
- We have implemented a WWW
interface for segmenting Chinese text.
- If your web browser does not support Chinese text, illustrations of
the transformation are available.
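To make the segmentation task concrete, here is a minimal sketch of one simple approach: greedy forward maximum matching against a word list. This is not necessarily how the project's own segmenter works; dictionary matching is swapped in here only because it is short enough to show in full. The function name, the `max_word_len` parameter, and the tiny lexicon are all assumptions made for the example.

```python
def segment(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon word that starts there;
    if no lexicon word matches, emit a single character and move on.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for length in range(longest, 0, -1):
            piece = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


# Hypothetical lexicon, invented for this example.
lexicon = {"我们", "在", "北京", "大学", "北京大学"}
print(segment("我们在北京大学", lexicon))  # ['我们', '在', '北京大学']
```

Because the matcher always prefers the longest word, it picks 北京大学 over 北京 followed by 大学. The same greediness can choose wrongly when a long dictionary word overlaps the correct boundary, which is one reason more sophisticated, statistically trained segmenters exist.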
Others
New Zealand Digital Library Project
Department of Computer Science,
University of Waikato,
New Zealand
July 2000