The original source documents are available from the multi-lingual HTML project.
Greenstone automatically attempts to identify the language and character encoding of each source document at collection build time. To do this, it uses a modified version of TextCat by Gertjan van Noord ([email protected]). You can learn more about TextCat and download its source code from http://odur.let.rug.nl/~vannoord/TextCat.
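For readers curious about how this kind of language identification can work, here is a minimal sketch of the character n-gram ranking approach (Cavnar and Trenkle) that TextCat is based on. It is an illustration only, not Greenstone's or TextCat's actual code: the function names, the profile size, and the two tiny sample texts are all assumptions made for the example, and a real system builds its language profiles from much larger texts.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the text's most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries so prefixes and suffixes count
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum how far each document n-gram's rank is from its rank in the language profile."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)  # fixed cost for n-grams the language profile lacks
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Hypothetical toy training texts; far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the sun",
    "french": "le renard brun rapide saute par dessus le chien paresseux et le chat dort au soleil",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the closest profile. With short or mixed-language documents the guess can easily be wrong, which is why the identification is described above as an attempt.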
There are three ways to find information in this collection: