About this collection

This collection demonstrates Greenstone's language identification ability. It is based on a collection of Japanese folktales that have been translated into Chinese, English, French, German, Italian, Japanese and Spanish (note that Greenstone also supports many other languages).

The original source documents are available from the multi-lingual HTML project.

Greenstone automatically attempts to extract the language and encoding of each source document at collection build time. To do this it uses a modified version of TextCat by Gertjan van Noord. You can learn more about TextCat, and download the source code, from
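To give a feel for how this kind of language identification works, here is a minimal sketch of the n-gram profile technique (Cavnar and Trenkle) that TextCat is generally described as implementing. It is not Greenstone's modified TextCat code: the function names (ngram_profile, out_of_place, identify), the tiny SAMPLES texts and the parameter values are all assumptions made for illustration. Real language profiles are built from much larger training texts, and this sketch works on Unicode characters and ignores encoding detection entirely.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    # Tokens are runs of letters; each is padded with "_" to mark word edges.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the profile of text."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Hypothetical, deliberately tiny training samples (illustration only).
SAMPLES = {
    "en": "Once upon a time there lived an old man and an old woman in a small village.",
    "fr": "Il était une fois un vieil homme et une vieille femme qui vivaient dans un petit village.",
    "de": "Es war einmal ein alter Mann und eine alte Frau, die in einem kleinen Dorf lebten.",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling identify(some_text, PROFILES) returns the key of the closest profile. With samples this small the result is unreliable for arbitrary input; the point is only to show the ranking-and-distance idea behind the language metadata that the collection is built on.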

How to find information in the folktales: language extraction demo collection

There are three ways to find information in this collection:

  • search for particular words that appear in the text by clicking the Search button
  • browse documents by Title by clicking the Titles button
  • browse documents by Language by clicking the Languages button