[greenstone-users] Using UTF-8 filenames when creating a collection

From George Giannakopoulos
Date Mon Mar 3 23:55:00 2008
Subject [greenstone-users] Using UTF-8 filenames when creating a collection
Hello, all.

The problems:
a) I have a different number of texts in the aut
b) All files with non-latin original filenames, e.g. "?????????218.pdf"
(greek text), do not appear in the indexes (in the web interface), or
the search page.

The settings:
I have just created a GS collection from a set of several hundred PDF
files whose filenames are encoded in UTF-8 rather than a Latin encoding.
Not all files have been enriched with metadata.
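
To compare what is on disk with what shows up in the indexes, the PDF
files with non-ASCII names can be counted first. This is only a minimal
sketch: the function names are made up for illustration, and the folder
passed to pdf_names is assumed to be the collection's import folder.

```python
import os


def non_ascii_names(names):
    """Return the names that contain at least one non-ASCII character."""
    return [name for name in names if not name.isascii()]


def pdf_names(root):
    """Collect the names of all .pdf files below root (case-insensitive)."""
    found = []
    for _dirpath, _dirnames, filenames in os.walk(root):
        found.extend(f for f in filenames if f.lower().endswith(".pdf"))
    return sorted(found)
```

Calling non_ascii_names(pdf_names(folder)) on the import folder gives the
list of files that would be expected to be affected by problem (b).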

The PDF files contain no text; they are purely multi-page scanned images.
The documents have been annotated with metadata (author, title, subject
and keywords, and so forth) through the GLI enrich process. The dc.*
metadata set has been used, in particular the Subject and Keywords,
Title and Creator fields.

As I only need the search and indices to point to the original file,
without any text indexing, I have used the following plugins (in the
order and with the parameters shown):
UnknownPlug -assoc_field Source -file_format PDF -mime_type
application/pdf -process_exp (?i).pdf
MetadataXMLPlug (and I have tried parameters "-input_encoding auto
-default_encoding utf8" as well)
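
As a quick check that the -process_exp pattern itself does not reject the
non-Latin filenames, it can be tried against a few sample names. This is
a sketch in Python, assuming Greenstone applies the expression as a Perl
regular expression; for a pattern this simple the two engines should
agree. The sample names and the stricter pattern are hypothetical and are
not part of the collection configuration.

```python
import re

# The pattern exactly as given in the plugin options above.
ORIGINAL = re.compile(r"(?i).pdf")
# Hypothetical stricter variant: a literal dot, anchored at the end.
STRICTER = re.compile(r"(?i)\.pdf$")


def accepted(pattern, names):
    """Return the names in which the pattern matches anywhere."""
    return [name for name in names if pattern.search(name)]


# Made-up sample names, including one with Greek characters.
SAMPLES = ["report.PDF", "αβγ218.pdf", "xpdf-notes.txt"]
```

With these samples, accepted(ORIGINAL, SAMPLES) keeps all three names,
because the unescaped dot matches any character, while
accepted(STRICTER, SAMPLES) keeps only the two PDF names. Either way the
Greek-named file is accepted, so the pattern alone would not explain the
missing documents.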

For the import I used the "gzip" option; for the build, the "no-text"
option. In the last run I set the verbosity to 4 for both operations
(import and build) to make sure I can see everything that

The indices concern the fields: Subject and Keywords, Title, Creator and
The browsing categories concern the same fields.

I am using the expert mode of the GLI.
I can supply the build log if needed, but I have not done so here to
avoid redundancy.

Thank you in advance,
George G.


George Giannakopoulos
PhD Student
Software & Knowledge Engineering Laboratory
Institute of Informatics & Telecommunications
National Center Of Scientific Research "Demokritos"
E-mail: ggianna@iit.demokritos.gr