a) I have a different number of texts in the aut
b) All files with non-Latin original filenames, e.g. "?????????218.pdf"
(Greek text), do not appear in the indexes (in the web interface) or on
the search page.
I have just created a GS collection based on a set of several hundred
PDF files, whose filenames are encoded in UTF-8 rather than Latin. Not
all files have been enriched with metadata.
The PDF files have no text, but are purely multi-page scanned images.
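
To narrow down problem (b), it may help to count how many source PDFs
have non-ASCII filenames and compare that number with the documents
missing from the indexes. The following is only a minimal sketch; the
function names and the import directory argument are hypothetical and
not part of Greenstone:

```python
import os

def has_non_ascii(name: str) -> bool:
    """Return True if the filename contains any character outside ASCII."""
    return any(ord(ch) > 127 for ch in name)

def split_pdf_names(names):
    """Split PDF filenames into (ascii_only, non_ascii) lists."""
    pdfs = [n for n in names if n.lower().endswith(".pdf")]
    ascii_only = [n for n in pdfs if not has_non_ascii(n)]
    non_ascii = [n for n in pdfs if has_non_ascii(n)]
    return ascii_only, non_ascii

def scan_import_dir(import_dir):
    """List the PDFs in a (hypothetical) collection import directory."""
    return split_pdf_names(sorted(os.listdir(import_dir)))
```

If the number of non-ASCII names equals the number of documents missing
from the indexes, that would suggest the filename encoding is the cause.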
The documents have been annotated with metadata concerning the author,
title, subject and keywords, and so forth, through the GLI enrich
process. The dc.* metadata have been used, especially the Subject and
Keywords, Title and Creator fields.
As I only need the search and indices to point to the original file,
without any text indexing, I have used the following plugins (in the
order and with the parameters shown):
UnknownPlug -assoc_field Source -file_format PDF -mime_type
application/pdf -process_exp (?i).pdf
MetadataXMLPlug (and I have tried parameters "-input_encoding auto
-default_encoding utf8" as well)
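
As a sanity check on the -process_exp value above, the sketch below
tests the same pattern against a few sample filenames. It assumes the
expression is applied as an unanchored search; Greenstone evaluates it
in Perl, so Python's re module is only an approximation, and the
function name is hypothetical:

```python
import re

# Pattern copied from the -process_exp value above.
PROCESS_EXP = r"(?i).pdf"

def matches_process_exp(filename: str) -> bool:
    """Return True if the pattern is found anywhere in the filename."""
    return re.search(PROCESS_EXP, filename) is not None

# Sample names only; the Greek one stands in for the affected files.
for name in ["κείμενο218.pdf", "SCAN.PDF", "notes.txt", "xpdf-notes.txt"]:
    print(name, matches_process_exp(name))
```

Under this assumption the pattern itself accepts Greek filenames, so it
is probably not what excludes them. Because the dot is unescaped and
the pattern is unanchored, it also matches names such as
"xpdf-notes.txt".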
For the import I have used the "gzip" option, and for the build the
"no-text" option. In the last run, I set the verbosity to 4 for both
operations (import and build) to make sure I can see everything that
The indices concern the fields: Subject and Keywords, Title, Creator and
The browsing categories concern the same fields.
I am using the expert mode of the GLI.
I can supply the build log if needed, but I have not done so here to
Thank you in advance
Software & Knowledge Engineering Laboratory
Institute of Informatics & Telecommunications
National Center Of Scientific Research "Demokritos"