Hello, all.
The problems:
-------------
a) I have a different number of texts in the aut
b) All files with non-Latin original filenames, e.g. "?????????218.pdf"
(Greek text), do not appear in the indexes (in the web interface) or on
the search page.
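To make (b) easier to reproduce, the affected files could be listed with
something like the following. This is only a minimal Python sketch: the
function names and the import directory argument are hypothetical, and
nothing here is part of Greenstone itself.

```python
import os


def non_ascii_filenames(names):
    """Return the names that contain at least one non-ASCII character."""
    return [n for n in names if not n.isascii()]


def scan(import_dir):
    """Walk import_dir (hypothetical path to the collection's import folder)
    and return the relative paths of files whose names are not pure ASCII."""
    hits = []
    for root, _dirs, files in os.walk(import_dir):
        for name in non_ascii_filenames(files):
            hits.append(os.path.relpath(os.path.join(root, name), import_dir))
    return sorted(hits)
```

Comparing the output of scan() with the documents missing from the
indexes would show whether every missing document has a non-ASCII name.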
The settings:
-------------
I have just created a GS collection based on a set of several hundred
PDF files whose filenames are not Latin-encoded but UTF-8. Not all
files have been enriched with metadata.
The PDF files have no text, but are purely multi-page scanned images.
The documents have been annotated with metadata concerning the author,
title, subject and keywords, and so forth, through the GLI Enrich process.
The dc.* metadata have been used, especially the Subject and Keywords,
Title and Creator fields.
As I only need the search and indices to point to the original file,
without any text indexing, I have used the following plugins (in the
order and with the parameters shown):
ZIPPlug
UnknownPlug -assoc_field Source -file_format PDF -mime_type
application/pdf -process_exp (?i).pdf
GAPlug
MetadataXMLPlug (and I have tried parameters "-input_encoding auto
-default_encoding utf8" as well)
For the import I have used the "gzip" option, and for the build the
"no-text" option. In the last run, I set the verbosity to 4 for both
operations (import and build) to make sure I can see everything that
happens.
The indices concern the fields: Subject and Keywords, Title, Creator and
Date.
The browsing categories concern the same fields.
I am using the expert mode of the GLI.
I can supply the build log if needed, but I have not done so here to
avoid redundancy.
Thank you in advance,
George G.
--
=======================================================
George Giannakopoulos
PhD Student
Software & Knowledge Engineering Laboratory
Institute of Informatics & Telecommunications
National Center Of Scientific Research "Demokritos"
E-mail: ggianna@iit.demokritos.gr
=======================================================