Re: [greenstone-users] Missing documents

From sw64@cs.waikato.ac.nz
Date Thu, 4 Jan 2007 14:44:55 +1300 (NZDT)
Subject Re: [greenstone-users] Missing documents
In-Reply-To (7-0-1-0-2-20070103183316-022c63a0-free-fr)

Hello John,

> When I built the collection, I found that two of
> the pdf documents were rejected (see below), and
> the others seemed to be processed normally. I
> believe that searching worked for the processed
> documents, but when I tried to display them in
> browsing classifiers, those with filenames of
> more than 36 characters (but which were handled
> without problems by Windows) would not display
> (at least with the default VList). When I
> shortened the filenames and tried again, I found
> that the documents with filenames with French
> accented characters would not display with the
> browsing classifiers (although they apparently
> did display when found by search). When I took
> out the accents, all 14 are displayed normally.
> Is this a bug or is there a way to get around it?

Thanks for pointing this out. I'll look into it.

> Concerning the two rejected pdf documents, one
> consists only of images, whereas the other seems
> to have been prepared in some sort of secure mode
> (for example, one cannot cut and paste from
> selected text). [Another pdf document consisting
> of images, but with an active table of contents,
> was processed normally.] Does anyone know how I
> could easily get the missed files into the
> collection with their metadata,

You can use UnknownPlug to get them into the collection.
Alternatively, you can add another PDFPlug with the "convert to" option to
convert the rejected PDFs to images. For more detail, please see
http://greenstone.sourceforge.net/wiki/gsdoc/tutorial/en/enhanced_pdf.htm
Either way, only limited metadata (no title, author, etc.) can be
extracted automatically; you will need to add the rest manually, either in
the Enrich pane or in the metadata.xml file.
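
If you have more than a couple of files to describe, the metadata.xml
entries could be generated rather than typed by hand. What follows is only a
rough sketch, not part of Greenstone: the file name, the dc.Title and
dc.Creator values, and the helper build_metadata_xml are all made up for
illustration. It assumes the usual DirectoryMetadata / FileSet / FileName /
Description / Metadata layout, and that FileName is treated as a regular
expression (hence the escaping).

```python
import re
import xml.etree.ElementTree as ET


def build_metadata_xml(entries):
    """Build metadata.xml content from {file name: {metadata name: value}}.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("DirectoryMetadata")
    for filename, fields in entries.items():
        fileset = ET.SubElement(root, "FileSet")
        # FileName is matched as a regular expression, so escape the
        # literal file name (the "." in ".pdf" in particular).
        ET.SubElement(fileset, "FileName").text = re.escape(filename)
        description = ET.SubElement(fileset, "Description")
        for name, value in fields.items():
            meta = ET.SubElement(description, "Metadata", name=name)
            meta.text = value
    return ET.tostring(root, encoding="unicode")


# Example values are invented; replace them with your own files and metadata.
xml_text = build_metadata_xml({
    "scanned_report.pdf": {
        "dc.Title": "Annual Report 2006",
        "dc.Creator": "Example Author",
    },
})
print(xml_text)
```

The files written by the Librarian Interface normally also begin with an XML
declaration and a DOCTYPE line, which this sketch leaves out, so it may be
simplest to paste the generated FileSet elements into an existing
metadata.xml in the same folder as the PDFs.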

> and if possible
> with the text of the rejected textual pdf file?

Sorry, Greenstone relies on third-party software to extract text from PDF
files, so there is nothing it can do for that file.

Regards
Shaoqun