Re: Re: [greenstone-users] Missing documents

From John Rose
Date Tue, 09 Jan 2007 22:04:03 +0100
Subject Re: Re: [greenstone-users] Missing documents
In-Reply-To (20070104023021-C5E403F232A-mail-alumni-caltech-edu)
Dear Shaoqun,

I have tried to follow the instructions on
enhanced pdf handling. I put the two rejected pdf
documents (one with only images, the other which
does not authorize copy or extraction) in a
sub-directory called Notext in the import
directory of the collection. I added an
additional PDFPlug before the normal one which
shows in the collect.cfg file as:

plugin PDFPlug -convert_to pagedimg_jpg -process_exp "Notext.*.pdf"
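
As a rough check of which paths that expression would select, here is a minimal Python sketch. It assumes that -process_exp is applied as an unanchored regular expression to the file path; Greenstone itself is written in Perl, so Python's re module is used only for illustration. The example paths and the matches() helper are hypothetical.

```python
import re

# The expression from the collect.cfg line above, applied unanchored (assumption).
PROCESS_EXP = re.compile(r"Notext.*.pdf")


def matches(path: str) -> bool:
    """Return True if the expression is found anywhere in the path."""
    return PROCESS_EXP.search(path) is not None


# Hypothetical example paths, not taken from the actual collection.
paths = [
    "import/Notext/scanned.pdf",
    "import/Notext/locked.pdf",
    "import/other/report.pdf",
]

for p in paths:
    print(p, matches(p))
```

Because the dot before "pdf" is unescaped, it matches any character, so the expression is slightly looser than it looks. A stricter form would be "Notext.*\.pdf$", though the looser one still selects everything under Notext.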

I have both ImageMagick and Ghostscript installed.

When I build the collection I don't get any
content for these documents when I click on the
icons for them in the html files in the browsing
display (just the document title). I noticed that
in the execution list the PDFPlug is not
mentioned, but rather only treatment of files in
the tmp sub-directory by HTMLPlug, and the last two lines are:

The file C:\Program Files\Greenstone\collect\sidavih\tmp\Formationdesorganismesrelais\Formationdesorganismesrelais.item
is being processed by PagedImgPlug.
The file C:\Program Files\Greenstone\collect\sidavih\tmp\FormationFAWECAM\FormationFAWECAM.item
is being processed by PagedImgPlug.

Could you please advise? Thanks, John


>Message: 7
>Date: Thu, 4 Jan 2007 14:44:55 +1300 (NZDT)
>From: sw64@cs.waikato.ac.nz
>Subject: Re: [greenstone-users] Missing documents
>To: "John Rose" <johnrose@alumni.caltech.edu>
>Cc: greenstone-users@list.scms.waikato.ac.nz
>Message-ID:
> <44636.130.217.244.2.1167875095.squirrel@webmail.scms.waikato.ac.nz>
>Content-Type: text/plain;charset=iso-8859-1
>
>Hello John,
>
> > When I built the collection, I found that two of
> > the pdf documents were rejected (see below), and
> > the others seemed to be processed normally. I
> > believe that searching worked for the processed
> > documents, but when I tried to display them in
> > browsing classifiers, those with filenames of
> > more than 36 characters (but which were handled
> > without problems by Windows) would not display
> > (at least with the default VList). When I
> > shortened the filenames and tried again, I found
> > that the documents with filenames with French
> > accented characters would not display with the
> > browsing classifiers (although they apparently
> > did display when found by search). When I took
> > out the accents, all 14 are displayed normally.
> > Is this a bug or is there a way to get around it?
>
>Thanks for pointing it out and I'll look into it.
>
> > Concerning the two rejected pdf documents, one
> > consists only of images, whereas the other seems
> > to have been prepared in some sort of secure mode
> > (for example, one cannot cut and paste from
> > selected text). [Another pdf document consisting
> > of images, but with an active table of contents,
> > was processed normally.] Does anyone know how I
> > could easily get the missed files into the
> > collection with their metadata,
>
>You can use UnknownPlug to get them into the collection.
>Or you can add another PDFPlug with the "convert to" option to convert the
>rejected pdfs to images. For more detail about this, please see
>http://greenstone.sourceforge.net/wiki/gsdoc/tutorial/en/enhanced_pdf.htm
>However, either way, only limited metadata (no title, author etc.) can
>be automatically extracted; you need to add more manually, either in the
>Enrich pane or in the metadata.xml file.
>
> > and if possible
> > with the text of the rejected textual pdf file?
>Sorry, Greenstone uses third-party software to extract text, so there is
>nothing it can do.
>
>Regards
>Shaoqun


John B. Rose
1 Bis, Rue des Châtre-Sacs
92310 Sèvres
France
Email: <johnrose@alumni.caltech.edu>