Re: Re: [greenstone-users] Missing documents

From sw64@cs.waikato.ac.nz
Date Thu, 11 Jan 2007 09:56:47 +1300 (NZDT)
Subject Re: Re: [greenstone-users] Missing documents
In-Reply-To (7-0-1-0-2-20070109195120-0228feb8-alumni-caltech-edu)
Hello John,

Sorry for the late reply.

What PDFPlug with the "convert_to pagedimg_jpg" option does is convert the
pages of a PDF file into separate images, which are ultimately processed by
PagedImgPlug. If you go to
"greenstone-home/collect/[collection]/archives/[HASHID]/", you will see all
the converted images. However, no text is extracted by either plugin, so a
notext PDF can only be viewed as separate images.

Ok, back to your problem. I think you missed two things:
1. In the Search Indexes section, tick the section checkbox so that the
indexes are built at section level (the converted images are treated as
sections).
2. Edit the DocumentText format statement to read:
{Or}{[srcicon],[Text]}
(this displays notext PDFs as images, and the text for the others)
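
For reference, here is a rough sketch of how the relevant lines of collect.cfg
might look once both changes are made. The plugin line is the one you quoted;
the "indexes" and "levels" lines are assumptions on my part, since the exact
form depends on whether the collection is built with MG or MGPP, so please
compare against what the Librarian Interface actually writes rather than
copying this verbatim.

```
# Assumed layout only; check against the collect.cfg written by the Librarian Interface.

# The image-converting PDFPlug must come before the normal one.
plugin PDFPlug -convert_to pagedimg_jpg -process_exp "Notext.*.pdf"
plugin PDFPlug

# MG collections (assumption): name the level in each index.
indexes section:text document:text

# MGPP collections (assumption): list the levels separately instead.
# levels document section

# Show the page image if there is one, otherwise the extracted text.
format DocumentText "{Or}{[srcicon],[Text]}"
```

After editing the file by hand, the collection needs to be rebuilt for the
new indexes to take effect.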

Regards
Shaoqun

> Dear Shaoqun,
>
> I have tried to follow the instructions on
> enhanced pdf handling. I put the two rejected pdf
> documents (one with only images, the other which
> does not authorize copy or extraction) in a
> sub-directory called Notext in the import
> directory of the collection. I added an
> additional PDFPlug before the normal one which
> shows in the collect.cfg file as:
>
> plugin PDFPlug -convert_to pagedimg_jpg -process_exp
> "Notext.*.pdf"
>
> I have both ImageMagick and Ghostscript installed.
>
> When I build the collection I don't get any
> content for these documents when I click on the
> icons for them in the html files in the browsing
> display (just the document title). I noticed that
> in the execution list the PDFPlug is not
> mentioned, but rather only treatment of files in
> the tmp sub-directory by HTMLPlug, and the last two lines are:
>
> The file C:\Program
> Files\Greenstone\collect\sidavih\tmp\Formationdesorganismesrelais\Formationdesorganismesrelais.item
> is being processed by PagedImgPlug.
> The file C:\Program
> Files\Greenstone\collect\sidavih\tmp\FormationFAWECAM\FormationFAWECAM.item
> is being processed by PagedImgPlug.
>
> Could you please advise? Thanks, John
>
>
>>Message: 7
>>Date: Thu, 4 Jan 2007 14:44:55 +1300 (NZDT)
>>From: sw64@cs.waikato.ac.nz
>>Subject: Re: [greenstone-users] Missing documents
>>To: "John Rose" <johnrose@alumni.caltech.edu>
>>Cc: greenstone-users@list.scms.waikato.ac.nz
>>Message-ID:
>> <44636.130.217.244.2.1167875095.squirrel@webmail.scms.waikato.ac.nz>
>>Content-Type: text/plain;charset=iso-8859-1
>>
>>Hello John,
>>
>> > When I built the collection, I found that two of
>> > the pdf documents were rejected (see below), and
>> > the others seemed to be processed normally. I
>> > believe that searching worked for the processed
>> > documents, but when I tried to display them in
>> > browsing classifiers, those with filenames of
>> > more than 36 characters (but which were handled
>> > without problems by Windows) would not display
>> > (at least with the default VList). When I
>> > shortened the filenames and tried again, I found
>> > that the documents with filenames with French
>> > accented characters would not display with the
>> > browsing classifiers (although they apparently
>> > did display when found by search). When I took
>> > out the accents, all 14 are displayed normally.
>> > Is this a bug or is there a way to get around it?
>>
>>Thanks for pointing it out and I'll look into it.
>>
>> > Concerning the two rejected pdf documents, one
>> > consists only of images, whereas the other seems
>> > to have been prepared in some sort of secure mode
>> > (for example, one cannot cut and paste from
>> > selected text). [Another pdf document consisting
>> > of images, but with an active table of contents,
>> > was processed normally.] Does anyone know how I
>> > could easily get the missed files into the
>> > collection with their metadata,
>>
>>You can use UnknownPlug to get them into the collection.
>>Or you can add another PDFPlug with the "convert_to" option to convert the
>>rejected pdfs to images. For more detail about this, please see
>>http://greenstone.sourceforge.net/wiki/gsdoc/tutorial/en/enhanced_pdf.htm
>>However, either way, only limited metadata (no title, author etc.) can
>>be automatically extracted; you need to add more manually, either in the
>>Enrich pane or in the metadata.xml file.
>>
>> > and if possible
>> > with the text of the rejected textual pdf file?
>>Sorry, Greenstone uses third-party software to extract the text, so there is
>>nothing it can do.
>>
>>Regards
>>Shaoqun
>
>
> John B. Rose
> 1 Bis, Rue des Châtre-Sacs
> 92310 Sèvres
> France
> Email: <johnrose@alumni.caltech.edu>
>
>
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
>
>