Re: Re: [greenstone-users] Missing documents

From sw64@cs.waikato.ac.nz
Date Thu, 25 Jan 2007 10:27:03 +1300 (NZDT)
Subject Re: Re: [greenstone-users] Missing documents
In-Reply-To (7-0-1-0-2-20070111211013-02291400-free-fr)
Hello John,
It would be great if you could send me that PDF file; I'd like to try it here.

Regards
Shaoqun


> Dear Shaoqun,
>
> My problem was indeed the failure to check the
> section checkbox in the Search Indexes section
> (which I understand will not be necessary in future
> versions unless you really want section level
> indexes, as opposed to document pages treated by Greenstone as sections).
>
> My two plugins to handle respectively the pdf
> files to be treated as images and those to be treated as text are:
>
> plugin PDFPlug -convert_to pagedimg_jpg -process_exp Notext.*.pdf
> plugin PDFPlug -convert_to html
>
> The two pdf files which were previously rejected
> by PDFPlug (one which was only images, and the
> other which was generated so as not to be
> copiable) were put in the Notext directory and
> yielded image displays as expected.
>
> However, I have another pdf file which is also
> only images, except that one of these is
> referenced as a table of contents (the file is
> 3.5 MB and will not go by email, but I could FTP
> it to you). When I try to treat this as image
> data by putting it in the Notext directory, I get
> a display with all blank pages (one for each page
> in the document). Is there anything I can do to fix
> this (I guess it is because the file can be
> treated by the normal PDFPlug parameters, even
> though this does not yield any text extraction)?
>
> Thanks and best regards,
> John
>
>
> At 21:56 10/01/2007, you wrote:
>>Hello John,
>>
>>Sorry for the late reply.
>>
>>What PDFPlug with the "convert_to pagedimg_jpg" option does is convert the
>>pages of a PDF file into separate images, which are ultimately processed by
>>PagedImgPlug. If you go to
>>"greenstone-home/collect/[collection]/archives/[HASHID]/", you will see all
>>the converted images. However, no text is extracted by either plugin, so a
>>no-text PDF is viewed as separate images instead.
>>
>>OK, back to your problem. I think you missed two things:
>> 1. In the Search Indexes section, check the section checkbox to build
>>the indexes on section level (the converted images%