Re: Re: [greenstone-users] Missing documents

From John Rose
Date Wed, 24 Jan 2007 17:12:58 +0100
Subject Re: Re: [greenstone-users] Missing documents
Dear Shaoqun,

My problem was indeed the failure to check the
section checkbox in the Search Indexes section
(which I understand will not be necessary in future
versions unless you really want section-level
indexes, as opposed to document pages treated by Greenstone as sections).

My two plugins, which handle respectively the pdf
files to be treated as images and those to be treated as text, are:

plugin PDFPlug -convert_to pagedimg_jpg -process_exp Notext.*.pdf
plugin PDFPlug -convert_to html
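
To illustrate why the order of these two lines matters, here is a minimal Python sketch of first-match-wins routing as I understand it: a file is offered to each plugin in the order listed, and the first one whose -process_exp matches takes it. This is only a model, not Greenstone code. The route function and the PLUGINS list are hypothetical names, and the pattern given for the second PDFPlug is an assumption standing in for its built-in default, since that line sets no -process_exp.

```python
import re

# Hypothetical model of the two collect.cfg lines: (label, process_exp).
# The second pattern is an ASSUMPTION standing in for PDFPlug's default,
# because that plugin line does not set -process_exp explicitly.
PLUGINS = [
    ("PDFPlug -convert_to pagedimg_jpg", r"Notext.*.pdf"),
    ("PDFPlug -convert_to html", r"(?i)\.pdf$"),
]


def route(path):
    """Return the label of the first plugin whose pattern matches path."""
    for label, pattern in PLUGINS:
        if re.search(pattern, path):
            return label
    return None  # no plugin in this model claims the file


if __name__ == "__main__":
    for path in ("import/Notext/scan.pdf", "import/report.pdf"):
        print(path, "->", route(path))
```

Note that the dots in Notext.*.pdf are unescaped, so each matches any character. A stricter pattern such as Notext.*\.pdf$ would probably be closer to the intent, though the looser one works for files that really do sit in the Notext directory.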

The two pdf files which were previously rejected
by PDFPlug (one which was only images, and the
other which was generated so as not to be
copiable) were put in the Notext directory and
yielded image displays as expected.

However, I have another pdf file which is also
only images, except that one of these is
referenced as a table of contents (the file is
3.5 MB and will not go by email, but I could FTP
it to you). When I try to treat this as image
data by putting it in the Notext directory, I get
a display with all blank pages (one for each page
in the document). Is there anything I can do to fix
this (I guess it is because the file can be
treated by the normal PDFPlug parameters, even
though this does not yield any text extraction)?

Thanks and best regards, John


At 21:56 10/01/2007, you wrote:
>Hello John,
>
>Sorry for the late reply.
>
>What PDFPlug with the "convert_to pagedimg_jpg" option does is convert the
>pages of a pdf file into separate images, which are ultimately processed by
>PagedImgPlug. If you go to
>"greenstone-home/collect/[collection]/archives/[HASHID]/", you will see all
>the converted images. However, no text is extracted by either plugin; a notext
>pdf can only be viewed as separate images instead.
>Ok, back to your problem. I think you missed two things:
> 1. In the Search Indexes section, check the section checkbox to build
>the indexes on section level (the converted images are treated as
>sections)
> 2. Edit the DocumentText format statement to read as
> {Or}{[srcicon],[Text]}
> (display notext pdf as images, and text for others)
>Regards
>Shaoqun
>
>
>
> > Dear Shaoqun,
> >
> > I have tried to follow the instructions on
> > enhanced pdf handling. I put the two rejected pdf
> > documents (one with only images, the other which
> > does not authorize copy or extraction) in a
> > sub-directory called Notext in the import
> > directory of the collection. I added an
> > additional PDFPlug before the normal one which
> > shows in the collect.cfg file as:
> >
> > plugin PDFPlug -convert_to pagedimg_jpg -process_exp
> > "Notext.*.pdf"
> >
> > I have both ImageMagick and Ghostscript installed.
> >
> > When I build the collection I don't get any
> > content for these documents when I click on the
> > icons for the html files in the browsing
> > display (just the document title). I noticed that
> > in the execution list the PDFPlug is not
> > mentioned, but rather only treatment of files in
> > the tmp sub-directory by HTMLPlug, and the last two lines are:
> >
> > The file C:\Program
> > Files\Greenstone\collect\sidavih\tmp\Formationdesorganismesrelais\Formationdesorganismesrelais.item
> > is being processed by PagedImgPlug.
> > The file C:\Program
> > Files\Greenstone\collect\sidavih\tmp\FormationFAWECAM\FormationFAWECAM.item
> > is being processed by PagedImgPlug.
> >
> > Could you please advise? Thanks, John
> >
> >
> >>Message: 7
> >>Date: Thu, 4 Jan 2007 14:44:55 +1300 (NZDT)
> >>From: sw64@cs.waikato.ac.nz
> >>Subject: Re: [greenstone-users] Missing documents
> >>To: "John Rose" <johnrose@alumni.caltech.edu>
> >>Cc: greenstone-users@list.scms.waikato.ac.nz
> >>Message-ID: <44636.130.217.244.2.1167875095.squirrel@webmail.scms.waikato.ac.nz>
> >>Content-Type: text/plain;charset=iso-8859-1
> >>
> >>Hello John,
> >>
> >> > When I built the collection, I found that two of
> >> > the pdf documents were rejected (see below), and
> >> > the others seemed to be processed normally. I
> >> > believe that searching worked for the processed
> >> > documents, but when I tried to display them in
> >> > browsing classifiers, those with filenames of
> >> > more than 36 characters (but which were handled
> >> > without problems by Windows) would not display
> >> > (at least with the default VList). When I
> >> > shortened the filenames and tried again, I found
> >> > that the documents with filenames with French
> >> > accented characters would not display with the
> >> > browsing classifiers (although they apparently
> >> > did display when found by search). When I took
> >> > out the accents, all 14 are displayed normally.
> >> > Is this a bug or is there a way to get around it?
> >>
> >>Thanks for pointing it out and I'll look into it.
> >>
> >> > Concerning the two rejected pdf documents, one
> >> > consists only of images, whereas the other seems
> >> > to have been prepared in some sort of secure mode
> >> > (for example, one cannot cut and paste from
> >> > selected text). [Another pdf document consisting
> >> > of images, but with an active table of contents,
> >> > was processed normally.] Does anyone know how I
> >> > could easily get the missed files into the
> >> > collection with their metadata,
> >>
> >>You can use UnknownPlug to get them into the collection.
> >>Or you can add another PDFPlug with the "convert_to" option to convert the
> >>rejected pdfs to images; for more detail about this, please see
> >>http://greenstone.sourceforge.net/wiki/gsdoc/tutorial/en/enhanced_pdf.htm
> >>However, either way, only limited metadata (no title, author etc.) can
> >>be automatically extracted; you need to add more manually, either in the
> >>Enrich pane or in the metadata.xml file.
> >>
> >> > and if possible
> >> > with the text of the rejected textual pdf file?
> >>Sorry, Greenstone uses third-party software to extract text, so there is
> >>nothing it can do.
> >>
> >>Regards
> >>Shaoqun
> >
> >
> > John B. Rose
> > 1 Bis, Rue des Châtre-Sacs
> > 92310 Sèvres
> > France
> > Email: <johnrose@alumni.caltech.edu>


John B. Rose
1 Bis, Rue des Châtre-Sacs
92310 Sèvres
France
Email: <johnrose@alumni.caltech.edu>