MORE: Re: Re: [greenstone-users] Missing documents

From John Rose
Date Wed, 10 Jan 2007 09:46:47 +0100
Subject MORE: Re: Re: [greenstone-users] Missing documents
Dear Shaoqun,

I should have added that my attempt to modify the
VList according to the "enhanced pdf" instructions was:

<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dls.Title],[dc.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
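
For reference, the {Or} in the title cell can be read as "use the first metadata field that has a value, otherwise the literal Untitled". A minimal Python sketch of that reading follows; the function name first_available and the sample metadata values are hypothetical, used only for illustration, and are not part of Greenstone:

```python
def first_available(metadata, keys, default="Untitled"):
    """Return the first non-empty value among keys, else the default."""
    for key in keys:
        value = metadata.get(key)
        if value:  # skip missing or empty values
            return value
    return default


# Hypothetical metadata for one document: only ex.Title is set.
doc = {"ex.Title": "Formation FAWECAM", "ex.Source": "FormationFAWECAM.pdf"}
title = first_available(doc, ["dls.Title", "dc.Title", "ex.Title"])
print(title)
```

If none of the three title fields is set, the sketch falls back to "Untitled", which mirrors the last alternative in the format string.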

Best regards, waiting, John


>Date: Tue, 09 Jan 2007 22:04:03 +0100
>From: John Rose <johnrose@alumni.caltech.edu>
>Subject: Re: Re: [greenstone-users] Missing documents
>To: greenstone-users@list.scms.waikato.ac.nz
>Message-ID: <7.0.1.0.2.20070109195120.0228feb8@alumni.caltech.edu>
>Content-Type: text/plain; charset="iso-8859-1"; format=flowed
>
>Dear Shaoqun,
>
>I have tried to follow the instructions on
>enhanced pdf handling. I put the two rejected pdf
>documents (one with only images, the other which
>does not authorize copy or extraction) in a
>sub-directory called Notext in the import
>directory of the collection. I added an
>additional PDFPlug before the normal one which
>shows in the collect.cfg file as:
>
>plugin PDFPlug -convert_to pagedimg_jpg -process_exp "Notext.*.pdf"
>
>I have both ImageMagick and Ghostscript installed.
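
The -process_exp value is presumably treated as a regular expression tested against each file's path, in which case the unescaped dots in "Notext.*.pdf" match any character. A rough Python sketch of that matching is given here as an assumption-laden illustration: Python's re module stands in for Greenstone's own matching, and the sample paths are made up.

```python
import re

# Pattern copied from the collect.cfg line above.
PROCESS_EXP = re.compile(r"Notext.*.pdf")


def is_selected(path):
    """Return True if the pattern matches anywhere in the path."""
    return PROCESS_EXP.search(path) is not None


# Hypothetical paths inside the collection's import directory.
paths = [
    r"import\Notext\scan.pdf",
    r"import\report.pdf",
    r"import\Notext\notes.txt",
]
selected = [p for p in paths if is_selected(p)]
print(selected)
```

Under these assumptions only the pdf inside the Notext sub-directory is selected, so the pattern itself would not explain why the extra PDFPlug is missing from the execution list.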
>
>When I build the collection I don't get any
>content for these documents when I click on the
>icons for the html files in the browsing
>display (just the document title). I noticed that
>in the execution list the PDFPlug is not
>mentioned, but rather only treatment of files in
>the tmp sub-directory by HTMLPlug, and the last two lines are:
>
>The file C:\Program Files\Greenstone\collect\sidavih\tmp\Formationdesorganismesrelais\Formationdesorganismesrelais.item
>is being processed by PagedImgPlug.
>The file C:\Program Files\Greenstone\collect\sidavih\tmp\FormationFAWECAM\FormationFAWECAM.item
>is being processed by PagedImgPlug.
>
>Could you please advise? Thanks, John
>
>
> >Message: 7
> >Date: Thu, 4 Jan 2007 14:44:55 +1300 (NZDT)
> >From: sw64@cs.waikato.ac.nz
> >Subject: Re: [greenstone-users] Missing documents
> >To: "John Rose" <johnrose@alumni.caltech.edu>
> >Cc: greenstone-users@list.scms.waikato.ac.nz
> >Message-ID: <44636.130.217.244.2.1167875095.squirrel@webmail.scms.waikato.ac.nz>
> >Content-Type: text/plain;charset=iso-8859-1
> >
> >Hello John,
> >
> > > When I built the collection, I found that two of
> > > the pdf documents were rejected (see below), and
> > > the others seemed to be processed normally. I
> > > believe that searching worked for the processed
> > > documents, but when I tried to display them in
> > > browsing classifiers, those with filenames of
> > > more than 36 characters (but which were handled
> > > without problems by Windows) would not display
> > > (at least with the default VList). When I
> > > shortened the filenames and tried again, I found
> > > that the documents with filenames with French
> > > accented characters would not display with the
> > > browsing classifiers (although they apparently
> > > did display when found by search). When I took
> > > out the accents, all 14 are displayed normally.
> > > Is this a bug or is there a way to get around it?
> >
> >Thanks for pointing it out and I'll look into it.
> >
> > > Concerning the two rejected pdf documents, one
> > > consists only of images, whereas the other seems
> > > to have been prepared in some sort of secure mode
> > > (for example, one cannot cut and paste from
> > > selected text). [Another pdf document consisting
> > > of images, but with an active table of contents,
> > > was processed normally.] Does anyone know how I
> > > could easily get the missed files into the
> > > collection with their metadata,
> >
> >You can use UnknownPlug to get them into the collection.
> >Or you can add another PDFPlug with the "convert_to" option to convert the
> >rejected pdfs to images; for more detail about this, please see
> >http://greenstone.sourceforge.net/wiki/gsdoc/tutorial/en/enhanced_pdf.htm
> >However, either way, only limited metadata (no title, author etc.) can
> >be automatically extracted; you need to add more manually, either in the
> >Enrich pane or in the metadata.xml file.
> >
> > > and if possible
> > > with the text of the rejected textual pdf file?
> >Sorry, Greenstone uses third-party software to extract text, so there is
> >nothing it can do.
> >
> >Regards
> >Shaoqun
>
>
> John B. Rose
> 1 Bis, Rue des Châtre-Sacs
> 92310 Sèvres
> France
> Email: <johnrose@alumni.caltech.edu>


John B. Rose
1 Bis, Rue des Châtre-Sacs
92310 Sèvres
France
Email: <johnrose@alumni.caltech.edu>