Re: [greenstone-users] PDFPplug failed!!!

From Stefan Boddie
Date Sat, 14 Jun 2003 10:19:50 +1200
Subject Re: [greenstone-users] PDFPplug failed!!!
In-Reply-To (20030614024046I-peterson-notredame-ac-jp)
Hi Greg,

It sounds like everything's working ok for you. Two things though:

1. The mysterious "Microsoft Word - Develop-en.doc" is likely to be the
title metadata that's been extracted from the PDF document. The Greenstone
manuals were created in MS Word and then converted to PDF, so Word probably
set this metadata.

2. The PDFPlug plugin uses a third party converter called pdftohtml to
convert the PDF to HTML for indexing. It does a fair job of extracting all
the text but the HTML doesn't always come out looking much like the original
PDF. Some PDFs are converted well while others are not so good.
Unfortunately the Greenstone manuals come out quite badly as you've seen.
Your options for overcoming this are:
a) Display only the original PDF document and hide the extracted HTML from
the end user. You can do that by adding a format statement similar to the
following to your collection's collect.cfg file:
format VList "<td valign=top>[srclink][srcicon][/srclink]</td>"
b) If you have Ghostscript installed you can use PDFPlug's -complex option
to improve the look of the output HTML. This will slow down the import
process but in most cases will significantly improve the way the HTML looks.
To use this option add "-complex" to the end of the "plugin PDFPlug" line in
your collection's collect.cfg file, then re-import and build. (Note that you
must have at least gsdl-2.39 to use -complex.)
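
As a rough sketch, the relevant lines of collect.cfg would end up looking
something like the following. The first line is option (a) and the second is
option (b). This assumes your existing plugin line is just "plugin PDFPlug"
with no other options on it; if it already has options, keep them and put
-complex at the end. All other lines in your collect.cfg stay as they are.

```
format VList "<td valign=top>[srclink][srcicon][/srclink]</td>"
plugin PDFPlug -complex
```

If you add -complex, remember that it only takes effect once you re-import
and build the collection.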


----- Original Message -----
From: "Greg Peterson"
Sent: Saturday, June 14, 2003 5:40 AM
Subject: Re: [greenstone-users] PDFPplug failed!!!

> I am also having some PDF trouble.
> On Fri, 13 Jun 2003 14:47:09 +0530
> "Ravi Verma" wrote:
> ...
> > I am using Greenstone 2.38 on Windows NT. When I build a collection of
> > PDF documents, PDFPlug fails.
> > Exact error msg is "document.pdf: PDFPlug failed to convert to HTML".
> ...
> I just installed Greenstone-2.39 on Solaris (SPARC) and
> then built a collection of the four Greenstone PDF documents.
> When I view the "titles a-z" I see four entries like this:
> [text icon] [PDF icon] Microsoft Word - Develop-en.doc
> (Develop-2.39-en.pdf)
> I have no idea where the Microsoft Word doc stuff comes from.
> When I built the collection, I just left the configuration (plugins,
> etc.) as it appeared.
> One problem was that some Perl scripts were not executable, so
> I changed the permissions (chmod 0755 $GSDLHOME/bin/scripts/*.pl).
> But that did not change the output of the build process.
> Also, I found that viewing this collection of PDF files as text with
> a graphical browser does not work. When I click the text icon, the
> text briefly appears, and then it is replaced by a "double-wide"
> Greenstone image that overlaps the narrower image along the left
> side. I set my preferences to "textual interface", but that icon on
> the left side still replaces the text. I tried Netscape 7
> (GNU/Linux Japanese version) on FreeBSD, Microsoft Internet
> Explorer 5.5 (Japanese) on Windows ME, and Mozilla 1.4b on Microsoft
> Windows ME.
> When I view the collection with a text-mode browser (lynx or w3m),
> it works okay. These browsers do not use JavaScript, so I suspect a
> problem in the Greenstone JavaScript code.
> This problem does not occur with a collection of e-mail messages
> that were in separate files. I have not tried other file formats.
> Greg Peterson
> Kyoto Notre Dame University