Re: [greenstone-users] PDFPplug failed!!!

From John R. McPherson
Date Sat, 14 Jun 2003 10:27:18 +1200
Subject Re: [greenstone-users] PDFPplug failed!!!
In-Reply-To (20030614024046I-peterson-notredame-ac-jp)
On Sat, Jun 14, 2003 at 02:40:46AM +0900, Greg Peterson wrote:
> I am also having some PDF trouble.

Please, everyone, there is no need to cross-post messages to both the
greenstone-users list and the greenstone-devel list.

> On Fri, 13 Jun 2003 14:47:09 +0530
> "Ravi Verma" <> wrote:
> ...
> > I am using Greenstone 2.38 on windows NT. when i build a collection for some
> > pdf documents, PDFPplug failed.
> > Exact error msg is "document.pdf: PDFPlug failed to convert to HTML". Pls
> ...

There are lots of reasons why a PDF file can't be converted to HTML.
Greenstone uses a third-party program called "pdftohtml" (which we
distribute with Greenstone) to perform this task.

> When I view the "titles a-z" I see four entries like this:
> [text icon] [PDF icon] Microsoft Word - Develop-en.doc
> (Develop-2.39-en.pdf)
> I have no idea where the Microsoft Word doc stuff comes from.

In this case, the PDF file was created from a Word document. Word
and/or Adobe Acrobat must have given that title to the PDF.

I don't have time right now to list all the reasons, but some common
reasons for pdftohtml failing include:

* the PDF file being encrypted or not permitting text extraction

* the PDF file not containing text, but images of text (e.g. if created
by scanning in documents)

* the PDF file using bitmapped fonts instead of built-in/PostScript fonts

Also, I believe that the pdftohtml.exe binary shipped in the Windows
version of Greenstone 2.38 and earlier had a problem where it couldn't
process some PDF files that internally used ZIP compression.
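
As a rough way to check a problem file for the warning signs above,
something like the following Python sketch might help. It is not part of
Greenstone or pdftohtml; the function name pdf_warning_signs is made up
for illustration, and it only scans the raw bytes for a few PDF keywords,
so it is a heuristic at best. It assumes the relevant dictionaries are
stored uncompressed; files that pack their objects into compressed
streams can hide these keywords and give misleading hints.

```python
import re


def pdf_warning_signs(data: bytes) -> list:
    """Return rough hints about why text extraction from a PDF might fail.

    Hypothetical helper: a keyword scan over raw bytes, not a PDF parser.
    """
    hints = []
    if not data.startswith(b"%PDF-"):
        hints.append("not a PDF file (missing %PDF- header)")
        return hints
    # An /Encrypt entry means the file is encrypted or carries
    # permission restrictions.
    if re.search(rb"/Encrypt\b", data):
        hints.append("encrypted or permission-restricted")
    # No font resources at all usually means scanned page images.
    if not re.search(rb"/Font\b", data):
        hints.append("no fonts found (probably scanned images of text)")
    # Bitmapped fonts are often embedded as Type 3 fonts.
    if re.search(rb"/Subtype\s*/Type3\b", data):
        hints.append("uses Type 3 fonts (possibly bitmapped)")
    # FlateDecode is the ZIP-style (zlib) compression filter.
    if re.search(rb"/FlateDecode\b", data):
        hints.append("uses Flate (ZIP) compression")
    return hints


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        for hint in pdf_warning_signs(f.read()):
            print(hint)
```

Run it with the PDF's path as the only argument. An empty result does
not mean the file will convert, only that none of these particular
signs were spotted.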

Hope this helps
John McPherson