Re: [greenstone-users] utf8 and ISO_8859-1 formats

From Stefan Boddie
Date Sat, 27 Sep 2003 10:56:20 +1200
Subject Re: [greenstone-users] utf8 and ISO_8859-1 formats
In-Reply-To (1064341224-3f708ee8e0526-webmail-tulane-edu)
> Hello All,
> Whenever I have PDFs in my collections, I get the message:
> >Converting xxx.pdf to HTML format.
> >PDFPlug: WARNING: collect mpxxx.html was read using utf8 encoding but
> appears to be encoded as ISO_8859_1
> >PDFPlug: passing xxx.pdf on to HTMLPlug
> >HTMLPlug: processing xxx.html"
> And now I get a similar message but "encoded as big5".
> Could it be the reason why the HTMLs do not look so good? How can I see
> or change the formats? What do I have to do to correct this?

No, these error messages are quite normal. Greenstone's PDF and Word plugins
always force the use of UTF-8 because that's the output encoding used by the
pdftohtml and wvWare programs used by the plugins to convert these formats
to HTML. When the converted HTML files are passed to HTMLPlug it sometimes
complains because it attempts to detect the encoding for itself and gets it
wrong.

Anyway, it should be safe to ignore these messages.
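
To illustrate why an encoding detector can be fooled, here is a minimal
Python sketch. It is not Greenstone's or HTMLPlug's actual detection code;
the function name is_valid_utf8 and the sample text are made up for the
example. Any byte sequence decodes as ISO-8859-1 without error, so a
successful ISO-8859-1 decode proves nothing, whereas a strict UTF-8 decode
can fail and is therefore a stronger check.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample text; "é" becomes two bytes (0xC3 0xA9) in UTF-8.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")

# The UTF-8 bytes pass the strict check; the ISO-8859-1 bytes do not.
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))

# Decoding the UTF-8 bytes as ISO-8859-1 never raises an error; it just
# yields the wrong characters, which is why a guess can go wrong silently.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)
```

If the converted HTML passes a strict UTF-8 check like this one, the
warning is most likely a false alarm.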

> Additionally, I have a PDF that "PDFPlug failed to convert to HTML". Its
> size is 925KB (127 pages). Could the size be the problem? What is the limit,
> if any?

Maybe it's a size thing or maybe it's just a PDF that was created in some
way that pdftohtml can't handle. Try running pdftohtml on the PDF directly
from the command line. That may give you more of a clue as to why it's
failing. There's also a newer version of pdftohtml available. You could try
downloading the new version
and seeing if it's any better at converting your problem document.
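
The suggestion to run pdftohtml by hand can also be scripted. The Python
sketch below is only an illustration under a few assumptions: pdftohtml is
on your PATH, it is called in its basic "pdftohtml <PDF-file> <html-file>"
form with no extra options (Greenstone itself may pass different options),
and the helper names run_and_report and diagnose_pdf are made up for the
example.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit_code, combined stdout and stderr text)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace"
        )
    except FileNotFoundError:
        # The executable is missing; 127 mirrors the usual shell convention.
        return 127, f"{cmd[0]}: not found on PATH"
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def diagnose_pdf(pdf_path, out_base):
    """Run pdftohtml on one PDF and return its exit code and messages.

    Assumption: the basic two-argument form with no options is enough to
    reproduce the failure.
    """
    return run_and_report(["pdftohtml", pdf_path, out_base])


# Example use (file names are placeholders):
#   code, messages = diagnose_pdf("problem.pdf", "problem_out")
#   print(code, messages, file=sys.stderr)
```

A non-zero exit code, or any error text in the messages, should point to
what pdftohtml dislikes about the file.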

Good luck,

> Thanks for your help
> Margarita Echeverri
> Payson Center - Tulane University