Re: PDFPlug from batchplug.zip on win98

From Stefan Boddie
Date Wed, 26 Feb 2003 08:58:31 +1300
Subject Re: PDFPlug from batchplug.zip on win98
In-Reply-To (465d66306ee872a9be30fa2df2dd4eea-www1-mail-post-cz)
Hi Roman,

The PDF plugin ignores the -input_encoding option (as does the Word plugin)
because the output from pdftohtml is always utf-8 encoded. Greenstone still
goes ahead and attempts to detect the encoding itself, though, hence the
warning it generates saying that PDFPlug is using utf-8 but the encoding
detection code thinks it's something else. In the case of utf-8 encoded text
the encoding detection package is normally wrong, because it doesn't have
support for utf-8 in most languages.

The main points of all this are:

* PDFPlug and WordPlug will ignore -input_encoding options.

* PDFPlug and WordPlug will always treat input as utf-8 as their converters
(pdftohtml and wvWare respectively) are set up to produce utf-8.

* You can safely ignore the warning messages about documents being read
using the wrong encoding.
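
To illustrate why the warning is harmless, here is a minimal Python sketch
(not Greenstone code; the helper name decodes_as and the sample sentence are
made up for the example). It shows that utf-8 text is also a "valid" byte
sequence in a single-byte encoding such as iso_8859-2, which is why a
detector without utf-8 support for a language can plausibly guess wrong:

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: a Czech sentence, encoded the way the converter
# output is said to be encoded (utf-8).
original = "Příliš žluťoučký kůň"
sample = original.encode("utf-8")

# The bytes are valid utf-8 ...
print(decodes_as(sample, "utf-8"))
# ... but they also decode without error as iso_8859-2, so validity alone
# cannot tell a detector which encoding was really used.
print(decodes_as(sample, "iso8859_2"))
# Decoding with the wrong encoding succeeds but does not give back the text.
print(sample.decode("iso8859_2") == original)
```

If the text looks right when read as utf-8, the detector's guess can be
disregarded.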

If your PDF documents are being converted to rubbish by Greenstone, it's most
likely because pdftohtml is failing to convert them correctly (as I
discovered recently with Vietnamese PDFs). If that's the case, try running
pdftohtml.exe directly (see gsdl\bin\script\pdftohtml.pl for the
command line Greenstone uses to run pdftohtml.exe) and see if it's
converting your PDFs to HTML correctly. If it's not, there's not much we can
do about it, other than attempting to patch pdftohtml ourselves or asking
the pdftohtml maintainers if they can help.
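
One way to test the converter outside Greenstone is sketched below in
Python. This is an assumption-laden example, not what Greenstone does: it
calls pdftohtml with only the input PDF and an output base name (no
options), whereas the real command line is the one in pdftohtml.pl. The
function names and the idea of globbing for the generated HTML files are
made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(exe, pdf, out_base):
    """Plain invocation: executable, input PDF, output base name.

    Assumption: no options are passed here. Copy the options from
    pdftohtml.pl if you want to reproduce Greenstone's behaviour.
    """
    return [str(exe), str(pdf), str(out_base)]


def non_utf8_files(paths):
    """Return the files whose contents are not valid utf-8."""
    bad = []
    for path in paths:
        try:
            Path(path).read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(Path(path))
    return bad


def convert_and_check(exe, pdf, out_dir):
    """Run the converter, then list the HTML it produced and any
    files that fail to decode as utf-8."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    subprocess.run(build_command(exe, pdf, out_dir / stem), check=True)
    # Assumption: the generated files start with the output base name.
    html_files = sorted(out_dir.glob(stem + "*.html"))
    return html_files, non_utf8_files(html_files)
```

A call such as convert_and_check("pdftohtml.exe", "test.pdf", "out") would
return the generated HTML files and those that are not valid utf-8. Valid
utf-8 does not prove the text is right, so the HTML still needs to be opened
in a browser and checked by eye.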

Stefan.

----- Original Message -----
From: "Roman Chyla" <r.ca@post.cz>
To: "greenstone users" <greenstone@tripath.colosys.net>
Sent: Wednesday, February 26, 2003 4:11 AM
Subject: PDFPlug from batchplug.zip on win98


> Hi,
> I installed files from PDFPlug batch (I placed the files in their
> right destinations)
> http://www.greenstone.org/gsdl-patch-pdfplug.zip
>
> It seems that on Windows 98 there were no encoding options sent
> to PDFPlugin.
>
> I tried
>
> PDFPlug -default_endoding auto
> utf8
> windows_1250)
> The message remained the same: Warning the document was read
> using utf8 encoding but appears to be iso_8859-2 encoded (which
> was not right :-)
>
> cheers
> Roman Chyla
>