Re: [greenstone-devel] PDFPlug vs pdftohtml encoding scheme

From John R. McPherson
Date Thu, 20 Nov 2003 22:57:34 +1300
Subject Re: [greenstone-devel] PDFPlug vs pdftohtml encoding scheme
In-Reply-To (000701c3af38$5965a990$50c8a8c0-Odin)
On Thu, Nov 20, 2003 at 06:31:44PM +1100, Franck Magron wrote:

> While importing PDF files I was getting annoying messages saying
> Converting x1002.pdf to HTML format
> PDFPlug: WARNING: I:\Program Files\gsdl\collect\test\tmp\x1002.html was
> read using utf8 encoding but appears to be encoded as iso_8859_1.

This message appears when the automatically detected encoding (as guessed
by "textcat", using probabilistic models) differs from the specified
encoding (as claimed by PDFPlug, which passes utf-8). PDFPlug sets the
encoding to utf-8, which is correct, but doesn't set the language, so
textcat is used to guess the language (and, along with it, an encoding).
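
As a rough illustration of that check (this is not the actual Greenstone
code, which is Perl; the function name and arguments here are hypothetical),
the comparison amounts to something like this Python sketch:

```python
def encoding_warning(filename, claimed, detected, plugin="PDFPlug"):
    """Return a warning string if the guessed encoding differs from the
    claimed one, otherwise None. Hypothetical helper for illustration only."""
    if detected == claimed:
        return None
    return (f"{plugin}: WARNING: {filename} was read using {claimed} "
            f"encoding but appears to be encoded as {detected}.")


# The plugin claims utf8, the guesser reports iso_8859_1: a warning results.
print(encoding_warning("x1002.html", "utf8", "iso_8859_1"))
```

The point is only that the warning comes from a mismatch between two labels,
not from any failure to read the file.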

Now, the problem is that textcat doesn't have a model for
English/utf-8, since English characters are all ASCII... the closest
match for text like this is english/iso-8859-1. So BasPlug warns that
the detected encoding seems to be different, but no harm is done.
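
To see why no harm is done, here is a small Python sketch (the helper name
is made up for this example) showing that pure-ASCII text produces exactly
the same bytes in utf-8 and iso-8859-1, so a guesser cannot tell them apart:

```python
def encodes_identically(text):
    """True if text yields the same bytes in utf-8 and iso-8859-1.
    Hypothetical helper, for illustration only."""
    try:
        return text.encode("utf-8") == text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # The text contains characters that iso-8859-1 cannot represent.
        return False


print(encodes_identically("Converting x1002.pdf to HTML format"))
print(encodes_identically("caf\u00e9"))  # e-acute is encoded differently
```

For plain English text, either label describes the same bytes.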

Maybe we could eventually add another model for english/utf-8, but it
isn't really causing a problem at the moment other than a warning message.

(I guess part of it is that the textcat models don't account for the fact
that while English letters are all ascii/iso-8859-1, there are unicode
symbols like single/double quotes and non-breaking spaces that are utf-8.)
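
For reference, a short Python sketch of how those symbols look as utf-8
bytes (the helper name is invented for this example; the code points are
the standard Unicode ones for these characters):

```python
def utf8_hex(ch):
    """Return the utf-8 encoding of ch as a hex string (illustrative helper)."""
    return ch.encode("utf-8").hex()


symbols = {
    "right single quote (U+2019)": "\u2019",
    "left double quote (U+201C)": "\u201c",
    "non-breaking space (U+00A0)": "\u00a0",
}

# Each of these takes more than one byte in utf-8, unlike ASCII letters.
for name, ch in symbols.items():
    print(f"{name}: {utf8_hex(ch)}")
```

So a document that is almost entirely ASCII can still contain a handful of
multi-byte utf-8 sequences.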

John McPherson