[greenstone-devel] PDFPlug vs pdftohtml encoding scheme

From Franck Magron
Date Thu, 20 Nov 2003 18:31:44 +1100
Subject [greenstone-devel] PDFPlug vs pdftohtml encoding scheme
I am using the Win32 version of Greenstone 2.40a.

While importing PDF files, I was getting annoying messages like this:

Converting x1002.pdf to HTML format
PDFPlug: WARNING: I:\Program Files\gsdl\collect\test\tmp\x1002.html was
read using utf8 encoding but appears to be encoded as iso_8859_1.

While digging into PDFPlug.pm, I found the following code:

# pdftohtml will always produce html files encoded as utf-8
if ($self->{'input_encoding'} eq "auto") {
    $self->{'input_encoding'} = "utf8";
    $self->{'extract_language'} = 1;
}

But in fact the -enc option of pdftohtml.pl is not used:
my $return_value=system("$ppthtml_binary "$input_ppt" >

And in this case the default encoding is Latin1:
GlobalParams.cc of pdftohtml 0.34 source: textEncoding = new
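
To illustrate why a Latin1 file trips a reader that expects utf8, here is a minimal sketch using Perl's core Encode module. It is not how Greenstone performs its own check; the helper name looks_like_utf8 and the sample text are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if the byte string decodes cleanly
# as strict UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# The same text ("cafe" with an acute e) encoded two ways.
my $latin1_bytes = encode('ISO-8859-1', "caf\x{e9}");   # final byte is e9
my $utf8_bytes   = encode('UTF-8',      "caf\x{e9}");   # final bytes are c3 a9

printf "latin1 bytes valid as utf8? %d\n", looks_like_utf8($latin1_bytes);
printf "utf8 bytes valid as utf8?   %d\n", looks_like_utf8($utf8_bytes);
```

A lone e9 byte is not a complete UTF-8 sequence, so any accented character in Latin1 output is enough to make the file look wrongly encoded when it is read as utf8.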

So I got rid of the warning message by replacing utf8 by iso_8859_1 in

Maybe it would be better to use the -enc option of pdftohtml?
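
A rough sketch of what that could look like, assuming the installed pdftohtml accepts "UTF-8" as a value for -enc (I have not checked this against 0.34). The function name, file names and argument order are hypothetical and not taken from pdftohtml.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for a pdftohtml call that
# requests a specific output encoding. Only the -enc option itself comes
# from the discussion above; everything else is illustrative.
sub build_pdftohtml_command {
    my ($binary, $input_pdf, $output_stem, $encoding) = @_;
    return ($binary, '-enc', $encoding, $input_pdf, $output_stem);
}

my @cmd = build_pdftohtml_command('pdftohtml', 'x1002.pdf', 'x1002', 'UTF-8');
print join(' ', @cmd), "\n";

# To actually run it, the list form of system() bypasses the shell,
# so paths containing spaces need no extra quoting:
# my $return_value = system(@cmd);
```

With the output encoding forced this way, the utf8 assumption in PDFPlug.pm would hold and the plugin would not need to be changed.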

Franck M.