[greenstone-devel] problem with accented characters from PDF file

From Robert Sleator
Date Wed, 1 Oct 2003 17:13:22 -0700 (PDT)
Subject [greenstone-devel] problem with accented characters from PDF file



Hi,

I'm trying to build a collection that includes a Spanish-language PDF file.
After the collection builds, the PDF version of the file is fine, but if I open
the HTML version, all the accented characters are corrupted. The "doc.xml"
file in the "archives" directory of that collection shows the same corruption
that I see when viewing the HTML file in a web browser.

Environment:
Greenstone 2.38, build 2.1
latest version of pdftohtml (version 0.36) & associated files from
http://www.greenstone.org/gsdl-patch-pdfplug.zip

Input file is Spanish language PDF.

Here are corresponding snippets of two files run through "od -bc". The first is taken
from the output of pdftohtml run stand-alone on the PDF file.

0000000 122 145 165 156 151 363 156 040 120 162 145 160 141 162 141 143
          R   e   u   n   i   ó   n       P   r   e   p   a   r   a   c
0000020 151 363 156 040 171 040 120 162 145 166 145 156 143 151 363 156
          i   ó   n       y       P   r   e   v   e   n   c   i   ó   n
0000040 012
         \n
0000041

The second is taken from gsdl/collect/<collection_name>/archives/.../HASH....dir/doc.xml:

0000000 122 145 165 156 151 303 203 302 263 156 040 120 162 145 160 141
          R   e   u   n   i   Ã 203   Â   ³   n       P   r   e   p   a
0000020 162 141 143 151 303 203 302 263 156 040 171 040 120 162 145 166
          r   a   c   i   Ã 203   Â   ³   n       y       P   r   e   v
0000040 145 156 143 151 303 203 302 263 156 012
          e   n   c   i   Ã 203   Â   ³   n  \n
0000052
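
For comparison, the four-byte sequence 303 203 302 263 in doc.xml is what results when the single ISO-8859-1 byte 363 ("ó") is converted to UTF-8 and the result is then treated as ISO-8859-1 and converted to UTF-8 a second time. The Python sketch below only reproduces those bytes. It is not Greenstone code, the variable and function names are made up for illustration, and it does not show which step of the build performs the second conversion.

```python
# Byte for "ó" as it appears in the stand-alone pdftohtml output (octal 363).
latin1_bytes = "\u00f3".encode("iso-8859-1")

# Correct conversion: decode as ISO-8859-1, then encode to UTF-8 once.
utf8_once = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Hypothetical faulty step: the UTF-8 bytes are mistaken for ISO-8859-1
# text and encoded to UTF-8 again.
utf8_twice = utf8_once.decode("iso-8859-1").encode("utf-8")


def octal(data: bytes) -> str:
    """Format bytes the way "od -b" prints them."""
    return " ".join(format(b, "03o") for b in data)


print(octal(latin1_bytes))  # 363
print(octal(utf8_once))     # 303 263
print(octal(utf8_twice))    # 303 203 302 263
```

The last line matches the bytes seen in doc.xml, and the first matches the stand-alone pdftohtml output.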

The log file has the following warning:

Converting <filename>.pdf to HTML format
PDFPlug: WARNING: /usr/local/gsdl/collect/refnres/tmp/<filename>.html was read using utf8 encoding
but appears to be encoded as iso_8859_1.
PDFPlug: passing <filename>.pdf on to HTMLPlug
HTMLPlug: processing <filename>.html
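
As an illustration of what the warning is describing, the bytes from the stand-alone pdftohtml output are not valid UTF-8 but do decode cleanly as ISO-8859-1. The sketch below is in Python and uses a hypothetical helper name. It is an assumption about the kind of check involved, not the test that PDFPlug actually performs.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# First seven bytes of the stand-alone pdftohtml output: R e u n i 363 n
sample = bytes([0o122, 0o145, 0o165, 0o156, 0o151, 0o363, 0o156])

# 363 (0xF3) would start a multi-byte UTF-8 sequence, but the next byte
# is a plain "n", so decoding as UTF-8 fails.
print(is_valid_utf8(sample))  # False

# Every byte value is defined in ISO-8859-1, so this decode always succeeds.
text = sample.decode("iso-8859-1")
print(text == "Reuni\u00f3n")  # True
```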

Can anyone suggest what is corrupting the characters during the build process?
And can anyone give me a better idea of precisely what transformations are
performed on the PDF file during import and build? Where does the HTML
document that is displayed in the browser come from? Is it generated
on the fly from doc.xml? If not, where is it stored?

Thanks.

Robert Sleator