Re: [greenstone-devel] doc.xml plugin errors

From John R. McPherson
Date Fri, 16 Jan 2004 11:29:47 +1300
Subject Re: [greenstone-devel] doc.xml plugin errors
In-Reply-To (400702EE-FC7E6D17-cs-waikato-ac-nz)
Michael Dewsnip wrote:
> Hi Doug,
>
> Yes, the error message isn't very helpful in determining which files are bad, is
> it?
>
> Attached is the patch (zipped) - just unzip it and replace your existing
> bin/script/pdftohtml.pl with the new one, then re-import and re-build your
> collection.


> Doug Carter wrote:
>>Michael,
>>
>>Yes I am building collections with PDF files, but I have no idea if
>>these strange characters are causing the problem, because I'm not
>>sure which files they are referring to.
>>
>>So yes, could you please send me the patch?
>>
>>Best,
>>
>>Doug
>>
>>On Fri, Jan 16, 2004 at 09:35:19AM +1300, Michael Dewsnip wrote:
>>
>>>Hi Doug,
>>>
>>>Are you by any chance building collections containing PDF files? There is a
>>>problem with the handling of certain PDF files which causes two strange
>>>characters to be added into the Title metadata extracted from these files.
>>>This means the doc.xml files fail to parse correctly, and during building you
>>>get the error you report.

The problem here is two bytes at the start of the extracted text, bytes
0xFE and 0xFF. Anyone familiar with Unicode should recognise this pair
as the special "endian" marker, the byte order mark.
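
To see why those two bytes break the build: neither 0xFE nor 0xFF can
ever occur in well-formed UTF-8, so an XML parser that expects UTF-8
rejects the file. A minimal Python sketch, assuming the bytes end up
verbatim in the Title metadata. The element name and title below are
made-up stand-ins, not Greenstone's actual doc.xml layout:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one metadata element of a doc.xml file.
GOOD = (b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<Metadata name="Title">Annual Report</Metadata>')
# Same element with the two stray bytes in front of the title text.
BAD = GOOD.replace(b'Annual', b'\xfe\xffAnnual')


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def parses_as_xml(data: bytes) -> bool:
    """Return True if the bytes are accepted by the XML parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


if __name__ == '__main__':
    for label, data in (('good', GOOD), ('bad', BAD)):
        print(label, is_valid_utf8(data), parses_as_xml(data))
```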

In the Unicode standard, U+FEFF is a zero-width non-breaking space
(which sounds pretty useless :p), but it is also used as a "byte order
mark" if it is at the very start of a document. U+FFFE, the same two
bytes swapped, is not a valid Unicode character.

Of such non-characters, the Unicode 4.0 spec says "These codes are
intended for process internal uses, but are not permitted for interchange."

The idea is that if the first 2 bytes of a 16-bit Unicode stream are
<ff><fe>, then the software can tell that this is not a valid character
as read, and therefore that the file is using the opposite byte order.
It seems to have become common for some software to put either <ff><fe>
or <fe><ff> at the start of documents to indicate byte order, even
though <fe><ff> is a valid character.
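
The byte order check described above can be sketched in a few lines of
Python. This is only an illustration of the idea, assuming a 16-bit
stream that starts with one of the two markers; the function names are
made up for the example:

```python
def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of a 16-bit Unicode stream from its first 2 bytes."""
    if data[:2] == b'\xfe\xff':
        return 'big-endian'      # U+FEFF read in the stream's own order
    if data[:2] == b'\xff\xfe':
        return 'little-endian'   # would read as U+FFFE, so bytes are swapped
    return 'unknown'             # no marker present


def decode_with_marker(data: bytes) -> str:
    """Drop the marker and decode the rest using the detected byte order."""
    order = detect_byte_order(data)
    if order == 'big-endian':
        return data[2:].decode('utf-16-be')
    if order == 'little-endian':
        return data[2:].decode('utf-16-le')
    raise ValueError('no byte order marker at start of data')


if __name__ == '__main__':
    print(decode_with_marker(b'\xfe\xff\x00H\x00i'))
    print(decode_with_marker(b'\xff\xfeH\x00i\x00'))
```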

Anyway, to relate this back to our problem, either pdftohtml needs to
be updated to behave like these other apps, detecting and removing those
bytes if they exist, or the other apps should follow the strict letter
of the standard.

Since pdftohtml is spitting out UTF-8, it doesn't need to output the
byte order mark anyway, so I'm sure this problem will eventually be
fixed by the pdftohtml authors. For now we handle it in our
pdftohtml.pl script.
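
For anyone curious what that kind of workaround looks like, here is a
rough Python sketch. It is not the actual patch (that lives in the Perl
script pdftohtml.pl and isn't reproduced here); the function name is
made up, and stripping <ff><fe> as well as <fe><ff> is an assumption on
my part:

```python
# The two byte sequences that can show up as a byte order marker.
# Only <fe><ff> was seen in the failing files; <ff><fe> is included
# here as an assumption, for symmetry.
MARKERS = (b'\xfe\xff', b'\xff\xfe')


def strip_leading_marker(text: bytes) -> bytes:
    """Remove a byte order marker if, and only if, it starts the text."""
    for marker in MARKERS:
        if text.startswith(marker):
            return text[len(marker):]
    return text


if __name__ == '__main__':
    # Illustrative title value, not taken from a real collection.
    print(strip_leading_marker(b'\xfe\xffAnnual Report'))
```

Only the very start of the text is checked, so everything after the
first two bytes is passed through untouched.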

(See http://www.unicode.org/unicode/faq/utf_bom.html)

Hope this clarifies the problem for anyone who cares....

John