[greenstone-users] 2.82 has a drawback with Word documents and bundled wvWare 0.7.1

From Mariana Pichinini
Date Tue Jun 30 09:27:18 2009
Subject [greenstone-users] 2.82 has a drawback with Word documents and bundled wvWare 0.7.1
Hi everyone,

We would like to share a (now solved) issue we had when migrating to
Greenstone 2.82, in the hope that it will be helpful.
We downloaded version 2.82 for a Debian Linux server and began testing it
some days ago, calmly planning to replace and retire our good old 2.74.
Until now we had no trouble with .doc documents, which account for most
of our collection sources. But while testing 2.82 we found an issue.
We got this:

**** Error is:
not well-formed (invalid token) at line 8293, column 33, byte 607053 at
/usr/lib/perl5/XML/Parser.pm line 187.

This error occurred, by our estimate, for about half of all our
Word-sourced documents (quite a lot, I'd say).

If we go to the line and column reported in the message, we find there a
character that is not valid in UTF-8.
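
For anyone who wants to look for such bytes without waiting for the parser to stop, here is a minimal Python sketch. It is not part of Greenstone; the function name `find_invalid_utf8` and the sample bytes are our own assumptions for illustration. It reports the byte offset, the line and the 0-based byte column of each byte that cannot be decoded as UTF-8.

```python
def find_invalid_utf8(data: bytes):
    """Yield (byte_offset, line, column, bad_byte) for each undecodable byte.

    line is 1-based; column is a 0-based byte offset within the line.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start
            line = data.count(b"\n", 0, bad) + 1
            column = bad - (data.rfind(b"\n", 0, bad) + 1)
            yield bad, line, column, data[bad]
            pos = bad + 1  # resume scanning just after the bad byte


# Hypothetical sample: a style name written as Latin-1 inside otherwise
# plain ASCII markup (0xED is "í" in Latin-1, but invalid here as UTF-8).
sample = b'<p>ok</p>\n<div name="sangr\xeda">\n'
for offset, line, column, byte in find_invalid_utf8(sample):
    print(f"byte {offset}, line {line}, column {column}: 0x{byte:02X}")
```

To check a real file, read it in binary mode (for example `open(path, "rb").read()`) and pass the bytes to the function.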

Furthermore, the offending lines always look like this:

<p><div name="Cuerpo de texto con sangr&#65533;a" align="left" style="
padding: 0.00mm 0.00mm 0.00mm 0.00mm; ">

We know the invalid character comes from the style section of the
original Word document, since "Cuerpo de texto con sangría" is the name of
a paragraph style in the OpenOffice / MS Word world.

So the problem arises when the .doc is imported into the XML file. We
should note that there is no problem in converting the body of the
document. Now we reach the point.
We found all the offending characters in the "name=" HTML attribute of
<div> tags, in the <content> section of doc.xml. So we followed the
build process more closely, and we found that if we replace the bundled
wvWare with a soft link to the system version we have in /usr/bin
(that is actually the wvWare our Greenstone 2.74 uses), the issue goes away.
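
The replacement itself could be scripted roughly as below. This is only a sketch under assumptions: the function name `link_system_binary` and the ".bundled" backup suffix are ours, and both paths must be supplied by you, since the location of the bundled wvWare depends on your installation.

```python
import shutil
from pathlib import Path


def link_system_binary(bundled: Path, system: Path) -> Path:
    """Replace `bundled` with a symlink to `system`.

    A real file at `bundled` is first moved aside to `<name>.bundled`;
    an existing symlink is simply removed. Returns the backup path
    (it exists only if a real file was moved aside).
    """
    if not system.is_file():
        raise FileNotFoundError(system)
    backup = bundled.with_name(bundled.name + ".bundled")
    if bundled.is_symlink():
        bundled.unlink()
    elif bundled.exists():
        shutil.move(str(bundled), str(backup))
    bundled.symlink_to(system)
    return backup


# Hypothetical call; both paths are placeholders, adjust them to your setup:
# link_system_binary(Path("/path/to/greenstone/bundled/wvWare"),
#                    Path("/usr/bin/wvWare"))
```

Keeping the original binary as a backup makes it easy to go back to the bundled version later.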
We can now conclude that the wvWare bundled with 2.82 (v0.7.1) copies every
style name as a "name" attribute into the corresponding DIV tag of the
resulting HTML version, but in doing so it does NOT convert the encoding.
So if the "name" contains characters outside the common ASCII core... ArgH!
You get an error, the document is not indexed, and so it cannot be
found by any conceivable search.
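
The failure can be reproduced outside Greenstone with a small Python sketch. Perl's XML::Parser and Python's `xml.parsers.expat` are both built on the expat library, so the same kind of "not well-formed" rejection is expected. The XML fragment is a simplified stand-in for doc.xml, and our guess that the style name arrives as Latin-1 bytes is an assumption, not something we verified in wvWare.

```python
import xml.parsers.expat


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if expat can parse the bytes as a complete XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


style = "Cuerpo de texto con sangría"
# Simplified, hypothetical stand-in for the fragment wvWare produces.
template = ('<?xml version="1.0" encoding="UTF-8"?>'
            '<doc><div name="{}">texto</div></doc>')

good = template.format(style).encode("utf-8")    # style name properly encoded
bad = template.format(style).encode("latin-1")   # declared UTF-8, bytes are not

print("UTF-8 style name parses:", is_well_formed(good))
print("Latin-1 style name parses:", is_well_formed(bad))
```

A single unconverted byte inside the attribute is enough for the parser to reject the whole document, which matches what we saw: the body converts fine, yet the document is lost.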
As stated, we are actually using wvWare 1.2.4. It seems that the issue
is solved because this version does not put any "name" attribute in
HTML tags.
That said, we cannot confirm whether wvWare 1.2.4, if it did for any reason
emit such "name" attributes or anything similar in the HTML source, would
encode them properly, as it does with the document content.
This could be a bug in wvWare, and in any case it can block
building collections from .doc files, hence the present report.

Best regards

Lic. Mariana Pichinini
Enrique Merle
BIBHUMA - Biblioteca Profesor Guillermo Obiols
Facultad de Humanidades y Ciencias de la Educación
Universidad Nacional de La Plata
Calle 48 entre 6 y 7 - 1er subsuelo
B1900AMW LA PLATA, Argentina
Telefax: +54-221-4230125 interno 162 (líneas rotativas)
WEB: www.bibhuma.fahce.unlp.edu.ar