Re: [greenstone-devel] Word -> HTML conversion failures

From: Michael Dewsnip
Date: Wed, 28 Apr 2004 12:32:10 +1200
Subject: Re: [greenstone-devel] Word -> HTML conversion failures
In-Reply-To: (20040422182939-GA28388-mercycorps-org)

Hi Doug,

Yes, this is becoming increasingly problematic. Whenever we run Greenstone
workshops we always ask the participants to bring along some files to build
collections on their own data at the end of the day. These almost always end up
being Word or PDF files, and it is disappointing to see how many of these aren't
processed by Greenstone (especially lately).

Of course, this isn't Greenstone's fault, but it doesn't impress potential users
of Greenstone much! We have to rely on other open source projects for all sorts
of different things, and Word and PDF to HTML converters are just one example.
We don't have the time or the interest to write these complex, and quite
important, parts of the overall Greenstone system ourselves.

However, we should definitely be keeping up with the latest versions of these
converters, and this is something we've neglected a bit lately. This has been on
my "to do" list for a while, but is rapidly getting more important. I hope to do
this within the next few weeks; certainly before the next release.

As you point out though, we must also rely on the converters like wvWare being
updated regularly to cope with new Word versions, etc. If this stops happening,
I don't know what we'll do. I don't know of any other open source Word -> HTML
converters; maybe someone else does?

Ultimately, however, these converters are never going to be perfect. If you
really wanted the best quality HTML from your Word documents, I think you'd have
to open them up in Word itself and save them as HTML. Depending on how many
documents you've got, it might be possible to get someone to do this for you.
Or, could you use something like Visual Basic and automate this process?
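
As a rough illustration of that kind of automation, here is a minimal sketch
in Python rather than Visual Basic. It assumes a Windows machine with Word
installed and the pywin32 package available, and it drives Word through its
COM interface to open each document and save it as HTML. The function names
and folder layout are hypothetical, and this is not something Greenstone
provides.

```python
from pathlib import Path

# Word's WdSaveFormat value for wdFormatHTML (plain "Save as Web Page").
WD_FORMAT_HTML = 8


def html_target(doc_path, out_dir):
    """Return the .html path in out_dir that corresponds to doc_path."""
    return Path(out_dir) / (Path(doc_path).stem + ".html")


def convert_all(src_dir, out_dir):
    """Open each .doc file in src_dir in Word and save it as HTML in out_dir.

    Returns a list of (path, error message) pairs for documents that failed,
    so that one bad document does not stop the whole batch.
    """
    # Imported here because pywin32 is only available on Windows.
    import win32com.client

    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = []

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word expects absolute paths, hence resolve().
        for doc_path in sorted(Path(src_dir).resolve().glob("*.doc")):
            try:
                doc = word.Documents.Open(str(doc_path), ReadOnly=True)
                try:
                    doc.SaveAs(str(html_target(doc_path, out_dir)),
                               FileFormat=WD_FORMAT_HTML)
                finally:
                    # Close without saving changes to the original file.
                    doc.Close(False)
            except Exception as exc:
                failures.append((doc_path, str(exc)))
    finally:
        word.Quit()
    return failures
```

Calling convert_all("word_docs", "html_out") would, under those assumptions,
leave one HTML file per Word document in html_out and return the documents
that Word itself could not open or save. The resulting HTML files could then
be imported into a collection in place of the original Word files.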

Regards,

Michael

Doug Carter wrote:

> Hi all,
>
> I'm seeing an increase in Word -> HTML conversion failures. This is mostly
> due to the increased complexity of the Word documents, especially those
> that have mixed page layout schemes, lots of macros or complex tables.
>
> Most often the errors are like this:
>
> Error executing wv converter:
> ** WARNING **: Invalid seek
> Diagnostic: (./escher.c:631) Eating type 0xf122
> Diagnostic: (./escher.c:53) Not a container, panic (200)
> Diagnostic: (./escher.c:443) Damn found nothing
> Diagnostic: (./wvWare.c:591) Strange No Graphic Data in the 0x01/0x08 graphic
>
> Other times, the wv converter dies with a segmentation fault, or I don't
> get any output at all.
>
> I understand that the conversion program that comes with Greenstone
> (wvWare) is rather old, and is at version 0.71. I have built and tried
> using version 1.0, with some measured success, but it also fails on
> several documents. Looking at the wvWare website, I'm not sure that
> anyone is actively maintaining this conversion utility.
>
> Does anyone have any experience with a more robust Word -> HTML converter?
> I can only see this problem getting worse, as documents continue to get
> more complex and/or newer versions of MS Word are released.
>
> TIA,
>
> Doug Carter
> Mercy Corps