RE: [greenstone-devel] Word -> HTML conversion failures

From: Emanuel Dejanu / Simple Words
Date: Wed, 28 Apr 2004 11:41:35 +0300
Subject: RE: [greenstone-devel] Word -> HTML conversion failures
In-Reply-To: (408EFB8A-B36F0033-cs-waikato-ac-nz)

To stay within the open source context, I think OpenOffice is a good
alternative for converting Word documents.
The HTML files from OpenOffice are much cleaner than the HTML files from
Another point for OpenOffice is that it runs on Unix-like systems.

What would be good is a macro that converts all .doc and .rtf files in a
given directory.
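
The batch conversion suggested above could be sketched roughly as follows.
This is only a minimal sketch in Python rather than an OpenOffice macro, and
it assumes an office suite whose "soffice" binary is on the PATH and accepts
LibreOffice-style "--headless --convert-to html --outdir" options; older
OpenOffice builds may not support these. The function names are made up for
illustration.

```python
import subprocess
from pathlib import Path

# Extensions treated as Word-style input, following the suggestion above.
WORD_EXTENSIONS = {".doc", ".rtf"}


def find_word_files(directory):
    """Return the .doc/.rtf files directly inside `directory`, sorted."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in WORD_EXTENSIONS
    )


def build_command(source, outdir, soffice="soffice"):
    """Build the headless conversion command for one file.

    Assumption: `soffice` understands LibreOffice-style options.
    """
    return [soffice, "--headless", "--convert-to", "html",
            "--outdir", str(outdir), str(source)]


def convert_directory(directory, outdir, soffice="soffice"):
    """Convert every file found; return the paths whose command failed.

    A non-zero exit code is only a rough failure signal, so the output
    directory should still be checked afterwards.
    """
    Path(outdir).mkdir(parents=True, exist_ok=True)
    failed = []
    for source in find_word_files(directory):
        result = subprocess.run(build_command(source, outdir, soffice))
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Calling convert_directory("incoming", "html_out") would then write one HTML
file per document into html_out; both directory names are placeholders.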

Best regards,

Emanuel Dejanu

-----Original Message-----
On Behalf Of Michael Dewsnip
Sent: Wednesday, April 28, 2004 3:32 AM
To: Doug Carter
Cc: Greenstone Mailing List
Subject: Re: [greenstone-devel] Word -> HTML conversion failures

Hi Doug,

Yes, this is becoming increasingly problematic. Whenever we run Greenstone
workshops we always ask the participants to bring along some files to build
collections on their own data at the end of the day. These almost always end
up being Word or PDF files, and it is disappointing to see how many of these
aren't processed by Greenstone (especially lately).

Of course, this isn't Greenstone's fault, but it doesn't impress potential
users of Greenstone much! We have to rely on other open source projects for
all sorts of different things, and Word and PDF to HTML converters are just
one example.
We don't have the time or the interest to write these complex, and quite
important, parts of the overall Greenstone system ourselves.

However, we should definitely be keeping up with the latest versions of
these converters, and this is something we've neglected a bit lately. This
has been on my "to do" list for a while, but is rapidly getting more
important. I hope to do this within the next few weeks; certainly before the
next release.

As you point out though, we must also rely on the converters like wvWare
being updated regularly to cope with new Word versions, etc. If this stops
happening, I don't know what we'll do. I don't know of any other open source
Word -> HTML converters; maybe someone else does?

Ultimately, however, these converters are never going to be perfect. If you
really wanted the best quality HTML from your Word documents, I think you'd
have to open them up in Word itself and save them as HTML. Depending on how
many documents you've got, it might be possible to get someone to do this
for you.
Or, could you use something like Visual Basic and automate this process?



Doug Carter wrote:

> Hi all,
> I'm seeing an increasing number of Word -> HTML conversion failures. This is
> mostly due to the increased complexity of the Word documents,
> especially those that have mixed page layout schemes, lots of macros, or
> complex tables.
> Most often the errors are like this:
> Error executing wv converter:
> ** WARNING **: Invalid seek
> Diagnostic: (./escher.c:631) Eating type 0xf122
> Diagnostic: (./escher.c:53) Not a container, panic (200)
> Diagnostic: (./escher.c:443) Damn found nothing
> Diagnostic: (./wvWare.c:591) Strange No Graphic Data in the 0x01/0x08 graphic
> Other times, the wv converter dies with a segmentation fault, or I
> don't get any output at all.
> I understand that the conversion program that comes with Greenstone
> (wvWare) is rather old, and is at version 0.71. I have built and tried
> using version 1.0, with some limited success, but it also fails on
> several documents. Looking at the wvWare website, I'm not sure that
> anyone is actively maintaining this conversion utility.
> Does anyone have any experience with a more robust Word -> HTML converter?
> I can only see this problem getting worse, as documents continue to
> get more complex and/or newer versions of MS Word are released.
> TIA,
> Doug Carter
> Mercy Corps
> _______________________________________________
> greenstone-devel mailing list
