Re: [greenstone-devel] Word -> HTML conversion failures

From Doug Carter
Date Thu, 29 Apr 2004 10:02:33 -0700
Subject Re: [greenstone-devel] Word -> HTML conversion failures
In-Reply-To (408EFB8A-B36F0033-cs-waikato-ac-nz)

Using Windows tools for this conversion (such as Visual Basic) isn't
really an option for us, because we build and deliver Greenstone on
Linux. Saving the documents as HTML from Word is not an option either,
because of the poor HTML that Word produces.

One way around this mess would be for Greenstone to include the source
document even if it doesn't convert. That way, the Word document would
still be available in the library (for download/viewing); there just
wouldn't be an HTML version of the document in the library.
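
To make the idea concrete, here is a rough sketch of that fallback in
Python. It is not Greenstone's actual import code: LibraryEntry,
run_converter and import_document are made-up names, and the converter
command is whatever you pass in. It only illustrates "keep the source,
attach HTML when the converter succeeds".

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass
class LibraryEntry:
    """Hypothetical record for one document in the library."""
    source: Path           # original file, always kept for download
    html: Optional[str]    # converted HTML, or None if conversion failed
    note: str = ""         # reason there is no HTML version, if any


def run_converter(command: Sequence[str], source: Path) -> str:
    """Run an external converter on `source` and return its stdout.

    Raises RuntimeError on a non-zero exit status (on POSIX a negative
    status means the process was killed by a signal, e.g. a segfault)
    or when the converter exits cleanly but writes nothing.
    """
    result = subprocess.run(
        list(command) + [str(source)],
        capture_output=True,
        text=True,
        timeout=120,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"converter exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    if not result.stdout.strip():
        raise RuntimeError("converter produced no output")
    return result.stdout


def import_document(source: Path,
                    convert: Callable[[Path], str]) -> LibraryEntry:
    """Always keep the source; add HTML only if conversion works."""
    try:
        return LibraryEntry(source=source, html=convert(source))
    except Exception as exc:  # missing binary, timeout, crash, empty output
        return LibraryEntry(source=source, html=None, note=str(exc))
```

With something like this, a failed conversion still yields an entry
whose html is None, so the library can list the document and offer the
original file for download.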

In fact, we have many documents that are forms or templates. In those
cases, the end user will always want the source document (Word, Excel,
etc.) and would never want the converted HTML document.

How difficult would this be to implement?

Let me know if you need more info.



On Wed, Apr 28, 2004 at 12:32:10PM +1200, Michael Dewsnip wrote:
> Hi Doug,
> Yes, this is becoming increasingly problematic. Whenever we run Greenstone
> workshops we always ask the participants to bring along some files to build
> collections on their own data at the end of the day. These almost always end up
> being Word or PDF files, and it is disappointing to see how many of these aren't
> processed by Greenstone (especially lately).
> Of course, this isn't Greenstone's fault, but it doesn't impress potential users
> of Greenstone much! We have to rely on other open source projects for all sorts
> of different things, and Word and PDF to HTML converters are just one example.
> We don't have the time or the interest to write these complex, and quite
> important, parts of the overall Greenstone system ourselves.
> However, we should definitely be keeping up with the latest versions of these
> converters, and this is something we've neglected a bit lately. This has been on
> my "to do" list for a while, but is rapidly getting more important. I hope to do
> this within the next few weeks; certainly before the next release.
> As you point out though, we must also rely on the converters like wvWare being
> updated regularly to cope with new Word versions, etc. If this stops happening,
> I don't know what we'll do. I don't know of any other open source Word -> HTML
> converters, maybe someone else does?
> Ultimately, however, these converters are never going to be perfect. If you
> really wanted the best quality HTML from your Word documents, I think you'd have
> to open them up in Word itself and save them as HTML. Depending on how many
> documents you've got, it might be possible to get someone to do this for you.
> Or, could you use something like Visual Basic and automate this process?
> Regards,
> Michael
> Doug Carter wrote:
> > Hi all,
> >
> > I'm seeing an increasing number of Word -> HTML conversion failures.
> > This is mostly due to the increased complexity of the Word documents,
> > especially those that have mixed page layout schemes, lots of macros or
> > complex tables.
> >
> > Most often the errors are like this:
> >
> > Error executing wv converter:
> > ** WARNING **: Invalid seek
> > Diagnostic: (./escher.c:631) Eating type 0xf122
> > Diagnostic: (./escher.c:53) Not a container, panic (200)
> > Diagnostic: (./escher.c:443) Damn found nothing
> > Diagnostic: (./wvWare.c:591) Strange No Graphic Data in the 0x01/0x08 graphic
> >
> > Other times, the wv converter dies with a segmentation fault, or I don't
> > get any output at all.
> >
> > I understand that the conversion program that comes with Greenstone
> > (wvWare) is rather old, and is at version 0.71. I have built and tried
> > using version 1.0, with some measured success, but it also fails on
> > several documents. Looking at the wvWare website, I'm not sure that
> > anyone is actively maintaining this conversion utility.
> >
> > Does anyone have any experience with a more robust Word -> HTML converter?
> > I can only see this problem getting worse, as documents continue to get
> > more complex and/or newer versions of MS Word are released.
> >
> > TIA,
> >
> > Doug Carter
> > Mercy Corps