Re: [greenstone-devel] Word -> HTML conversion failures

From Michael Dewsnip
Date Fri, 30 Apr 2004 12:29:38 +1200
Subject Re: [greenstone-devel] Word -> HTML conversion failures
In-Reply-To (20040429170233-GA13638-mercycorps-org)
Hi Doug,

> One way around this mess would be for Greenstone to include the source
> document even if it doesn't convert. That way, the Word document would
> still be available in the library (for download/viewing), there just
> wouldn't be an HTML version of the document in the library.
>
> In fact, there are many situations for us, where we have documents that
> are forms or templates. In those cases, the end-user will always want the
> source (Word, Excel, etc) document, and would never want the converted
> HTML document.

Yes, this is a possibility. The problem with this is that if you can't convert the
document, you can't get any metadata (so you can't do much browsing), and you don't
have any text to index (so you can't get to the document by searching).

The only thing you really know about the file is its filename -- so you could get to
the document via a "filenames" classifier or by searching by filename. Is this
sufficient?

> How difficult would this be to implement?

If having just the filename metadata is sufficient, you can use UnknownPlug to
simulate this, as we've discussed before. You would put your UnknownPlugs at the
bottom of your plugins list, so they would catch any files not processed by WordPlug
etc.

Something like:

plugin UnknownPlug -process_exp "\.doc$" -assoc_field "Source"

would catch any unprocessed .doc files and create dummy documents with Source
metadata set to the filename of the file. (Of course, you can also assign other
metadata, e.g. Title, manually using metadata.xml files if you wish.)
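
As a rough illustration of the metadata.xml approach, a file like the following could sit in the same directory as the Word document. This is only a sketch: the filename budget-template.doc and the title text are hypothetical, and the element layout is the usual Greenstone 2 directory metadata format as best recalled, so check it against the documentation for your version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<DirectoryMetadata>
  <FileSet>
    <!-- FileName is assumed to be matched as a regular expression, so the dot is escaped -->
    <FileName>budget-template\.doc</FileName>
    <Description>
      <!-- Hypothetical title, for illustration only -->
      <Metadata name="Title">Budget template</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
```

With something like this in place, the dummy document created by UnknownPlug should carry a Title as well as its Source filename, which gives classifiers a little more to work with.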

Hope this helps,

Regards,

Michael


> On Wed, Apr 28, 2004 at 12:32:10PM +1200, Michael Dewsnip wrote:
> > Hi Doug,
> >
> > Yes, this is becoming increasingly problematic. Whenever we run Greenstone
> > workshops we always ask the participants to bring along some files to build
> > collections on their own data at the end of the day. These almost always end up
> > being Word or PDF files, and it is disappointing to see how many of these aren't
> > processed by Greenstone (especially lately).
> >
> > Of course, this isn't Greenstone's fault, but it doesn't impress potential users
> > of Greenstone much! We have to rely on other open source projects for all sorts
> > of different things, and Word and PDF to HTML converters are just one example.
> > We don't have the time or the interest to write these complex, and quite
> > important, parts of the overall Greenstone system ourselves.
> >
> > However, we should definitely be keeping up with the latest versions of these
> > converters, and this is something we've neglected a bit lately. This has been on
> > my "to do" list for a while, but is rapidly getting more important. I hope to do
> > this within the next few weeks; certainly before the next release.
> >
> > As you point out though, we must also rely on the converters like wvWare being
> > updated regularly to cope with new Word versions, etc. If this stops happening,
> > I don't know what we'll do. I don't know of any other open source Word -> HTML
> > converters; maybe someone else does?
> >
> > Ultimately, however, these converters are never going to be perfect. If you
> > really wanted the best quality HTML from your Word documents, I think you'd have
> > to open them up in Word itself and save them as HTML. Depending on how many
> > documents you've got, it might be possible to get someone to do this for you.
> > Or, could you use something like Visual Basic and automate this process?
> >
> > Regards,
> >
> > Michael
> >
> >
> >
> > Doug Carter wrote:
> >
> > > Hi all,
> > >
> > > I'm seeing an increase in Word -> HTML conversion failures. This is mostly
> > > due to the increased complexity of the Word documents, especially those
> > > that have mixed page layout schemes, lots of macros or complex tables.
> > >
> > > Most often the errors are like this:
> > >
> > > Error executing wv converter:
> > > ** WARNING **: Invalid seek
> > > Diagnostic: (./escher.c:631) Eating type 0xf122
> > > Diagnostic: (./escher.c:53) Not a container, panic (200)
> > > Diagnostic: (./escher.c:443) Damn found nothing
> > > Diagnostic: (./wvWare.c:591) Strange No Graphic Data in the 0x01/0x08 graphic
> > >
> > > Other times, the wv converter dies with a segmentation fault, or I don't
> > > get any output at all.
> > >
> > > I understand that the conversion program that comes with Greenstone
> > > (wvWare) is rather old, and is at version 0.71. I have built and tried
> > > using version 1.0, with some measured success, but it also fails on
> > > several documents. Looking at the wvWare website, I'm not sure that
> > > anyone is actively maintaining this conversion utility.
> > >
> > > Does anyone have any experience with a more robust Word -> HTML converter?
> > > I can only see this problem getting worse, as documents continue to get
> > > more complex and/or newer versions of MS Word are released.
> > >
> > > TIA,
> > >
> > > Doug Carter
> > > Mercy Corps