[greenstone-devel] Re: Word -> HTML conversion failures

From spdegabrielle
Date Fri, 30 Apr 2004 15:43:47 +0900
Subject [greenstone-devel] Re: Word -> HTML conversion failures
Another option is to index whatever text/HTML you get from the source document (no matter how broken or ugly) but link to the original using [srclink][Title][/srclink] in your 'format' strings. That way your end users never see the messy failed conversion to HTML, but they still get the biggest benefit of Greenstone: a good-quality, fast and cheap full-text search of their collection/archive.

I have used this with OCR'd text hidden behind page images in PDFs, because uncorrected OCR'd text is extremely ugly. I'd get heaps of warnings, but it did survive. It even returned page numbers!
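
To make the [srclink] idea concrete, a format statement along these lines in the collection's collect.cfg should do it. This is only a sketch from memory and untested: I'm assuming you want it on the search results list (SearchVList), and the table markup and the [srcicon] cell are just my choices for illustration, so adapt it to whatever format strings your collection already has.

```
# Search results: the icon and the title both link to the original source file,
# so the converted HTML is indexed but never shown to the user.
format SearchVList "<td valign=top>[srclink][srcicon][/srclink]</td><td>[srclink][Title][/srclink]</td>"
```

The same string can go on the classifier lists (for example a VList statement) if you want browsing to go to the original as well.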

I also found this tool, which batch-converts documents of many types using Open Office:
http://kosh.datateamsys.com/~danny/OOo/Tools/
(It converts to PDF too.)

The extreme sports/bungy jumping option is to write your own plugin...

s.

> browsing), and you don't
> have any text to index (so you can't get to the document by
> searching).