|Date||Fri, 30 Apr 2004 15:43:47 +0900|
|Subject||[greenstone-devel] Re: Word -> HTML conversion failures|
|Another option is to index whatever text/HTML you get from the source document (no matter how broken or ugly) but link to the original using [srclink][Title][/srclink] in your 'format' strings. That way your end users never see the messy failed conversion to HTML, but still get the biggest benefit of Greenstone: a quality, fast and cheap full-text search of their collection/archive.
I have used this with OCR'd text hidden behind page images in PDFs, because uncorrected OCR'd text is extremely ugly. I'd get heaps of warnings, but it did survive. It even returned page numbers!
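As a rough sketch only, a format statement along these lines in collect.cfg would make the title in the search results open the original file instead of the converted HTML. The choice of SearchVList and the table-cell markup are my assumptions about your setup, so adapt them to whichever lists and classifiers your collection actually uses:

```
# Assumption: search results are formatted through SearchVList.
# [srclink]...[/srclink] wraps the title in a link to the original source document.
format SearchVList "<td valign=top>[srclink][Title][/srclink]</td>"
```

The converted text is still what gets indexed; only the link target changes, so the ugly HTML stays out of sight.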
I also found this tool to convert batches of documents of many types with Open Office:
The extreme sports/bungy jumping option is to write your own plugin...