Actually, I've been testing with Tesseract and, as you said, it is not 100%
accurate; I never expected it to be. Anyway, I'm getting satisfactory
OCR output (most documents are in Spanish) and I'm planning to write a
script to clean up the text and keep only names, dates and other important
data. I don't want to overload Greenstone with unnecessary
information, just to be able to search inside image PDFs.
I'll give gOCR and VueScan a try and see the results.
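
For the cleanup script, this is a rough sketch of the kind of thing I have
in mind, assuming Python and that the OCR output is already available as a
plain-text string. The function name extract_terms and both patterns are
hypothetical and would need tuning against real documents; in particular the
name pattern is a crude heuristic (runs of capitalized words) and will also
pick up things like place names or sentence-initial words.

```python
import re

# Hypothetical patterns for Spanish-language documents; adjust as needed.
MONTHS = (
    "enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
    "septiembre|setiembre|octubre|noviembre|diciembre"
)

# Numeric dates (12/03/1998, 12-3-98) or written dates (12 de marzo de 1998).
DATE_RE = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"
    r"|\b\d{1,2} de (?:" + MONTHS + r") de \d{4}\b",
    re.IGNORECASE,
)

# Crude name heuristic: two or more consecutive capitalized words.
NAME_RE = re.compile(
    r"\b[A-ZÁÉÍÓÚÑ][a-záéíóúñü]+(?: [A-ZÁÉÍÓÚÑ][a-záéíóúñü]+)+\b"
)


def extract_terms(ocr_text):
    """Return the dates and candidate names found in a block of OCR text."""
    # Collapse line breaks and repeated spaces left over from the OCR layout.
    text = re.sub(r"\s+", " ", ocr_text)
    return {
        "dates": DATE_RE.findall(text),
        "names": NAME_RE.findall(text),
    }
```

The idea would be to index only the returned terms in Greenstone instead of
the full OCR text, so the collection stays small but the image PDFs are
still searchable by the data that matters.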
> I believe your main choices on Linux will be gOCR and OCROPUS/
> Tesseract, two open source OCR projects. In either case, you will
> need, as part of your workflow, a pass to clean up the recognized
> text. No OCR program is 100% accurate, even on beautifully scanned
> text. Both these programs do a good job with English, and gOCR can
> be retargeted to another "roman alphabet" language easily by
> recompiling with the appropriate aspell dictionary for that language.
> You might also check out VueScan scanning software, which has OCR and
> comes with a Linux version. It's great scanning software, but I have
> not used the OCR feature.
> Hope this helps some.
> Rus Sheptak
> Research Associate
> Archaeological Research Facility
> University of California, Berkeley
Diego Nicolás Casar González
Tel: (+54) 011 5252.0810
Mobile: 15 4186.1334
Peña 2056 : Piso 7 B
Capital Federal : Argentina