[greenstone-devel] PDF files with searchable scanned images

From PG-PG-Kriwaczek, Alexander
Date Sat Jun 27 00:10:27 2009
Subject [greenstone-devel] PDF files with searchable scanned images

I'm building a digital library to hold 560 scanned images of newspaper pages. I've completed the scanning and am about to embark on the OCR. With the help of OmniPage and Acrobat, I'm able to produce individual PDF files containing searchable, high-quality images. In Acrobat, search terms are then highlighted directly on the scanned image, as described on page 182 of Witten and Bainbridge's book "How to Build a Digital Library". These PDF files convert successfully to HTML using the pdftohtml.exe application, which I believe underlies the PDF plugin.

Before carrying out the OCR on all 560 images, there are certain issues that I'm not entirely sure about, despite having looked at the documentation. I would be really grateful if someone could answer at least some of the following questions so that I can have a clearer view of the road ahead:

1. If you import such a PDF file into Greenstone, will the digital library offer the ability to display search terms directly on the scanned image, as in Acrobat? I noticed that the Papers Past digital library has something like this facility, although it seems as if you cannot see the highlighted search term in full page view.

2. When viewing the scanned image of a full newspaper page in Greenstone, is it possible, in theory, to change the scale of the image ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too hard to build such a web-based image viewer using JavaScript, but I don't know whether one can incorporate this code into a Greenstone digital library.

3. If it is not possible to endow Greenstone HTML pages with the ability to highlight search terms directly on scanned images or to change the scale of images, can one instead have Greenstone open the original PDF files, with the help of an Acrobat viewer on the client machine? If so, is it possible to pre-load the search term so that it is highlighted on the page as soon as the viewer opens, rather than having to type in the term again into the Acrobat search box?
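
To make questions 2 and 3 concrete, here is a minimal sketch of the behaviour I have in mind. It is written in Python only to show the arithmetic and the link format, and it is not Greenstone code. The function names, the pixel dimensions and the file name page001.pdf are hypothetical. The `search` open parameter is the one Adobe documents for opening PDF files from a URL; whether a given viewer or browser plugin honours it, and whether it accepts percent-encoded spaces between words, are assumptions I have not verified on every client.

```python
from urllib.parse import quote


def scale_for_mode(mode, image_w, image_h, view_w, view_h):
    """Return the zoom factor for one of the three Acrobat-style view modes.

    image_w/image_h are the scanned page size in pixels,
    view_w/view_h the size of the browser area available for it.
    """
    if mode == "actual size":
        return 1.0
    if mode == "fit width":
        return view_w / image_w
    if mode == "fit page":
        # The whole page must be visible, so the tighter dimension wins.
        return min(view_w / image_w, view_h / image_h)
    raise ValueError(f"unknown mode: {mode}")


def pdf_link_with_search(pdf_url, terms):
    """Build a link that asks the PDF viewer to search on opening.

    Uses the `search` open parameter: a quoted, space-separated word list.
    Percent-encoding the spaces is an assumption, made to keep the URL valid.
    """
    word_list = quote(" ".join(terms))
    return f'{pdf_url}#search="{word_list}"'


if __name__ == "__main__":
    # Hypothetical 2000x3000 pixel scan shown in a 1000x900 pixel area.
    for mode in ("actual size", "fit width", "fit page"):
        print(mode, scale_for_mode(mode, 2000, 3000, 1000, 900))
    print(pdf_link_with_search("page001.pdf", ["harbour", "strike"]))
```

In a Greenstone setting, the idea would be for the document link on the search results page to be generated along the lines of pdf_link_with_search, using the terms from the user's query, if the format statements allow that.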


Alex Kriwaczek.