From Guillaume Hatt
Date Sat Jun 27 01:58:19 2009
I don't have answers to all these questions, but we have built a
collection of PDF files + OCR + metadata with pdftk and Greenstone. You
can see the result at page
If you have other questions, I can try to answer.

For your questions, I think:
1. No, search terms are highlighted only in the text version (you can try it on our site).
2. No, you must build it with JavaScript (and I would be interested too).
3. No, the PDF files are not generated after the search.
Does anybody have other answers? I would be glad to be proven wrong.
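
On question 2, the three Acrobat view modes come down to choosing one scale factor from the image and viewport dimensions. Below is a minimal JavaScript sketch, assuming the viewer knows both sizes in pixels. The function names `scaleFor` and `scaledSize` and the mode strings are hypothetical, and nothing here is specific to Greenstone.

```javascript
// Compute the zoom factor for a scanned page image inside a viewport.
// Mode names mirror the Acrobat options mentioned in the question;
// the strings themselves are an assumption made for this sketch.
function scaleFor(mode, image, viewport) {
  const widthRatio = viewport.width / image.width;
  const heightRatio = viewport.height / image.height;
  switch (mode) {
    case "actual-size":
      return 1;
    case "fit-width":
      return widthRatio;
    case "fit-page":
      // The whole page must stay visible, so the tighter dimension wins.
      return Math.min(widthRatio, heightRatio);
    default:
      throw new Error("unknown mode: " + mode);
  }
}

// Displayed size of the image, rounded to whole pixels.
function scaledSize(mode, image, viewport) {
  const s = scaleFor(mode, image, viewport);
  return {
    width: Math.round(image.width * s),
    height: Math.round(image.height * s),
  };
}
```

In a browser, the returned width and height could be assigned to the style of the `<img>` element that shows the page. Whether such a script can be wired into Greenstone's page templates remains the open part of the question.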

G. Hatt

PG-PG-Kriwaczek, Alexander wrote:
> Hi,
> I'm building a digital library to hold 560 scanned images of newspaper
> pages. I've completed the scanning and am about to embark on the OCR.
> With the help of OmniPage and Acrobat, I'm able to produce individual
> PDF files containing searchable high quality images, which in Acrobat
> allows search terms to be displayed directly on the scanned image, as
> described on page 182 of Witten and Bainbridge's book "How to Build a
> Digital Library". These PDF files convert successfully to HTML using the
> pdftohtml.exe application, which I believe underlies the PDF plugin.
> Before carrying out the OCR on all 560 images there are certain issues
> that I'm not entirely sure about, despite having looked at the
> documentation. I would be really grateful if someone could clarify at
> least some of the following questions so that I can have a clearer view
> of the road ahead:
> 1. If you import such a PDF file into Greenstone, will the digital
> library offer the ability to display search terms directly on the
> scanned image, as in Acrobat? I noticed that the Papers Past digital
> library has something like this facility, although it seems as if you
> cannot see the highlighted search term in full page view.
> 2. When viewing the scanned image of a full newspaper page in
> Greenstone, is it possible, in theory, to change the scale of the image
> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
> hard to build such a web-based image viewer using JavaScript, but I
> don't know whether one can incorporate this code into a Greenstone
> digital library?
> 3. If it is not possible to endow Greenstone HTML pages with the ability
> to highlight search terms directly on scanned images or to change the
> scale of images, can one instead have Greenstone open the original PDF
> files, with the help of an Acrobat viewer on the client machine? If so,
> is it possible to pre-load the search term so that it is highlighted on
> the page as soon as the viewer opens, rather than having to type in the
> term again into the Acrobat search box?
> Regards,
> Alex Kriwaczek.
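
On question 3, Adobe's PDF open parameters include a `search` parameter that takes a quoted, space-separated word list in the URL fragment, so a link to the original PDF can carry the search term with it. Here is a hedged JavaScript sketch of building such a link. The function name `pdfSearchLink` and the file name in the usage note are hypothetical. Percent-encoding the quoted list is an assumption, and viewers other than Acrobat may ignore the parameter.

```javascript
// Build a link to a PDF that asks the viewer to search for the given terms.
// "search" is one of Adobe's documented PDF open parameters; support in
// other viewers varies, so treat this as a sketch rather than a guarantee.
function pdfSearchLink(pdfUrl, terms) {
  // Drop empty terms and join the rest into one space-separated word list.
  const wordList = terms
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .join(" ");
  if (wordList.length === 0) {
    return pdfUrl; // nothing to search for, link to the plain document
  }
  // Adobe's format is #search="word1 word2"; the quoted list is
  // percent-encoded here so the result is a valid URL (an assumption).
  return pdfUrl + "#search=" + encodeURIComponent('"' + wordList + '"');
}
```

For instance, `pdfSearchLink("page001.pdf", ["gold", "rush"])` yields `page001.pdf#search=%22gold%20rush%22`. How the search terms would be passed from a Greenstone results page into such a link is not covered in this thread.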
Guillaume HATT
Informatique documentaire
Ecole nationale des chartes
19 rue de la Sorbonne
75005 Paris
Email: guillaume.hatt@enc.sorbonne.fr
Tel.: 01 55 42 75 05
