Thanks for your very helpful reply.
I now see that, if I want the search terms highlighted, I'll have to display the PDF files themselves: the quality of the OCR'd text is not very good, and I don't have the time to proofread all 560 pages.
I found a way of launching a PDF file from a URL (looking something like a GET query) with a search term pre-loaded so that the Acrobat reader (or application) highlights the first occurrence of the term and the search pane is open, displaying all occurrences of the term. Perhaps I can get Greenstone to construct such URLs.
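For illustration, here is a minimal sketch of how such a URL might be put together. It assumes the URL form in question is Adobe's documented "open parameters" syntax, where the search terms travel in a fragment of the form #search="word1 word2" rather than in a true GET query. The function name and the example.org address are hypothetical, and how Greenstone would call something like this is not shown.

```python
from urllib.parse import quote


def pdf_search_url(pdf_url, terms):
    """Return pdf_url with an Acrobat-style search fragment appended.

    Assumption: the viewer honours Adobe's '#search="..."' open parameter.
    Browser-embedded and third-party viewers may ignore it.
    """
    # The word list is space-separated and wrapped in double quotes.
    # Percent-encode it so spaces and reserved characters survive in a URL.
    word_list = quote(" ".join(terms))
    return f'{pdf_url}#search="{word_list}"'


# Hypothetical address, for illustration only.
print(pdf_search_url("http://example.org/collection/page001.pdf",
                     ["harbour", "strike"]))
```

Whether the term is actually highlighted depends on the viewer installed on the client machine, so this would need testing with the Acrobat versions the readers are likely to have.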
From: Guillaume Hatt [firstname.lastname@example.org]
Sent: 26 June 2009 14:57
To: PG-PG-Kriwaczek, Alexander; email@example.com
Subject: Re: [greenstone-users] PDF files with searchable scanned images
I don't have answers to all these questions, but we have built a
collection of PDF files + OCR + metadata with pdftk and Greenstone. You
can see the result at page
If you have other questions, I can try to answer.
As for your questions, I think:
1. No, only on the text version (you can try it on our site)
3. No, the PDF files are not generated after the search.
Does anybody have other answers? I would be glad to be proven wrong.
PG-PG-Kriwaczek, Alexander wrote:
> I'm building a digital library to hold 560 scanned images of newspaper
> pages. I've completed the scanning and am about to embark on the OCR.
> With the help of OmniPage and Acrobat, I'm able to produce individual
> PDF files containing searchable high quality images, which in Acrobat
> allows search terms to be displayed directly on the scanned image, as
> described on page 182 of Witten and Bainbridge's book "How to Build a
> Digital Library". These PDF files convert successfully to HTML using the
> pdftohtml.exe application, which I believe underlies the PDF plugin.
> Before carrying out the OCR on all 560 images there are certain issues
> that I'm not entirely sure about, despite having looked at the
> documentation. I would be really grateful if someone could clarify at
> least some of the following questions so that I can have a clearer view
> of the road ahead:
> 1. If you import such a PDF file into Greenstone, will the digital
> library offer the ability to display search terms directly on the
> scanned image, as in Acrobat? I noticed that the Papers Past digital
> library has something like this facility, although it seems as if you
> cannot see the highlighted search term in full page view.
> 2. When viewing the scanned image of a full newspaper page in
> Greenstone, is it possible, in theory, to change the scale of the image
> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
> don't know whether one can incorporate this code into a Greenstone
> digital library?
> 3. If it is not possible to endow Greenstone HTML pages with the ability
> to highlight search terms directly on scanned images or to change the
> scale of images, can one instead have Greenstone open the original PDF
> files, with the help of an Acrobat viewer on the client machine? If so,
> is it possible to pre-load the search term so that it is highlighted on
> the page as soon as the viewer opens, rather than having to type in the
> term again into the Acrobat search box?
> Alex Kriwaczek.
Ecole nationale des chartes
19 rue de la Sorbonne
Email: firstname.lastname@example.org
Tel.: 01 55 42 75 05