[greenstone-users] PDF files with searchable scanned images

From PG-PG-Kriwaczek, Alexander
Date Sun Jun 28 06:36:41 2009
Subject [greenstone-users] PDF files with searchable scanned images
In-Reply-To (4A44D3E6-9010808-enc-sorbonne-fr)
Hi Guillaume,

I looked at your digital library with great interest and found that your PDF files contain searchable scanned images, which is just what I intend to produce. Using OmniPage, I'm able to save such "PDF searchable image" files as a result of the OCR process.

The test PDF searchable image files that I have produced can be converted into HTML using the pdftohtml.exe utility. However, it does not look as if anything has been done with the hidden OCR'd text that is embedded in the PDF file. Does this mean that Greenstone's PDF plugin would likewise ignore the hidden OCR'd text when importing the PDF files into the digital library?
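
One way to check whether the hidden OCR'd text can be extracted at all is to run a text-extraction tool over a test file. The following is a minimal sketch, assuming the `pdftotext` command-line tool (from Xpdf or Poppler) is installed and on the PATH. The function names and the 20-character threshold are hypothetical choices made for illustration. It only shows whether an external tool can read the text layer; it does not show what Greenstone's PDF plugin actually does on import.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build a pdftotext command that writes the extracted text to stdout ("-")."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def has_text_layer(extracted_text, min_chars=20):
    """Heuristic (assumption): enough non-whitespace text means a usable text layer."""
    visible = "".join(extracted_text.split())
    return len(visible) >= min_chars


def check_pdf(pdf_path):
    """Return True/False for the text layer, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate odd bytes in the OCR output
        check=True,
    )
    return has_text_layer(result.stdout)
```

Calling `check_pdf("page001.pdf")` (a hypothetical file name) on one OmniPage output file would then indicate whether the OCR'd text is reachable before all 560 pages are processed.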

Should I, therefore, get OmniPage to save separate OCR'd text files in addition to the PDF searchable image files, and then additionally import the OCR'd text files into Greenstone? In that case, do you connect the OCR'd text to the PDF files using structural metadata?


Alex Kriwaczek.
From: Guillaume Hatt [guillaume.hatt@enc.sorbonne.fr]
Sent: 26 June 2009 14:57
To: PG-PG-Kriwaczek, Alexander; greenstone-users@list.scms.waikato.ac.nz
Subject: Re: [greenstone-users] PDF files with searchable scanned images


I don't have answers to all these questions, but we have built a
collection of PDF files + OCR + metadata with pdftk and Greenstone. You
can see the result at page
If you have other questions, I can try to answer.

As for your questions, I think:
1. No, search terms are highlighted only in the text version (you can try it on our site).
2. No, you must build it with JavaScript (and I would be interested too).
3. No, the PDF files are not generated after the search.
Does anybody have other answers? I would be glad to be proven wrong.

G. Hatt

PG-PG-Kriwaczek, Alexander wrote:
> Hi,
> I'm building a digital library to hold 560 scanned images of newspaper
> pages. I've completed the scanning and am about to embark on the OCR.
> With the help of OmniPage and Acrobat, I'm able to produce individual
> PDF files containing searchable high quality images, which in Acrobat
> allows search terms to be displayed directly on the scanned image, as
> described on page 182 of Witten and Bainbridge's book "How to Build a
> Digital Library". These PDF files convert successfully to HTML using the
> pdftohtml.exe application, which I believe underlies the PDF plugin.
> Before carrying out the OCR on all 560 images there are certain issues
> that I'm not entirely sure about, despite having looked at the
> documentation. I would be really grateful if someone could clarify at
> least some of the following questions so that I can have a clearer view
> of the road ahead:
> 1. If you import such a PDF file into Greenstone, will the digital
> library offer the ability to display search terms directly on the
> scanned image, as in Acrobat? I noticed that the Papers Past digital
> library has something like this facility, although it seems as if you
> cannot see the highlighted search term in full page view.
> 2. When viewing the scanned image of a full newspaper page in
> Greenstone, is it possible, in theory, to change the scale of the image
> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
> hard to build such a web-based image viewer using JavaScript, but I
> don't know whether one can incorporate this code into a Greenstone
> digital library?
> 3. If it is not possible to endow Greenstone HTML pages with the ability
> to highlight search terms directly on scanned images or to change the
> scale of images, can one instead have Greenstone open the original PDF
> files, with the help of an Acrobat viewer on the client machine? If so,
> is it possible to pre-load the search term so that it is highlighted on
> the page as soon as the viewer opens, rather than having to type in the
> term again into the Acrobat search box?
> Regards,
> Alex Kriwaczek.
> ------------------------------------------------------------------------
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users
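
On question 3 in the quoted message (pre-loading the search term when the original PDF is opened): Adobe's PDF open parameters allow a link to carry `search`, `page` and `view` settings in the URL fragment. The sketch below builds such a link. The function name and the example URL are hypothetical, and whether the parameters are honoured depends on the PDF viewer and browser on the client machine. How Greenstone would pass the query term into the link is not covered here.

```python
from urllib.parse import quote


def pdf_open_url(pdf_url, search_terms=None, page=None, view=None):
    """Append PDF open parameters (page, view, search) as a URL fragment."""
    params = []
    if page is not None:
        params.append(f"page={page}")
    if view is not None:
        # e.g. "Fit" (whole page) or "FitH" (fit width)
        params.append(f"view={view}")
    if search_terms:
        # Words are space-separated, URL-encoded, and wrapped in quotation marks.
        words = quote(" ".join(search_terms))
        params.append(f'search="{words}"')
    if not params:
        return pdf_url
    return pdf_url + "#" + "&".join(params)


# Hypothetical example link for one newspaper page:
link = pdf_open_url(
    "http://example.org/papers/page001.pdf",
    search_terms=["town", "council"],
    page=1,
    view="FitH",
)
```

The `view` values also bear on question 2, since `Fit` and `FitH` correspond roughly to 'fit page' and 'fit width' in the viewer.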

Guillaume HATT
Informatique documentaire
Ecole nationale des chartes
19 rue de la Sorbonne
75005 Paris
Email: guillaume.hatt@enc.sorbonne.fr
Tel.: 01 55 42 75 05