[greenstone-users] PDF files with searchable scanned images

From Guillaume Hatt
Date Sat Jun 27 01:58:19 2009
Subject [greenstone-users] PDF files with searchable scanned images
In-Reply-To (C0065BA84E23FC49A7DF8B98C81A080C01079FA005F6-NSQ165EX-enterprise-internal-city-ac-uk)

I don't have answers to all these questions, but we have built a
collection of PDF files + OCR + metadata with pdftk and Greenstone. You
can see the result at page
If you have other questions, I can try to answer.

As for your questions, I think:
1. No, search terms are highlighted only in the text version (you can try this on our site).
2. No, you would have to build it yourself with JavaScript (I would be interested in this too).
3. No, the PDF files are not generated after the search.
Does anybody have other answers? I would be glad to be proven wrong.
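
On question 2, here is a minimal sketch of the scaling arithmetic that a
JavaScript viewer of this kind would need. It is only an illustration: the
function names (computeScale, scaledSize) and the mode strings are
hypothetical, nothing here is part of Greenstone, and whether such a script
can be embedded in a Greenstone page is still an open question.

```javascript
// Hypothetical helper, not a Greenstone API.
// image and viewport are plain objects: { width, height } in pixels.
function computeScale(mode, image, viewport) {
  switch (mode) {
    case "actual-size":
      // Show the scan at its native pixel size.
      return 1;
    case "fit-width":
      // Scale so the image is exactly as wide as the viewport.
      return viewport.width / image.width;
    case "fit-page":
      // Scale so the whole page is visible: take the tighter constraint.
      return Math.min(
        viewport.width / image.width,
        viewport.height / image.height
      );
    default:
      throw new Error(`Unknown scale mode: ${mode}`);
  }
}

// Returns the pixel size the image should be displayed at.
function scaledSize(mode, image, viewport) {
  const scale = computeScale(mode, image, viewport);
  return {
    width: Math.round(image.width * scale),
    height: Math.round(image.height * scale),
  };
}
```

In a browser, the result of scaledSize could then be applied to the width and
height of the <img> element that shows the scanned page.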

G. Hatt

PG-PG-Kriwaczek, Alexander wrote:
> Hi,
> I'm building a digital library to hold 560 scanned images of newspaper
> pages. I've completed the scanning and am about to embark on the OCR.
> With the help of OmniPage and Acrobat, I'm able to produce individual
> PDF files containing searchable high quality images, which in Acrobat
> allows search terms to be displayed directly on the scanned image, as
> described on page 182 of Witten and Bainbridge's book "How to Build a
> Digital Library". These PDF files convert successfully to HTML using the
> pdftohtml.exe application, which I believe underlies the PDF plugin.
> Before carrying out the OCR on all 560 images there are certain issues
> that I'm not entirely sure about, despite having looked at the
> documentation. I would be really grateful if someone could clarify at
> least some of the following questions so that I can have a clearer view
> of the road ahead:
> 1. If you import such a PDF file into Greenstone, will the digital
> library offer the ability to display search terms directly on the
> scanned image, as in Acrobat? I noticed that the Papers Past digital
> library has something like this facility, although it seems as if you
> cannot see the highlighted search term in full page view.
> 2. When viewing the scanned image of a full newspaper page in
> Greenstone, is it possible, in theory, to change the scale of the image
> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
> hard to build such a web-based image viewer using JavaScript, but I
> don't know whether one can incorporate this code into a Greenstone
> digital library?
> 3. If it is not possible to endow Greenstone HTML pages with the ability
> to highlight search terms directly on scanned images or to change the
> scale of images, can one instead have Greenstone open the original PDF
> files, with the help of an Acrobat viewer on the client machine? If so,
> is it possible to pre-load the search term so that it is highlighted on
> the page as soon as the viewer opens, rather than having to type in the
> term again into the Acrobat search box?
> Regards,
> Alex Kriwaczek.
> ------------------------------------------------------------------------
> _______________________________________________
> greenstone-users mailing list
> greenstone-users@list.scms.waikato.ac.nz
> https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users

Guillaume HATT
Informatique documentaire
Ecole nationale des chartes
19 rue de la Sorbonne
75005 Paris
Email: guillaume.hatt@enc.sorbonne.fr
Tel.: 01 55 42 75 05