[greenstone-users] PDF files with searchable scanned images

From PG-PG-Kriwaczek, Alexander
Date Sat Jun 27 08:10:13 2009
Subject [greenstone-users] PDF files with searchable scanned images
In-Reply-To (4A44D3E6-9010808-enc-sorbonne-fr)

Hi Guillaume,

Thanks for your very helpful reply.

I now see that I'll have to display the PDF files themselves if I want to highlight the search terms: the quality of the OCR'd text is not very good, and I don't have the time to proofread all 560 pages.

I found a way of launching a PDF file from a URL (which looks something like a GET query) with a search term pre-loaded, so that Acrobat Reader (or the full Acrobat application) highlights the first occurrence of the term and opens the search pane, which lists all occurrences of the term. Perhaps I can get Greenstone to construct such URLs.
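
A minimal sketch of how such a URL could be built. It assumes the mechanism meant here is Adobe's "search" PDF open parameter, i.e. a fragment of the form #search="word list" appended to the file's URL; the message does not spell out the exact format. The example.org address and the function name pdf_search_url are hypothetical, and percent-encoding the quoted word list is an assumption about what the viewer will accept. How Greenstone would call such a helper is not covered.

```python
from typing import Iterable
from urllib.parse import quote


def pdf_search_url(pdf_url: str, terms: Iterable[str]) -> str:
    """Return pdf_url with a '#search="..."' fragment for the given terms.

    Assumption: the viewer honours Adobe's 'search' open parameter, which
    takes a quoted, space-separated word list.
    """
    words = [t.strip() for t in terms if t.strip()]
    base = pdf_url.split("#", 1)[0]  # drop any existing fragment
    if not words:
        return base
    for word in words:
        # These characters are reserved in open parameters (or would end
        # the quoted word list early), so reject them rather than guess.
        if any(ch in word for ch in '=#&"'):
            raise ValueError(f"unsupported character in search term: {word!r}")
    word_list = '"' + " ".join(words) + '"'
    # Percent-encode the quotes and spaces so the result is a valid URL.
    return base + "#search=" + quote(word_list, safe="")


# Hypothetical address, for illustration only.
print(pdf_search_url("http://example.org/newspaper/page001.pdf",
                     ["fire", "brigade"]))
```

Whether the term is actually highlighted depends on the PDF viewer installed on the client machine, so this would need testing with the viewers the library's users are likely to have.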

Regards,

Alex Kriwaczek.

________________________________________
From: Guillaume Hatt [guillaume.hatt@enc.sorbonne.fr]
Sent: 26 June 2009 14:57
To: PG-PG-Kriwaczek, Alexander; greenstone-users@list.scms.waikato.ac.nz
Subject: Re: [greenstone-users] PDF files with searchable scanned images

Hi,

I don't have answers to all these questions, but we have built a
collection of PDF files + OCR + metadata with pdftk and Greenstone. You
can see the result at page
http://bibnum.enc.sorbonne.fr/?site=localhost&a=p&p=about&c=tap&l=fr&w=utf-8
If you have other questions, I can try to answer.

As for your questions, I think:
1. No, search terms are highlighted only in the text version (you can try it on our site).
2. No, you would have to build it yourself with JavaScript (I would be interested in this too).
3. No, the PDF files are not generated after the search.
Does anybody have other answers? I would be glad to be proved wrong.

Regards,
G. Hatt

PG-PG-Kriwaczek, Alexander wrote:
> Hi,
>
> I'm building a digital library to hold 560 scanned images of newspaper
> pages. I've completed the scanning and am about to embark on the OCR.
> With the help of OmniPage and Acrobat, I'm able to produce individual
> PDF files containing searchable high quality images, which in Acrobat
> allows search terms to be displayed directly on the scanned image, as
> described on page 182 of Witten and Bainbridge's book "How to Build a
> Digital Library". These PDF files convert successfully to HTML using the
> pdftohtml.exe application, which I believe underlies the PDF plugin.
>
> Before carrying out the OCR on all 560 images there are certain issues
> that I'm not entirely sure about, despite having looked at the
> documentation. I would be really grateful if someone could clarify at
> least some of the following questions so that I can have a clearer view
> of the road ahead:
>
> 1. If you import such a PDF file into Greenstone, will the digital
> library offer the ability to display search terms directly on the
> scanned image, as in Acrobat? I noticed that the Papers Past digital
> library has something like this facility, although it seems as if you
> cannot see the highlighted search term in full page view.
>
> 2. When viewing the scanned image of a full newspaper page in
> Greenstone, is it possible, in theory, to change the scale of the image
> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
> hard to build such a web-based image viewer using JavaScript, but I
> don't know whether one can incorporate this code into a Greenstone
> digital library.
>
> 3. If it is not possible to endow Greenstone HTML pages with the ability
> to highlight search terms directly on scanned images or to change the
> scale of images, can one instead have Greenstone open the original PDF
> files, with the help of an Acrobat viewer on the client machine? If so,
> is it possible to pre-load the search term so that it is highlighted on
> the page as soon as the viewer opens, rather than having to type in the
> term again into the Acrobat search box?
>
> Regards,
>
> Alex Kriwaczek.

--
========================================================
Guillaume HATT
Bibliothécaire
Informatique documentaire
Ecole nationale des chartes
19 rue de la Sorbonne
75005 Paris
Courriel : guillaume.hatt@enc.sorbonne.fr
Tél. : 01 55 42 75 05
========================================================