[greenstone-users] PDF files with searchable scanned images

From Guillaume Hatt
Date Fri Jul 3 01:16:14 2009
Subject [greenstone-users] PDF files with searchable scanned images
In-Reply-To (C0065BA84E23FC49A7DF8B98C81A080C01079FA005FA-NSQ165EX-enterprise-internal-city-ac-uk)

I also tried to import the text output of OmniPage OCR into Greenstone, but I
did not find a good method. At present, our digital library contains:
- PDF with OCR text produced by Omnipage (the OCR is good, but not
- Greenstone OCR of these PDF files (not so good).

So you can search "inside the text" with Greenstone's text search, and
you can download the PDF file and search inside it with a PDF client (Adobe,
Evince, Foxit and so on).
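
For the second route (searching inside the downloaded PDF), one thing that may be worth testing is a link that carries the search term along. Adobe's "PDF open parameters" describe a search="word list" fragment, which asks the viewer to run the search when the file opens. This is only a sketch, not something we use on our site: the function name pdf_search_link and the example URL are hypothetical, many viewers ignore the parameter, and whether a given viewer decodes the percent-encoded form is an assumption to check.

```python
from urllib.parse import quote


def pdf_search_link(pdf_url: str, terms: list[str]) -> str:
    """Return pdf_url with a '#search="..."' fragment appended.

    Assumption: the client's PDF viewer honours the 'search' open
    parameter. Viewers that do not will simply open the file.
    """
    # Remove double quotes (they would break the quoted word list)
    # and drop terms that end up empty.
    words = [t.replace('"', "").strip() for t in terms]
    words = [w for w in words if w]
    if not words:
        return pdf_url
    # Percent-encode the quoted, space-separated word list so the
    # result is a valid URL.
    fragment = quote('"' + " ".join(words) + '"', safe="")
    # Replace any existing fragment instead of adding a second '#'.
    base = pdf_url.split("#", 1)[0]
    return f"{base}#search={fragment}"


if __name__ == "__main__":
    # Hypothetical document URL, for illustration only.
    print(pdf_search_link("http://example.org/page1.pdf", ["gold", "rush"]))
```

The query terms would come from the Greenstone search form; how to pass them into the page template is not covered here.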

G. Hatt

PG-PG-Kriwaczek, Alexander wrote:
> Hi Guillaume,
> I looked at your digital library with great interest and found that your PDF files contain searchable scanned images, just as I intend to do. Using OmniPage, I'm able to save such "PDF searchable image" files as a result of the OCR process.
> The test PDF searchable image files that I have produced can be converted into HTML using the pdftohtml.exe utility. However, it does not look as if anything has been done with the hidden OCR'd text that is embedded in the PDF file. Does this mean that Greenstone's PDF plugin would likewise ignore the hidden OCR'd text when importing the PDF files into the digital library?
> Should I, therefore, get OmniPage to save separate OCR'd text files in addition to the PDF searchable image files, and then additionally import the OCR'd text files into Greenstone? In that case, do you connect the OCR'd text to the PDF files using structural metadata?
> Regards,
> Alex Kriwaczek.
> ________________________________________
> From: Guillaume Hatt [guillaume.hatt@enc.sorbonne.fr]
> Sent: 26 June 2009 14:57
> To: PG-PG-Kriwaczek, Alexander; greenstone-users@list.scms.waikato.ac.nz
> Subject: Re: [greenstone-users] PDF files with searchable scanned images
> Hi,
> I don't have answers to all these questions, but we have built a
> collection of PDF files + OCR + metadata with pdftk and Greenstone. You
> can see the result at page
> http://bibnum.enc.sorbonne.fr/?site=localhost&a=p&p=about&c=tap&l=fr&w=utf-8
> If you have other questions, I can try to answer.
> For your questions, I think :
> 1. No, only on the text version (you can try on our site)
> 2. No, you must build it with Javascript (and I would be interested too)
> 3. No, the PDF files are not generated after the search.
> Does anybody have other answers? I would be glad to be shown wrong.
> Regards,
> G. Hatt
> PG-PG-Kriwaczek, Alexander wrote:
>> Hi,
>> I'm building a digital library to hold 560 scanned images of newspaper
>> pages. I've completed the scanning and am about to embark on the OCR.
>> With the help of OmniPage and Acrobat, I'm able to produce individual
>> PDF files containing searchable high quality images, which in Acrobat
>> allows search terms to be displayed directly on the scanned image, as
>> described on page 182 of Witten and Bainbridge's book "How to Build a
>> Digital Library". These PDF files convert successfully to HTML using the
>> pdftohtml.exe application, which I believe underlies the PDF plugin.
>> Before carrying out the OCR on all 560 images there are certain issues
>> that I'm not entirely sure about, despite having looked at the
>> documentation. I would be really grateful if someone could clarify at
>> least some of the following questions so that I can have a clearer view
>> of the road ahead:
>> 1. If you import such a PDF file into Greenstone, will the digital
>> library offer the ability to display search terms directly on the
>> scanned image, as in Acrobat? I noticed that the Papers Past digital
>> library has something like this facility, although it seems as if you
>> cannot see the highlighted search term in full page view.
>> 2. When viewing the scanned image of a full newspaper page in
>> Greenstone, is it possible, in theory, to change the scale of the image
>> ('actual size', 'fit page' and 'fit width') as in Acrobat? It is not too
>> hard to build such a web-based image viewer using JavaScript, but I
>> don't know whether one can incorporate this code into a Greenstone
>> digital library?
>> 3. If it is not possible to endow Greenstone HTML pages with the ability
>> to highlight search terms directly on scanned images or to change the
>> scale of images, can one instead have Greenstone open the original PDF
>> files, with the help of an Acrobat viewer on the client machine? If so,
>> is it possible to pre-load the search term so that it is highlighted on
>> the page as soon as the viewer opens, rather than having to type in the
>> term again into the Acrobat search box?
>> Regards,
>> Alex Kriwaczek.
> --
> ========================================================
> Guillaume HATT
> Bibliothécaire
> Informatique documentaire
> Ecole nationale des chartes
> 19 rue de la Sorbonne
> 75005 Paris
> Courriel : guillaume.hatt@enc.sorbonne.fr
> Tél. : 01 55 42 75 05
> ========================================================
