Re: Paper documents

From Stephen.DeGabrielle@ntu.edu.au
Date Tue, 8 Apr 2003 09:09:36 +0930
Subject Re: Paper documents

Hi Diego,

This is what we are doing:

Our intermediate solution is to encode the documents as PDF (Omnipage has a [Save as] 'PDF with image on [invisible] text' option).

This gives you a searchable collection of documents with page images, with which you can do either (or both) of two things:

1. 'Page images' (and PDFs if you want): you can use the '-use_sections 1' switch to show the page images and the uncorrected text. Be careful: this only works when your original page images are JPG. Greenstone will put the page image first and the uncorrected text underneath; if your original image was TIFF you will get no image, just yucky uncorrected text. This option gives you a 'browsable' document in the Greenstone page-turning interface, with the highlighted search terms underneath. Users can also download the whole document as a PDF if you configure it that way.

2. 'PDF only': you can use the 'no_text true' option and just refer to the source file in your format statements --> [srclink][srcicon][/srclink]

  - The downside of this option is that your users have to download a fair-sized PDF and are not sent to the exact page of the PDF (I don't know how to reference a particular page of a PDF).

  - The upside is that the Greenstone Digital Library Software does tell you which page to look on, and Acrobat is a reasonable page turner.

We are currently using option 2 above, as it has let us get our first batch of documents working relatively quickly. (You must change the pdftohtml.pl file so that it passes a '-hidden' flag to the pdftohtml utility, or you will get no text and the build process will fail.)

We also create our metadata by hand in metadata.xml files.
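
A minimal sketch of how such a file could be generated rather than typed by hand. It assumes the usual Greenstone metadata.xml layout (DirectoryMetadata / FileSet / FileName / Description / Metadata); the file name and metadata values are made up for illustration, and the XML declaration and DOCTYPE line that real files usually carry are left out.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(entries):
    """Build metadata.xml content.

    entries maps a file name (as it appears in the import folder)
    to a dict of metadata name -> value.
    """
    root = ET.Element("DirectoryMetadata")
    for filename, fields in entries.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = filename
        description = ET.SubElement(fileset, "Description")
        for name, value in fields.items():
            meta = ET.SubElement(description, "Metadata", name=name)
            meta.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values, not taken from a real collection.
xml_text = build_metadata_xml({
    "report01.pdf": {"Title": "Example report", "Creator": "Example Author"},
})
print(xml_text)
```

The returned string would still need to be written to a metadata.xml file next to the documents it describes.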

 

In the longer term we plan to automate the process as much as possible, archiving the converted text as XML[1][2] and the images as uncompressed TIFF.

The TEI-lite XML will be created by an XSLT stylesheet that converts from the non-standard Omnipage XML.

For the TIFF images we plan to use ImageMagick to convert them to JPG for viewing on the web via the Greenstone DL software.
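
A rough sketch of what that batch conversion might look like, assuming ImageMagick's `convert` command is on the PATH. The directory names and the quality setting are hypothetical, and this is not our actual script.

```python
import subprocess
from pathlib import Path


def build_convert_command(tiff_path, out_dir, quality=85):
    """Return the ImageMagick command line that converts one TIFF to a JPG."""
    tiff_path = Path(tiff_path)
    jpg_path = Path(out_dir) / (tiff_path.stem + ".jpg")
    return ["convert", str(tiff_path), "-quality", str(quality), str(jpg_path)]


def convert_directory(src_dir, out_dir):
    """Convert every .tif/.tiff file in src_dir; the archival TIFFs are not modified."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for tiff_path in sorted(Path(src_dir).iterdir()):
        if tiff_path.suffix.lower() in (".tif", ".tiff"):
            subprocess.run(build_convert_command(tiff_path, out_dir), check=True)


# Show the command for a single (hypothetical) page without running it.
print(build_convert_command("scans/page001.tif", "web"))
```

Calling `convert_directory("scans", "web")` would then write one JPG per page into the web folder while the uncompressed TIFFs stay in place as the archival copies.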

We have a TEIPlug which should hopefully do the job of importing the documents, with some small modifications to reference the new page images.

So far the XSLT is working, though I only have it converting to HTML, as I have not yet attacked the problem of getting the plugin to work properly.

We also plan to harvest our metadata from the ILMS (Dynix).

We have chosen TEI as the archival format for the text because it allows us to reference the original page images in a standard way (something which would be difficult in HTML), and uncompressed TIFF for the images because it seems to be a relatively popular archival format for libraries, with a wide variety of tools that work with it.

I hope this helps.

Regards,

 

Stephen

 

[1] TEI-lite: http://www.tei-c.org/Lite/

[2] Encoding to Level 1 standard as per the 'TEI Text Encoding in Libraries Guidelines for Best Encoding Practices' http://www.diglib.org/standards/tei.htm

_________________________________________________
Stephen De Gabrielle
Digitisation Officer
AraDA Project

Northern Territory University Library
http://www.ntu.edu.au/library
Tel: (08) 8946 7009 from overseas: 61 8 8946 7009
Postal address: P.O.Box 41246, Casuarina, NT, 0811, Australia
CRICOS Provider No: 00300K

 
"Diego Spano" <djspano@jus.gov.ar>
Sent by: owner-tripath-users@colosys.net
07/04/2003 03:33 PM

To: "John R. McPherson" <jrm21@cs.waikato.ac.nz>
cc: "Greenstone List" <greenstone@colosys.net>
bcc:
Subject: Paper documents


I have a problem that I don't know how to solve using Greenstone. I'm working at Human Rights Secretary in Argentina, but I also work in association with several NGOs. We focus on files that belong to people who were kidnapped in the 70s and 80s. We have a lot of paper documents that we want to organize, and Greenstone seems to be a very useful tool.

The case is the following:

Suppose you have a record (approx. 50 paper pages) belonging to "Diego Spano". This file was scanned, so we have 50 TIF files and 50 TXT files (OCR). The text files are used only for searching and for locating the image from which the text was recognized. Since the images have stamps and signatures, the image is what should be displayed once the search on the text has been made.
How can we organize the collection to let users search over the text files and view the associated image file?
How can we make the document "browseable"? I mean, go to the first page, the last, the previous and the next once the user displays a page from the results list?

Any recommendation?

Thanks a lot.

Lic. Diego J. Spano
djspano@jus.gov.ar