From | Stephen.DeGabrielle@ntu.edu.au |
Date | Tue, 8 Apr 2003 09:09:36 +0930 |
Subject | Re: Paper documents |
Hi Diego,

This is what we are doing.

Our intermediate solution is to encode the documents as PDF (Omnipage has a [Save as] 'PDF with image on [invisible] text' option). This gives you a searchable collection of documents with page images, with which you can do either (or both) of two things:

1. 'Page images' (and PDFs if you want): you can use the '-use_sections 1' switch and show the page images and uncorrected text. Be careful: this only works when your original page images are JPG. Greenstone will put the page image first and the uncorrected text underneath; if your original image was TIFF then you will get no image, just yucky uncorrected text. This option gives you a 'browsable' document in the Greenstone page-turning interface with the highlighted search terms underneath. The users can also download the whole document as a PDF if you so configure it.

2. 'PDF only': you can use the 'no_text true' option and just refer to the source file in your format statements --> [srclink][srcicon][/srclink]. The downside of this option is that your users have to download a fair-sized PDF and don't get sent to the exact page of the PDF (well, I don't know how to reference a particular page of a PDF). The upside is that the Greenstone Digital Library Software does tell you what page to look on, and Acrobat is a reasonable page turner.

We are currently using option 2 above, as it has let us get our first batch of documents working relatively quickly. (You must change the pdftohtml.pl file so it passes a '-hidden' flag to the pdftohtml utility, or you will get no text and the build process will fail.) We also create our metadata by hand in metadata.xml files.

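For readers who would rather not write every metadata.xml by hand, the following is a minimal sketch of generating one from a Python dictionary. It is not part of the workflow described in this message. The element layout (DirectoryMetadata / FileSet / FileName / Description / Metadata) is an assumption about Greenstone's metadata.xml format and should be checked against a hand-written file from your own collection; the DOCTYPE line is left out, and the file name and field values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(records):
    """Build a DirectoryMetadata tree.

    records maps a source file name to a dict of metadata name -> value.
    The element names used here are an assumed Greenstone layout.
    """
    root = ET.Element("DirectoryMetadata")
    for filename, fields in records.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = filename
        description = ET.SubElement(fileset, "Description")
        for name, value in fields.items():
            # One <Metadata name="..."> element per field.
            meta = ET.SubElement(description, "Metadata", {"name": name})
            meta.text = value
    return root


def to_xml_string(root):
    """Serialise the tree with an XML declaration in front."""
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = {"report-1978.pdf": {"Title": "Example report", "Date": "1978"}}
    print(to_xml_string(build_directory_metadata(example)))
```

The returned string can be written to a metadata.xml file next to the PDFs it describes. Special characters in values are escaped by ElementTree during serialisation.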
In the longer term we plan to automate the process as much as possible, archiving the converted text as XML[1][2] and the images as uncompressed TIFF. The TEI-lite XML will be created by an XSLT stylesheet that converts from the non-standard Omnipage XML. For the TIFF images we plan to use ImageMagick to convert to JPG for viewing on the web via the Greenstone DL software. We have a TEIPlug which should hopefully do the job of importing the documents, with some modifications to reference the new page images. So far the XSLT is working, though I only have it converting to HTML, as I have not yet attacked the problem of getting the plugin to work properly. We also plan to harvest our metadata from the ILMS (Dynix).

We have chosen TEI as the archival format for the text as it allows us to reference the original page images in a standard way (something which would be difficult in HTML), and uncompressed TIFF as it seems to be a relatively popular archival format for libraries, with a wide variety of tools that work with it.

I hope this helps.

Regards,
Stephen
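
As a rough illustration of the TIFF-to-JPG step described in the message above, the sketch below builds one ImageMagick command per archival TIFF. It is an assumption about how that step could be scripted, not the author's actual setup: the function name, the directory names, and the quality value of 85 are all hypothetical. It only constructs the command lists, so nothing is converted until you choose to run them.

```python
from pathlib import PurePosixPath


def jpg_conversion_commands(tiff_paths, jpg_dir, quality=85):
    """Return one ImageMagick argv list per TIFF path.

    Non-TIFF paths are skipped. The quality value is an arbitrary
    placeholder, not a recommendation from the original message.
    """
    commands = []
    for path in tiff_paths:
        tiff = PurePosixPath(path)
        if tiff.suffix.lower() not in (".tif", ".tiff"):
            continue
        # Keep the page's base name so the JPG can be matched to its TIFF.
        target = PurePosixPath(jpg_dir) / (tiff.stem + ".jpg")
        commands.append(
            ["convert", str(tiff), "-quality", str(quality), str(target)]
        )
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True). On ImageMagick 7 the executable is normally called magick rather than convert. The archival TIFFs are only read, so the uncompressed masters stay untouched.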
[1] TEI-lite: http://www.tei-c.org/Lite/
[2] Encoding to Level 1 standard as per the 'TEI Text Encoding in Libraries Guidelines for Best Encoding Practices': http://www.diglib.org/standards/tei.htm
_________________________________________________
I have something that I don't know how to resolve using Greenstone. I'm working at the Human Rights Secretariat in Argentina, but I also work in association with several NGOs. We focus on files that belong to people who were kidnapped in the 70s and 80s. We have a lot of paper documents that we want to organize, and Greenstone seems to be a very useful tool. |