Re: [greenstone-users] Basic questions about Greenstone

From Mike
Date Thu, 10 Feb 2005 09:04:10 +0000
Subject Re: [greenstone-users] Basic questions about Greenstone
In-Reply-To (000c01c50ece$3a9014a0$982a01c8-DIEGOS)
Thanks Diego,

That sounds very encouraging!

Mike

Diego Spano wrote:

>Mike, we have run a test in Greenstone managing 1,000,000 scanned pages
>and, believe me, the search is extremely fast (both full text and
>browsing). We host Greenstone on Windows XP. Next week I will send an
>email with some statistics about the test we have done.
>
>About your questions:
>
>>1. We plan to digitise the documents to produce TIFF files. We would
>>want the user to be able to search in the full text of the docs and find
>>a list of hits, each of which takes them to the relevant page – what
>>they will see is the image (.gif or .png ??) of that page. The user
>>should also be able to navigate through the images of the pages one at a
>>time, forward and backwards.
>>
>>I assume we can implement this via Greenstone, but will it handle
>>searching possibly 900,000 pages?
>>
>>Would built-in Greenstone web server handle this volume?
>>
>>Also I assume the full text searching could be done on the OCR (we have
>>no budget for re-keying the text)?
>
>Greenstone will handle this number of files without problems. We work
>with a lot of images, all in TIFF format (G4 compression). We OCR
>all pages (without human re-keying) and import them into Greenstone
>using PagedImgPlug. You don't need to convert to GIF or PNG format. You
>only need to install a TIFF plugin on the user's machine so that the
>browser can show the images. Try www.alternatiff.com to get a free
>plugin. Don't worry about the volume of images. What really matters is
>the volume of text. You can have 1 TB of images but perhaps only a few
>GB of text. What Greenstone indexes are the text files, and I think the
>limit is many GB.
>
>>2. As well as making this collection available on-line, some users will
>>need to be able to access it off-line, since they will have little or no
>>Internet access. Obviously a CD-ROM will be too small to hold 900,000
>>pages – can Greenstone export a collection to DVD-ROM?
>
>Yes, you can export to CD or DVD, but do you think one DVD can hold
>900,000 pages? Have you estimated the space needed to store all the
>images and text? Are the images black and white or color?
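
A rough way to answer the space question above is to multiply it out. The sketch below is only illustrative: the per-page sizes are placeholder assumptions (not figures from this thread) and should be replaced with averages measured from a sample of real scans. The only fixed number is the 4.7 GB capacity of a single-layer DVD.

```python
import math

DVD_CAPACITY_GB = 4.7        # single-layer DVD, decimal gigabytes
PAGES = 900_000              # upper bound mentioned in the original question

# Hypothetical per-page sizes: measure a sample of your own files instead.
ASSUMED_KB_PER_TIFF = 50     # bilevel G4 scans; color scans are far larger
ASSUMED_KB_PER_TEXT = 2      # OCR output for one page


def estimate_gb(pages, kb_per_page):
    """Return the total size in decimal gigabytes for `pages` files."""
    return pages * kb_per_page / 1_000_000


def discs_needed(total_gb, capacity_gb=DVD_CAPACITY_GB):
    """Return how many discs are needed, ignoring index and filesystem overhead."""
    return math.ceil(total_gb / capacity_gb)


if __name__ == "__main__":
    images_gb = estimate_gb(PAGES, ASSUMED_KB_PER_TIFF)
    text_gb = estimate_gb(PAGES, ASSUMED_KB_PER_TEXT)
    total_gb = images_gb + text_gb
    print(f"images: {images_gb:.1f} GB, text: {text_gb:.1f} GB")
    print(f"single-layer DVDs needed: {discs_needed(total_gb)}")
```

With these placeholder sizes the images alone come to about 45 GB, so the collection would not fit on one DVD. The real answer depends entirely on the measured file sizes.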
>
>>3. If not, the other possibility would be to place the collection on a
>>server in say 3 or more different institutions. Most would have slow or
>>no access to the Internet (based in Africa), but users at the
>>institutions could access the collection over their local network. This
>>would be OK until some new docs were to be added to the collection – is
>>there any way to manage the update of a number of identical collections
>>so that they remain identical?
>
>I don't know of any other method to update the indexes on different
>computers. I always copy the index folder of the collection to the
>secondary server to keep both machines up to date.
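
The manual copy described above could be scripted along these lines. This is a minimal sketch, not a Greenstone feature: it assumes each secondary server's collection folder is reachable as an ordinary path (for example a mounted network share), and the function name `push_index` and the `index.new` staging folder are made up for illustration.

```python
import shutil
from pathlib import Path


def push_index(collection_dir, mirror_collection_dirs):
    """Copy <collection>/index to every mirror, replacing the old copy.

    The new index is copied to a staging folder first and swapped in
    afterwards, so files that no longer exist in the source do not
    linger on the mirror.
    """
    source = Path(collection_dir) / "index"
    if not source.is_dir():
        raise FileNotFoundError(f"no index folder in {collection_dir}")

    updated = []
    for mirror in mirror_collection_dirs:
        target = Path(mirror) / "index"
        staging = target.with_name("index.new")  # hypothetical staging name
        if staging.exists():
            shutil.rmtree(staging)               # leftover from a failed run
        shutil.copytree(source, staging)
        if target.exists():
            shutil.rmtree(target)
        staging.rename(target)
        updated.append(target)
    return updated
```

Sites with slow or no network link would still need the index folder delivered on physical media. The same function could then be run locally with the media as the source.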
>
>>4. We plan to collect metadata at the time of scanning (maybe to simple
>>text files). Would there be any automated way to upload this to the
>>created collection, or would it be a matter of a very large cut and
>>paste exercise using the Librarian interface?
>
>My advice: don't use GLI for this job. Simply scan into folders (one
>folder per document) and put a .item file inside each one (it is used by
>PagedImgPlug).
>
>It is simple: put the TIFF files and text files in a folder, and inside
>it create a .item file, e.g. doc.item.
>This .item file contains metadata for the document and a list of the
>image and text files that make up the document.
>PagedImgPlug then processes these .item files, linking the images into a
>single document.
>There are options to the plugin for creating thumbnails or small preview
>size images from the main page images.
>
>There are some brief instructions in the header of the plugin file.
>
>Note: PagedImgPlug is available in the Greenstone 2.50 release.
>
>The format of the .item file is: first all metadata fields, and then one
>line for each image in this format:
>Page_number:image_file_name:text_file_name:
>
><Title>Document 1
><Subject>Test
><Author>Erik Andersen
> 1:F6N869510.TIF:F6N869510.txt:
> 2:F6N869511.TIF:F6N869511.txt:
> 3:F6N869512.TIF:F6N869512.txt:
> 4:F6N869513.TIF:F6N869513.txt:
> 5:F6N869514.TIF:F6N869514.txt:
> 6:F6N869515.TIF:F6N869515.txt:
> 7:F6N869516.TIF:F6N869516.txt:
> 8:F6N869517.TIF:F6N869517.txt:
> 9:F6N869518.TIF:F6N869518.txt:
> 10:F6N869519.TIF:F6N869519.txt:
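
For a large scanning job, .item files in the format shown above could be generated by a script instead of by hand. The following is a minimal sketch under stated assumptions: every page is a `.tif`/`.tiff` file with an OCR text file of the same base name, page order follows file-name order, and the metadata has already been read into a dictionary. The function name `build_item_file` is made up for illustration and is not part of Greenstone.

```python
from pathlib import Path


def build_item_file(doc_dir, metadata, item_name="doc.item"):
    """Write a .item file for one document folder and return its path."""
    doc_dir = Path(doc_dir)

    # Metadata lines first, one "<Field>value" per line.
    lines = [f"<{field}>{value}" for field, value in metadata.items()]

    # Sort by file name so page numbers follow the scan order (assumption).
    tiffs = sorted(
        p for p in doc_dir.iterdir() if p.suffix.lower() in (".tif", ".tiff")
    )
    for page_number, tiff in enumerate(tiffs, start=1):
        text_file = tiff.with_suffix(".txt")
        if not text_file.exists():
            raise FileNotFoundError(f"no OCR text for {tiff.name}")
        # Page_number:image_file_name:text_file_name:
        lines.append(f"{page_number}:{tiff.name}:{text_file.name}:")

    item_path = doc_dir / item_name
    item_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return item_path
```

Calling `build_item_file("scans/doc1", {"Title": "Document 1", "Subject": "Test"})` on each document folder (the path here is only an example) would produce one doc.item per document, ready for the import step.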
>
>
>Since you are working with TIFF files, you need a viewer that lets the
>browser display the images. Take a look at www.alternatiff.com (a TIFF
>image viewer for Windows web browsers). Then, in collect.cfg, you need to
>add this format statement to view the images:
>
>format DocumentText
>"<center><b>[parent:Title]</b></center><br><br><br><table border=0
>align=center WIDTH=750><tr><td align=center><embed width=550 height=950
>src=_httpcollection_/index/assoc/[parent:assocfilepath]/[Image]
>type=image/tiff toolbar=top></td></tr></table>
>[Text]"
>
>And that's all!
>
>Hope this helps. Ask if you need anything else.
>
>Diego Spano
>Archivo Digital
>Secretaria de DD. HH.
>Ministerio de Justicia y DD. HH.
>Tel: (5411)-4382-6404
>djspano@jus.gov.ar
>
>
>-----Original Message-----
>From: greenstone-users-bounces@list.scms.waikato.ac.nz
>[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On Behalf Of
>Mike
>Sent: Wednesday, 09 February 2005 12:45 p.m.
>To: greenstone-users@list.scms.waikato.ac.nz
>Subject: [greenstone-users] Basic questions about Greenstone
>
>
>Hi,
>
>We are planning to make available a collection of approx 600,000 -
>900,000 pages of scanned documents both on-line and off-line and I was
>hoping to find out the answers to some basic questions about Greenstone,
>by drawing on your experiences.
>
>The likely OS for hosting the collection is Linux.
>
>
>1. We plan to digitise the documents to produce TIFF files. We would
>want the user to be able to search in the full text of the docs and find
>a list of hits, each of which takes them to the relevant page – what
>they will see is the image (.gif or .png ??) of that page. The user
>should also be able to navigate through the images of the pages one at a
>time, forward and backwards.
>
>I assume we can implement this via Greenstone, but will it handle
>searching possibly 900,000 pages?
>
>Would built-in Greenstone web server handle this volume?
>
>Also I assume the full text searching could be done on the OCR (we have
>no budget for re-keying the text)?
>
>2. As well as making this collection available on-line, some users will
>need to be able to access it off-line, since they will have little or no
>Internet access. Obviously a CD-ROM will be too small to hold 900,000
>pages – can Greenstone export a collection to DVD-ROM?
>
>3. If not, the other possibility would be to place the collection on a
>server in say 3 or more different institutions. Most would have slow or
>no access to the Internet (based in Africa), but users at the
>institutions could access the collection over their local network. This
>would be OK until some new docs were to be added to the collection – is
>there any way to manage the update of a number of identical collections
>so that they remain identical?
>
>4. We plan to collect metadata at the time of scanning (maybe to simple
>text files). Would there be any automated way to upload this to the
>created collection, or would it be a matter of a very large cut and
>paste exercise using the Librarian interface?
>
>Regards,
>Mike Cave
>-----------------------------------------------------
>Technical Development Manager
>FORCED MIGRATION ONLINE
>Refugee Studies Centre
>University of Oxford
>Queen Elizabeth House
>21 St Giles
>Oxford OX1 3LA
>E-mail: mike.cave@qeh.ox.ac.uk
>Tel: +44-1865-270262
>Fax: +44-1865-270297
>http://www.forcedmigration.org