RE: [greenstone-users] Basic questions about Greenstone

From Diego Spano
Date Wed, 9 Feb 2005 14:37:36 -0300
Subject RE: [greenstone-users] Basic questions about Greenstone
In-Reply-To (420A3012-7010500-qeh-ox-ac-uk)
Mike, we have run a test in Greenstone managing 1,000,000 scanned pages,
and believe me, searching is extremely fast (both full-text search and
browsing). We host Greenstone on Windows XP. Next week I will send an
email with some statistics from that test.

About your questions:

>1. We plan to digitise the documents to produce TIFF files. We would
>want the user to be able to search in the full text of the docs and find
>a list of hits, each of which takes them to the relevant page – what
>they will see is the image (.gif or .png ??) of that page. The user
>should also be able to navigate through the images of the pages one at a
>time, forward and backwards.
>
>I assume we can implement this via Greenstone, but will it handle
>searching possibly 900,000 pages?
>
>Would built-in Greenstone web server handle this volume?
>Also I assume the full text searching could be done on the OCR (we have
>no budget for re-keying the text)?

Greenstone will handle this number of files without problems. We work
with a lot of images, all in TIFF format (G4 compression). We run OCR on
all pages (without human re-keying) and import them into Greenstone
using PagedImgPlug. You don't need to convert to GIF or PNG. The users
only need to install a TIFF plugin so that the browser can show
the images. Try www.alternatiff.com for a free plugin. Don't worry
about the volume of images. What really matters is the volume of text.
You can have 1 TB of images but perhaps only a few GB of text. What
Greenstone indexes is the text files, and I think the limit is many GB.


>2. As well as making this collection available on-line, some users will
>need to be able to access it off-line, since they will have little or no
>Internet access. Obviously a CD-ROM will be too small to hold 900,000
>pages – can Greenstone export a collection to DVD-ROM?

Yes, you can export to CD or DVD, but do you think that one DVD can hold
900,000 pages? Have you estimated the space needed to store all the
images and text? Are the images black and white or color?
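
A rough way to make that estimate is sketched below. The per-page sizes
(50 KB per image, 2 KB of OCR text) and the 4.7 GB single-layer DVD
capacity are placeholder assumptions, not measurements from our test;
replace them with averages taken from a sample of your own scans.

```python
import math


def estimate_storage_gb(pages, avg_image_bytes, avg_text_bytes):
    """Return (image_gb, text_gb) using decimal units (1 GB = 10**9 bytes)."""
    image_gb = pages * avg_image_bytes / 1e9
    text_gb = pages * avg_text_bytes / 1e9
    return image_gb, text_gb


def dvds_needed(total_gb, dvd_capacity_gb=4.7):
    """Number of discs needed, assuming a 4.7 GB single-layer DVD."""
    return math.ceil(total_gb / dvd_capacity_gb)


if __name__ == "__main__":
    # Assumed averages: 50 KB per TIFF page, 2 KB of OCR text per page.
    image_gb, text_gb = estimate_storage_gb(900_000, 50_000, 2_000)
    total_gb = image_gb + text_gb
    print(f"images: {image_gb:.1f} GB, text: {text_gb:.1f} GB")
    print(f"DVDs needed: {dvds_needed(total_gb)}")
```

Color or grayscale scans are usually much larger per page than
black-and-white G4 images, so the image figure is the one to measure
most carefully.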

>3. If not, the other possibility would be to place the collection on a
>server in say 3 or more different institutions. Most would have slow or
>no access to the Internet (based in Africa), but users at the
>institutions could access the collection over their local network. This
>would be OK until some new docs were to be added to the collection – is
>there any way to manage the update of a number of identical collections
>so that they remain identical?

I don't know of any other method for updating the indexes on different
computers. I always copy the collection's index folder to the
secondary server to keep both machines up to date.
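
If copying the whole index folder every time becomes too slow, the copy
can be limited to files that changed. The following is only a sketch of
that idea, not a Greenstone feature: it assumes the secondary copy is
reachable as an ordinary path (a mounted share or a removable disk), and
the function names are made up for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are not held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def sync_folder(src, dst):
    """Make dst's files match src: copy new or changed files, delete extras.

    Returns (copied, removed) as lists of relative paths.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    src_manifest = build_manifest(src)
    dst_manifest = build_manifest(dst)

    copied = []
    for rel, digest in src_manifest.items():
        if dst_manifest.get(rel) != digest:
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src / rel, target)
            copied.append(rel)

    removed = []
    for rel in dst_manifest:
        if rel not in src_manifest:
            (dst / rel).unlink()
            removed.append(rel)
    return copied, removed
```

Comparing the two manifests afterwards (build_manifest(src) ==
build_manifest(dst)) is a simple check that the copies really are
identical.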

>4. We plan to collect metadata at the time of scanning (maybe to simple
>text files). Would there be any automated way to upload this to the
>created collection, or would it be a matter of a very large cut and
>paste exercise using the Librarian interface?

My advice: don't use GLI for this job. Simply scan into folders (one
folder per document) and put a .item file inside each one (it is used by
PagedImgPlug).

It is simple: put the TIFF files and text files in a folder, and inside
it create a .item file, e.g. doc.item.
This .item file contains the metadata for the document and a list of the
image and text files that make up the document.
PagedImgPlug then processes these .item files, linking the images into a
single document.
There are plugin options for creating thumbnails or small preview-size
images from the main page images.

There are some brief instructions in the header of the plugin file.

Note: PagedImgPlug is available in the Greenstone 2.50 release.

The format of the .item file is: first all metadata fields, and then one
line for each image with this format:
Page_number:image_file_name:text_file_name:

<Title>Document 1
<Subject>Test
<Author>Erik Andersen
1:F6N869510.TIF:F6N869510.txt:
2:F6N869511.TIF:F6N869511.txt:
3:F6N869512.TIF:F6N869512.txt:
4:F6N869513.TIF:F6N869513.txt:
5:F6N869514.TIF:F6N869514.txt:
6:F6N869515.TIF:F6N869515.txt:
7:F6N869516.TIF:F6N869516.txt:
8:F6N869517.TIF:F6N869517.txt:
9:F6N869518.TIF:F6N869518.txt:
10:F6N869519.TIF:F6N869519.txt:
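
Since the metadata is collected at scanning time, a file like the one
above can be generated by a small script instead of being typed by hand.
This is a minimal sketch that only follows the format described here;
the function names are made up, and it assumes each page image has a
text file with the same base name in the same folder.

```python
from pathlib import Path


def build_item_text(metadata, pages):
    """Return .item content: metadata lines, then one numbered line per page.

    metadata: dict of field name -> value, e.g. {"Title": "Document 1"}
    pages: ordered list of (image_file_name, text_file_name) pairs
    """
    lines = [f"<{field}>{value}" for field, value in metadata.items()]
    for number, (image_name, text_name) in enumerate(pages, start=1):
        lines.append(f"{number}:{image_name}:{text_name}:")
    return "\n".join(lines) + "\n"


def find_pages(folder):
    """Pair each TIFF in folder with its same-named .txt file, sorted by name."""
    folder = Path(folder)
    images = sorted(
        (p for p in folder.iterdir() if p.suffix.lower() in (".tif", ".tiff")),
        key=lambda p: p.name,
    )
    pages = []
    for image in images:
        text = image.with_suffix(".txt")
        if not text.exists():
            # Assumption: every page has OCR text; stop rather than guess.
            raise FileNotFoundError(f"no text file for {image.name}")
        pages.append((image.name, text.name))
    return pages


def write_item_file(folder, metadata, item_name="doc.item"):
    """Write the .item file into the document folder and return its path."""
    item_path = Path(folder) / item_name
    item_path.write_text(build_item_text(metadata, find_pages(folder)),
                         encoding="utf-8")
    return item_path
```

For example, write_item_file("scans/doc1", {"Title": "Document 1",
"Subject": "Test"}) would create scans/doc1/doc.item; the folder name is
only an illustration. Pages are numbered in file-name order, so the
scanner's file names need to sort in page order.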


Since you are working with TIFF files, you need a viewer that lets the
browser display the images. Take a look at www.alternatiff.com (a TIFF
image viewer for Windows web browsers). Then, in collect.cfg, you need to
add this format statement to view the images:

format DocumentText
"<center><b>[parent:Title]</b></center><br><br><br><table border=0
align=center WIDTH=750><tr><td align=center><embed width=550 height=950
src=_httpcollection_/index/assoc/[parent:assocfilepath]/[Image]
type=image/tiff toolbar=top></td></tr></table>
[Text]"

And that's all!

Hope this helps. Ask if you need anything else.

Diego Spano
Archivo Digital
Secretaria de DD. HH.
Ministerio de Justicia y DD. HH.
Tel: (5411)-4382-6404
djspano@jus.gov.ar


-----Original Message-----
From: greenstone-users-bounces@list.scms.waikato.ac.nz
[mailto:greenstone-users-bounces@list.scms.waikato.ac.nz] On behalf of
Mike
Sent: Wednesday, February 9, 2005 12:45 p.m.
To: greenstone-users@list.scms.waikato.ac.nz
Subject: [greenstone-users] Basic questions about Greenstone


Hi,

We are planning to make available a collection of approx 600,000 -
900,000 pages of scanned documents both on-line and off-line and I was
hoping to find out the answers to some basic questions about Greenstone,
by drawing on your experiences.

The likely OS for hosting the collection is Linux.


1. We plan to digitise the documents to produce TIFF files. We would
want the user to be able to search in the full text of the docs and find
a list of hits, each of which takes them to the relevant page – what
they will see is the image (.gif or .png ??) of that page. The user
should also be able to navigate through the images of the pages one at a
time, forward and backwards.

I assume we can implement this via Greenstone, but will it handle
searching possibly 900,000 pages?

Would built-in Greenstone web server handle this volume?

Also I assume the full text searching could be done on the OCR (we have
no budget for re-keying the text)?

2. As well as making this collection available on-line, some users will
need to be able to access it off-line, since they will have little or no
Internet access. Obviously a CD-ROM will be too small to hold 900,000
pages – can Greenstone export a collection to DVD-ROM?

3. If not, the other possibility would be to place the collection on a
server in say 3 or more different institutions. Most would have slow or
no access to the Internet (based in Africa), but users at the
institutions could access the collection over their local network. This
would be OK until some new docs were to be added to the collection – is
there any way to manage the update of a number of identical collections
so that they remain identical?

4. We plan to collect metadata at the time of scanning (maybe to simple
text files). Would there be any automated way to upload this to the
created collection, or would it be a matter of a very large cut and
paste exercise using the Librarian interface?

Regards,
Mike Cave
-----------------------------------------------------
Technical Development Manager
FORCED MIGRATION ONLINE
Refugee Studies Centre
University of Oxford
Queen Elizabeth House
21 St Giles
Oxford OX1 3LA
E-mail: mike.cave@qeh.ox.ac.uk
Tel: +44-1865-270262
Fax: +44-1865-270297
http://www.forcedmigration.org

