Re: [greenstone-users] Newbie with general questions about Greenstone.

From George Buchanan
Date Tue, 7 Oct 2003 10:42:01 +1300
Subject Re: [greenstone-users] Newbie with general questions about Greenstone.
In-Reply-To (886EF25AF8BEF64EB89A820EF84064FF41BF04-UCMAIL4)
> 1 -- Sizing issues:
> Are there any size limits (number of items, number of meta-data
> categories, etc.) applicable to Greenstone collections, either as specific
> limits, or targets that if exceeded could affect performance? What are some
> examples of the largest collections?
...there are size limits. Without going into complicated technical
explanations: collections have been built from several gigabytes of text,
and the size of the compressed (built) text is the main limiter on
Greenstone collection size. There aren't any practical limits on the
number of metadata categories, and in theory there can be billions of
documents, so neither tends to be an obstacle in practice. In my
experience, the more common limiting factors are the largest file size
supported by the operating system (2 GB is a common limit) and physical
hard drive space. It is worth remembering that the built text index is
actually smaller than the original files, so the original documents can be
significantly larger.
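
As a rough illustration of the file-size point above, here is a minimal Python sketch that walks a directory and reports files approaching a per-file limit. It is not part of Greenstone; the function name, the 2 GiB default and the 90% warning threshold are all assumptions chosen for the example.

```python
import os

TWO_GIB = 2 * 1024 ** 3  # assumed per-file limit (2 GiB), as mentioned above


def files_near_limit(root, limit=TWO_GIB, fraction=0.9):
    """Return sorted (path, size) pairs for files of at least fraction * limit bytes."""
    threshold = int(limit * fraction)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= threshold:
                hits.append((path, size))
    return sorted(hits)
```

Pointing this at a collection's built directory after a build would show whether any single file is getting close to the operating system's limit.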

The UN collections such as the Humanity Development Library
represent some of our larger collections - each of these contains
the content of literally thousands of books.

> 2 -- Image Collections:
> It appears to me that Greenstone was originally conceived for digital
> libraries of texts, but has also been adapted to digital image collections.
> Is this right? In this regard:
> *One of the documents referred to the "ImagePlug" being only available
> for Unix Platforms. Is this current info?
...no - but you have to ensure that the correct imaging software for your
platform is installed on your machine. On Windows, for example, you
have to download and install ImageMagick.

> *Are there any image formats supported by browsers that are not
> supported by Greenstone?
...Greenstone is pretty agnostic about the image files that it uses - they
are just data to it. We've built many collections of web pages, for
example, so commonplace web image files such as PNG, GIF or JPEG
are perfectly fine. If you wish to extract information from more
advanced image file formats for the purposes of searching, you may
have to create a plugin to get that information out of the files.

> *Has Greenstone been successfully integrated with any Image Server
> platforms, for delivering high resolution images, such as Mr. Sid, or
> TrueSpectra, or using the JPEG2000 format, for larger resolution images?
...I'm not aware of this having been done - however, quickly reviewing
those products, a technically competent person should be able to achieve
some integration - clearly I can't give a detailed reply on a general
question.

> *Our primary interest is in an Open Source solution for searchable,
> web-based collections of digital images. Are there any other open-source
> solutions (other than Greenstone) that we should review?
...that's not my field of expertise as such - I'll have to get back to you
on that - perhaps others may be able to give good suggestions.

> 3 -- Import:
> I found some documentation about importing texts, but nothing about
> importing pre-existing databases -- such as an MS Access database, or MySQL
> database -- although I found one mention of a Perl Plugin. Can anyone
> clarify with your experience, or provide pointers to the documentation?
...I've done this myself, in two different ways. In one case I dumped an MS
Access database to a tab-separated text file to use as the basis for the
old index.txt file (if I'm remembering my filenames correctly). Nowadays,
one would make a metadata.xml file instead, by preference. In either case,
that data is then imported by IndexPlug, one of the Perl plugins.
In the other case, I created a bridge from Greenstone to a MySQL
database - this involved more programming, but it allowed me to keep
the data in the original database.
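
To make the first route more concrete, here is a minimal Python sketch that turns a tab-separated dump into metadata.xml-style content. It assumes the DirectoryMetadata / FileSet / FileName / Description / Metadata layout, and that FileName is matched as a regular expression; please check both against the Greenstone documentation for your version. The function name and the "filename" column name are made up for the example.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET


def tsv_to_metadata_xml(tsv_text, filename_column="filename"):
    """Build metadata.xml-style XML from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    root = ET.Element("DirectoryMetadata")
    for row in reader:
        fileset = ET.SubElement(root, "FileSet")
        # Assumption: FileName is a regular expression, so escape literal dots etc.
        ET.SubElement(fileset, "FileName").text = re.escape(row[filename_column])
        description = ET.SubElement(fileset, "Description")
        for column, value in row.items():
            # Skip the filename column, empty cells and surplus unnamed cells.
            if column is None or column == filename_column or not value:
                continue
            ET.SubElement(description, "Metadata", name=column).text = value
    return ET.tostring(root, encoding="unicode")
```

Every other column in the dump becomes one Metadata element named after its column header, so the header row of the export decides the metadata names.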

> 4 -- More on Import:
> We have a specific project in mind for one of our next digital collections
> which will be a simple image collection with only a few basic indexes, and
> low-resolution images, but with a large number of items -- over 500,000.
...that's not a collection which should cause problems to Greenstone.

> We are considering contracting for off-site work, for both the image
> scanning and data entry. What would be the best format for an off-site
> vendor to use for data entry -- an MsAccess database(s), an xml file for
> each item, or both? -- so that the data could later be batch imported into a
> Greenstone collection?
...I'd go with the XML approach - in practice I've found problems with
MS Access in such situations, and in addition, if anything goes wrong, it
is easier to write a program to read an XML file.

> And, how do we record, in this dataset created
> off-site, the file and path name for each image, so that the correct links
> between indexes and images will also be present when batch importing the
> data and images later into Greenstone?
...you'd need to read about the metadata.xml file here - it does the job
that you're describing.
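
Whatever format the vendor delivers, it is worth checking that every record points at an image that actually exists before a batch import. Below is a minimal Python sketch of such a check; it is independent of Greenstone, and the function name and the (record id, relative path) record shape are assumptions for the example.

```python
import os


def check_image_links(records, image_root):
    """Compare (record_id, relative_path) pairs against the files under image_root.

    Returns (missing, unreferenced): records whose image is absent, and
    image files that no record refers to.
    """
    referenced = {}
    for record_id, rel_path in records:
        # Accept either slash style from the vendor's data entry.
        normalised = os.path.normpath(rel_path.replace("\\", "/"))
        referenced[normalised] = record_id
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(image_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            on_disk.add(os.path.normpath(os.path.relpath(full_path, image_root)))
    missing = sorted((rid, path) for path, rid in referenced.items() if path not in on_disk)
    unreferenced = sorted(on_disk - set(referenced))
    return missing, unreferenced
```

Storing paths relative to one agreed image root, rather than absolute paths from the vendor's machines, keeps the links valid when the data set is moved for import.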

> 5 -- Export:
> How would one export a Greenstone collection to other platforms? Are all
> of the item data and indexes stored in GDBM as one would think of a
> 'traditional' database, so that one could 'export' records from GDBM? Or,
> is only summary collection data stored in GDBM?
...GDBM holds the metadata on a document as an XML-type record. One
could write a program to export that - or even use the one we've already
written, which is in the Greenstone distribution - called db2txt...
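
As a sketch of what further processing of such a dump might look like, the Python below parses a text dump into a dictionary of records. The layout it expects is an assumption, not something stated in this message: a "[key]" line, then "<name>value" lines, then a line of hyphens closing each record. Check it against real db2txt output before relying on it. The function name is made up for the example.

```python
def parse_dump(text):
    """Parse an assumed '[key]' / '<name>value' / hyphen-line dump into a dict.

    Returns {key: {name: [value, ...]}}; repeated names keep every value.
    """
    records = {}
    key, fields = None, {}
    for line in text.splitlines():
        if line.startswith("-" * 10):
            # A line of hyphens closes the current record.
            if key is not None:
                records[key] = fields
            key, fields = None, {}
        elif key is None and line.startswith("[") and line.endswith("]"):
            key = line[1:-1]
        elif key is not None and line.startswith("<") and ">" in line:
            name, value = line[1:].split(">", 1)
            fields.setdefault(name, []).append(value)
    if key is not None:
        # Keep a final record that lacks a closing hyphen line.
        records[key] = fields
    return records
```

From a structure like this it is a short step to writing the records out in whatever form a partner system needs.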

> Suppose that after building
> a Greenstone collection, we needed to also share our records and images with
> a consortium-based image/digital server that was not using Greenstone. How
> could we export records, images and data in a standard way?
...there are a few options here - the images themselves would remain
in their native format in Greenstone, so they would be easy. The GDBM
database can be readily exported in an XML-esque form (see above). The
outstanding question would be how to export any plain (prose) text from
MG - but I get the impression that your collection would not include any.
One solution may be to run an OAI server (which I should get round to
completing sometime!) on your computer, but the practical issue here is
more a matter of identifying which consortium/partners are involved, and
what standards they can use. As with any index system, the answer to such
questions for Greenstone is, in principle, 'yes' - the question is how
readily available a solution for your particular standard of choice is.

I'm sorry not to be able to answer these questions in more detail at this
point - but there are numerous solutions to each of your questions, and
these decisions will only become clear as you progressively refine your
goals and knowledge.