Re: [greenstone-devel] appropriateness for a land database

From Stefan Boddie
Date Thu, 1 Apr 2004 14:55:18 +1200
Subject Re: [greenstone-devel] appropriateness for a land database
In-Reply-To (C384D6278A798A4E95E94931DDA0AFE8512684-KPEX01-ksbe-edu)
Hi Helen,

> I'm very impressed with the use of Greenstone for the Hawaiian
> Electronic Library (ulukau.org). However I'm wondering if
> Greenstone can be programmed to be both an independent
> database AND portal to other sources.
>
> I've been assigned the task of developing a database including
> digitized resources, i.e. leases, photos, etc and existing databases.
> One existing database is of GIS data compiled and managed by
> MS' SQLServer. The tool I select must allow users to link easily
> from citations to full-text content. Is Greenstone appropriate for
> compiling such data and executing a cross database search?
> Looking at some examples from your webpage, it does appear to
> go beyond searching the individual digitized collections but I
> wanted to confirm.

There are two ways to do this I guess. The first is to export the data from
your MS SQLServer database and import it into Greenstone. The other way is
to search your SQL database(s) at runtime (from within Greenstone) and
attempt to combine all the results. The first is easier and would possibly
not require any alterations to Greenstone (though you may need to create a
new plugin for importing the data, depending on the format of what you
export from SQL). The second would require some additions to the Greenstone
runtime code to allow it to query your SQL server, etc. That would be quite
a lot of work.
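
As a rough illustration of the first (export) approach, here is a minimal
sketch that dumps the result of a query to CSV. It uses Python's built-in
sqlite3 module purely as a stand-in so the example runs anywhere; for SQL
Server you would substitute a DB-API driver for that database. The table and
column names (parcels, parcel_id, lessee) are made up for the example, and
whether CSV is a suitable import format depends on the Greenstone plugin you
use or write.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, sql, out):
    """Write the rows returned by `sql` to the file-like object `out` as CSV.

    `conn` is any DB-API 2.0 connection. Returns the number of data rows written.
    """
    cur = conn.cursor()
    cur.execute(sql)
    writer = csv.writer(out)
    # The first element of each cursor.description entry is the column name.
    writer.writerow([col[0] for col in cur.description])
    count = 0
    for row in cur:
        writer.writerow(row)
        count += 1
    return count


# Stand-in database with a hypothetical table (assumption, not from the post).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (parcel_id INTEGER, lessee TEXT)")
conn.executemany("INSERT INTO parcels VALUES (?, ?)",
                 [(1, "Kealoha"), (2, "Akana")])

buf = io.StringIO()
rows_written = export_query_to_csv(
    conn, "SELECT parcel_id, lessee FROM parcels ORDER BY parcel_id", buf)
print(buf.getvalue())
```

In practice you would write to a real file instead of a StringIO buffer, then
point the Greenstone import at the exported data.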

I've been working with Bob Stauffer on the ulukau.org site so let Bob or me
know if you do want to leap into customizing Greenstone for your task and
you'd like some help to do so.

> When retrieving data, can Greenstone determine relevance? Is
> there an algorithm for this?

Yes, Greenstone can rank the results of a search, based on how relevant each
document is to the search terms. I don't pretend to understand the algorithm
well - what follows is the way it was described in a previous post by Gordon
Paynter:

When you do a search for one or more terms, MG [that's the search engine
Greenstone uses] calculates the Term Frequency times Inverse Document
Frequency score for each term. (i.e. they divide the number of times a term
occurs in a document by the number of documents in which it occurs overall,
thus selecting the documents where that term is most relevant and normalising
across term frequency in the collection.) At this point you know how
"relevant" each document is to each term, and you combine these TFIDF scores
with something called the "Cosine measure" to give an overall relevance
score for each document. Of course, when you do this for real it's a bit
more complicated, and you have to find the logarithm of a lot of numbers.
But that's all part of the long answer, which you can find in the MG book.
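
To make the idea a little more concrete, here is a small sketch of TF-IDF
weighting combined with the cosine measure. This is not MG's actual
implementation: the particular weighting, (1 + log tf) * log(1 + N/df), is one
common textbook variant chosen as an assumption, and the function name and
sample documents are made up for the example.

```python
import math
from collections import Counter


def rank(query, docs):
    """Rank documents against a query using TF-IDF weights and cosine similarity.

    `query` is a list of terms; `docs` is a list of term lists.
    Returns (score, doc_index) pairs, best match first.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in docs for t in set(tokens))
    # Inverse document frequency (assumed variant, not necessarily MG's).
    idf = {t: math.log(1 + n / df[t]) for t in df}

    def weigh(tokens):
        # Log-dampened term frequency times IDF; unknown terms are ignored.
        return {t: (1 + math.log(c)) * idf[t]
                for t, c in Counter(tokens).items() if t in idf}

    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    q = weigh(query)
    q_norm = norm(q)
    scores = []
    for i, tokens in enumerate(docs):
        d = weigh(tokens)
        d_norm = norm(d)
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        # Cosine measure: dot product divided by the two vector lengths.
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scores.append((score, i))
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    ["land", "lease", "honolulu"],
    ["photo", "of", "land"],
    ["lease", "lease", "terms"],
]
for score, index in rank(["lease"], docs):
    print(index, round(score, 3))
```

The document that mentions "lease" twice ranks first, the one that mentions it
once ranks second, and the one that never mentions it scores zero.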

> I'd like to be able to import MARC records (I have yet to decide if
> I'll use MARC or Dublin Core as the base protocol).

Greenstone does have support for importing MARC records so that shouldn't be
too difficult.

> Mahalo for your time and consideration on these questions.

Good luck.

Stefan Boddie