Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)

From: John R. McPherson
Date: Mon, 21 Jul 2003 19:43:26 +1200
Subject: Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)
In-Reply-To: (OFC6C150D6-04E6AECF-ON69256D6A-002723A1-69256D6A-002723BA-ntu-edu-au)
On Mon, Jul 21, 2003 at 04:37:30PM +0930, wrote:
> Hi,
> We needed the ability to assign our own unique identifiers to greenstone
> documents; in our case the 'Hash' and 'Incremental' methods of assigning
> identifiers were not suitable as we would like greenstone to use the
> identifiers created by our other systems.
> We have put together a simple patch to allow a new value for the -OIDtype
> flag when using
> to call it simply type;
> -OIDtype barcode -removeold ntlier
> and include a <Metadata name="Barcode">C10001</Metadata> for each document

If you have a custom plugin, it can override the identifier for a
document: define set_OID() in the plugin, and it is used instead of the
default one inherited from BasPlug. Some of the plugins do this: e.g.
the SplitPlug adds a section identifier onto the end of the hash, the
BibTex plugin uses the reference name, and the Database plugin can use any
field out of a database.
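
As a rough illustration of that override pattern only (this is a Python
analogue, not Greenstone's Perl plugin API; the class names, the set_oid()
method and the dict-shaped document are all made up for the sketch):

```python
import hashlib


class BasePlugin:
    """Hypothetical stand-in for a base plugin with a default identifier."""

    def set_oid(self, doc):
        # Default behaviour in this sketch: derive the identifier from a
        # hash of the document text.
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        doc["oid"] = "HASH" + digest[:24]
        return doc["oid"]


class BarcodePlugin(BasePlugin):
    """Hypothetical plugin that prefers a 'Barcode' metadata value."""

    def set_oid(self, doc):
        barcode = doc.get("metadata", {}).get("Barcode")
        if barcode:
            doc["oid"] = barcode
            return barcode
        # No barcode supplied: fall back to the inherited default.
        return super().set_oid(doc)


doc = {"text": "some document text", "metadata": {"Barcode": "C10001"}}
print(BarcodePlugin().set_oid(doc))
```

A plugin that does not define the method simply gets the base behaviour,
which is the fallback described above.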

But the above is a good idea. Someone will probably add it in :p

> One problem; we have included our 'Barcode' metadata in the default index,
> but when we tried to search it, it weirdly split the search term "C10001":
> >Word count: C1000: 2, 1: 13
> >2 documents matched the query.
> We used the double quotes but it still split the term into 'C1000' and '1'.
> Any ideas what went wrong here? Does MG have problems with large numbers or
> other nontext characters?

Greenstone does this on purpose when indexing numbers: it breaks them up
into 4-digit groups, since otherwise page numbers etc. could greatly
increase the size of the dictionary and hurt compression.
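
The split you saw is consistent with a rule along these lines. This is only
a sketch in Python, not the actual indexer code; the function name and the
exact rule (start a new term once the current one already holds four
digits) are assumptions inferred from the "C1000" / "1" output:

```python
MAX_DIGITS = 4  # assumed limit, inferred from the observed split


def split_term(term, max_digits=MAX_DIGITS):
    """Split a term so that no piece holds more than max_digits digits."""
    pieces = []
    current = ""
    digits = 0
    for ch in term:
        # A further digit would exceed the limit, so close the current piece.
        if ch.isdigit() and digits == max_digits:
            pieces.append(current)
            current = ""
            digits = 0
        current += ch
        if ch.isdigit():
            digits += 1
    if current:
        pieces.append(current)
    return pieces


print(split_term("C10001"))  # ['C1000', '1']
```

Under this rule, quoting the term would not help, because the split happens
to the term itself rather than at whitespace.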

Unfortunately I couldn't find where this is done in the C++ code, so
hopefully someone who knows the code better than I do can tell you where
this happens.