Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)

From Katherine Don
Date Tue, 22 Jul 2003 09:19:39 +1200
Subject Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)
In-Reply-To (20030721074326-GA3863-wesson-cs-waikato-ac-nz)
Hi Stephen,

This is done by MG: there's a maxnumeric variable, which defaults to 4. It was
originally a #define, but has been changed so that it can be specified during
indexing. There is now a -M option to mg_passes, which specifies the maximum number of
digits allowed in a word.
You can use this in your collection by adding
maxnumeric 6 (or whatever value you need)
to the collect.cfg file.
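
To illustrate the effect, here is a rough Python sketch of the splitting behaviour. It is not MG's actual C code: the function name split_terms and its max_numeric parameter are made up for this example, and it assumes a term is simply cut once it already holds the maximum number of digits and another digit arrives.

```python
def split_terms(text, max_numeric=4):
    """Split text into alphanumeric terms, starting a new term whenever
    the current one already holds max_numeric digits and another digit
    arrives. Illustrative only; not taken from the MG source."""
    terms = []
    current = []
    digits = 0
    for ch in text:
        if not ch.isalnum():
            # A non-alphanumeric character ends the current term.
            if current:
                terms.append("".join(current))
            current, digits = [], 0
            continue
        if ch.isdigit():
            if digits == max_numeric:
                # Digit limit reached: close this term and start a new one.
                terms.append("".join(current))
                current, digits = [], 0
            digits += 1
        current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


print(split_terms("C10001"))      # ['C1000', '1']
print(split_terms("C10001", 6))   # ['C10001']
```

With the default of 4 this reproduces the "C1000" and "1" split reported for the barcode, while a limit of 6 keeps "C10001" as a single term.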

Stefan B, is that all that needs to be done? Will querying of the collection use
this maxnumeric setting too?

Note: I think this is currently not available with mgpp (due to an oversight). I'll
try to add it at some stage; if it's urgent, let me know.

Katherine Don


> > One problem: we have included our 'Barcode' metadata in the default index,
> > but when we tried to search it, it oddly split the search term "C10001":
> > >Word count: C1000: 2, 1: 13
> > >2 documents matched the query.
> > We used the double quotes but it still split the term into 'C1000' and '1'.
> > Any ideas what went wrong here? Does MG have problems with large numbers or
> > other nontext characters?
>
> Greenstone does this on purpose for indexing numbers - it breaks them up
> into 4-digit groups, otherwise page numbers etc could greatly increase
> the size of the dictionary and lead to not-so-good compression.
>
> Unfortunately I couldn't find where this is done in the C++ code, so
> hopefully someone who knows the code better than I do can tell you where
> this happens.
>
> John
>