Hi Stephen,
This is done by MG: there's a maxnumeric variable, which defaults to 4. It was
originally a #define, but has been changed so that it can be specified during
indexing. There is now a -M option to mg_passes, which specifies the maximum
number of digits allowed in a word.
You can use this in your collection by adding
maxnumeric 6 (or whatever value you need)
to the collect.cfg file.
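
As a rough illustration of what the limit does, here is a small Python sketch.
It is not MG's actual code (MG is written in C): the function name split_terms
and the exact splitting rule are assumptions, inferred from the "C1000" / "1"
result quoted below. It also ignores MG's other limits, such as the maximum
word length.

```python
def split_terms(text, maxnumeric=4):
    """Split text into index terms (hypothetical helper, not part of MG).

    A term is a run of letters and digits. Once a term already holds
    `maxnumeric` digits, the next digit starts a new term.
    """
    terms = []
    current = ""
    digits = 0
    for ch in text:
        if not ch.isalnum():
            # Any non-alphanumeric character ends the current term.
            if current:
                terms.append(current)
            current, digits = "", 0
            continue
        if ch.isdigit():
            if digits == maxnumeric:
                # Digit limit reached: close this term and start another.
                terms.append(current)
                current, digits = "", 0
            digits += 1
        current += ch
    if current:
        terms.append(current)
    return terms


print(split_terms("C10001"))                # ['C1000', '1']
print(split_terms("C10001", maxnumeric=6))  # ['C10001']
```

With the default of 4 the barcode is cut after its fourth digit, which matches
the word counts you saw; with maxnumeric 6 it stays in one piece.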
Stefan B, is that all that needs to be done? Will querying the collection use
this maxnumeric setting too?
Note: I think this is currently not available with mgpp (due to an oversight).
I'll try to add it at some stage; if it's urgent, let me know.
Katherine Don
> > One problem: we have included our 'Barcode' metadata in the default index,
> > but when we tried to search it, it weirdly split the search term "C10001":
> > >Word count: C1000: 2, 1: 13
> > >2 documents matched the query.
> > We used the double quotes but it still split the term into 'C1000' and '1'.
> > Any ideas what went wrong here? Does MG have problems with large numbers or
> > other nontext characters?
>
> Greenstone does this on purpose for indexing numbers - it breaks them up
> into 4-digit groups, otherwise page numbers etc could greatly increase
> the size of the dictionary and lead to not-so-good compression.
>
> Unfortunately I couldn't find where this is done in the C++ code, so
> hopefully someone who knows the code better than I do can tell you where
> this happens.
>
> John