[greenstone-devel] maxnumeric: WAS 'a simple patch ...'

From Stephen.DeGabrielle@ntu.edu.au
Date Tue, 22 Jul 2003 08:52:09 +0930
Subject [greenstone-devel] maxnumeric: WAS 'a simple patch ...'

Thanks for your help. I want to know which book has page numbers of more than four digits! (Encyclopedia? LCSH? OED?)

Seriously, a number of the works I am scanning and putting into our collection contain large numbers of government figures, trade figures, etc., and many go beyond four digits (though not much beyond six), so this is a great thing to know! (Unfortunately I will have to stop using MG++ and go back to MG.)

Are there any other secret flags that might be worth knowing?

Also, how do MG (and MG++) handle special characters?

I have noticed that in the 'source and documentation' collection (which is an excellent resource; why is it hidden amongst the test collections on the nzdl2 server?)

http://nzdl2.cs.waikato.ac.nz/cgi-bin/library?site=localhost&a=p&p=about&c=gsdldocs&ct=0&l=en&w=utf-8

it seems to handle characters other than the basic alphanumeric ones differently.

My experience is that it seems to ignore underscores (_) when I try to search .dm files. Am I doing something wrong?

 

Thanks for the help,

Cheers,

Stephen

 

 


> indexing. there is now a -M option to mg_passes, which specifies the max number of
> digits allowed in a word.
> this can be used in your collection by adding
> maxnumeric 6 (or whatever)
> to the collect.cfg file.

>>Greenstone does this on purpose for indexing numbers - it breaks them up
>>into 4-digit groups, otherwise page numbers etc could greatly increase
>>the size of the dictionary and lead to not-so-good compression.

 

_________________________________________________
Stephen De Gabrielle
Digitisation Officer
AraDA Project

Northern Territory University Library
http://www.ntu.edu.au/library
Tel: (08) 8946 7009 from overseas: 61 8 8946 7009
Postal address: P.O.Box 41246, Casuarina, NT, 0811, Australia
CRICOS Provider No: 00300K

 
Stefan Boddie <sjboddie@cs.waikato.ac.nz>
21/07/2003 03:27 PM CST

To: Katherine Don <kjdon@cs.waikato.ac.nz>
cc: Stephen.DeGabrielle@ntu.edu.au, greenstone-devel@list.scms.waikato.ac.nz
bcc:
Subject: Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)


Yup, set maxnumeric to some suitably large value in your collect.cfg
then rebuild the collection.

Stefan.

Katherine Don wrote:
> hi Stephen,
>
> this is done by MG - there's a maxnumeric variable, which defaults to 4. This was
> originally a #define, but has been changed so that it can be specified during
> indexing. there is now a -M option to mg_passes, which specifies the max number of
> digits allowed in a word.
> this can be used in your collection by adding
> maxnumeric 6 (or whatever)
> to the collect.cfg file.
>
> Stefan B, is that all that needs to be done? will querying of the collection use
> this maxnumeric thingy too??
>
> Note, I think this is currently not available with mgpp (due to an oversight). I'll
> try and stick it in at some stage - if it's urgent, let me know.
>
> Katherine Don
>
>
>
>>>One problem: we have included our 'Barcode' metadata in the default index,
>>>but when we tried to search, it weirdly split the search term "C10001":
>>>
>>>>Word count: C1000: 2, 1: 13
>>>>2 documents matched the query.
>>>
>>>We used the double quotes but it still split the term into 'C1000' and '1'.
>>>Any ideas what went wrong here? Does MG have problems with large numbers or
>>>other nontext characters?
>>
>>Greenstone does this on purpose for indexing numbers - it breaks them up
>>into 4-digit groups, otherwise page numbers etc could greatly increase
>>the size of the dictionary and lead to not-so-good compression.
>>
>>Unfortunately I couldn't find where this is done in the C++ code, so
>>hopefully someone who knows the code better than I do can tell you where
>>this happens.
>>
>>John
>>
>