Thanks for your help; I want to know which book has page numbers of more than four digits! (Encyclopedia? LCSH? OED?)

Seriously, a number of the works I am scanning and putting into our collection contain large numbers of government figures, trade figures etc., and many go beyond four digits (though not much beyond 6), so this is a great thing to know! (Unfortunately I will have to stop using MG++ and go back to MG.)

Are there any other secret flags that might be worth knowing?

Also, how do MG (and MG++) handle special characters? I have noticed that the 'source and documentation' collection (which is an excellent resource; why is it hidden amongst the test collections on the nzdl2 server?)
http://nzdl2.cs.waikato.ac.nz/cgi-bin/library?site=localhost&a=p&p=about&c=gsdldocs&ct=0&l=en&w=utf-8
seems to handle characters other than the basic alphanumeric ones differently. My experience is that it seems to ignore underscores (_) when I try to search .dm files. Am I doing something wrong?

Thanks for the help,
Cheers,
Stephen

> indexing. there is now a -M option to mg_passes, which specifies the max number of
> digits allowed in a word.
> this can be used in your collection by adding
> maxnumeric 6 (or whatever)
> to the collect.cfg file.
>>Greenstone does this on purpose for indexing numbers - it breaks them up
>>into 4-digit groups, otherwise page numbers etc could greatly increase
>>the size of the dictionary and lead to not-so-good compression.

_________________________________________________
Stephen De Gabrielle
Digitisation Officer
AraDA Project
Northern Territory University Library
http://www.ntu.edu.au/library
Tel: (08) 8946 7009
from overseas: 61 8 8946 7009
Postal address: P.O.Box 41246, Casuarina, NT, 0811, Australia
CRICOS Provider No: 00300K

Stefan Boddie <sjboddie@cs.waikato.ac.nz>
21/07/2003 03:27 PM CST

To: Katherine Don <kjdon@cs.waikato.ac.nz>
cc: Stephen.DeGabrielle@ntu.edu.au, greenstone-devel@list.scms.waikato.ac.nz
bcc:
Subject: Re: [greenstone-devel] a simple patch to allow collection builders to assign a document identifier (OID)

Yup, set maxnumeric to some suitably large value in your collect.cfg then rebuild the collection.
Stefan.

Katherine Don wrote:
> hi Stephen,
>
> this is done by MG - there's a maxnumeric variable, which defaults to 4. This was
> originally a #define, but has been changed so that it can be specified during
> indexing. there is now a -M option to mg_passes, which specifies the max number of
> digits allowed in a word.
> this can be used in your collection by adding
> maxnumeric 6 (or whatever)
> to the collect.cfg file.
>
> Stefan B, is that all that needs to be done? will querying of the collection use
> this maxnumeric thingy too??
>
> Note, I think this is currently not available with mgpp (due to an oversight). I'll
> try and stick it in at some stage - if it's urgent, let me know.
>
> Katherine Don
>
>>>One problem; we have included our 'Barcode' metadata in the default index,
>>>but when we tried to search it weirdly split search term "C10001":
>>>
>>>>Word count: C1000: 2, 1: 13
>>>>2 documents matched the query.
>>>
>>>We used the double quotes but it still split the term into 'C1000' and '1'.
>>>Any ideas what went wrong here? Does MG have problems with large numbers or
>>>other nontext characters?
>>
>>Greenstone does this on purpose for indexing numbers - it breaks them up
>>into 4-digit groups, otherwise page numbers etc could greatly increase
>>the size of the dictionary and lead to not-so-good compression.
>>
>>Unfortunately I couldn't find where this is done in the c++ code, so
>>hopefully someone who knows the code better than I do can tell you where
>>this happens.
>>
>>John
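
For anyone following the thread, here is a rough Python sketch of the splitting behaviour described above. It is not MG's actual code (the real logic lives in MG's C sources); the function name `split_terms` and the rule that any non-alphanumeric character ends a term are assumptions made purely for illustration. It only tries to reproduce the reported symptom: with a limit of 4 digits, "C10001" comes out as "C1000" and "1".

```python
def split_terms(text, max_numeric=4):
    """Split text into index terms, allowing at most max_numeric digits per term.

    Illustrative only: mimics the behaviour reported in this thread,
    not MG's real word parser.
    """
    terms = []
    current = []
    digits = 0
    for ch in text:
        if not ch.isalnum():
            # Assumption: any non-alphanumeric character ends the current term.
            if current:
                terms.append("".join(current))
            current, digits = [], 0
            continue
        if ch.isdigit() and digits == max_numeric:
            # One more digit would exceed the limit, so start a new term here.
            if current:
                terms.append("".join(current))
            current, digits = [], 0
        current.append(ch)
        if ch.isdigit():
            digits += 1
    if current:
        terms.append("".join(current))
    return terms


print(split_terms("C10001"))                 # ['C1000', '1']
print(split_terms("C10001", max_numeric=6))  # ['C10001']
```

With the limit raised to 6, as in the `maxnumeric 6` setting suggested for collect.cfg, the barcode stays in one piece in this sketch.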