[greenstone-devel] RE: Lucene

From: Diego Spano
Date: Tue Sep 2 01:38:55 2008
Subject: [greenstone-devel] RE: Lucene
In-Reply-To: (48BB1F48-8070701-cs-waikato-ac-nz)

Is it hard to implement accent folding for Lucene? Can I hope to get it in a
release soon? I have 2 pending implementations (one of them to manage
thousands of images with OCR) and I need Lucene incremental building, but if
it will not handle accents I will have to use mgpp.



-----Original Message-----
From: Katherine of Greenstone Team [mailto:greenstone_team@cs.waikato.ac.nz]
Sent: Sunday, August 31, 2008 19:47
To: Diego Spano
CC: David Bainbridge; greenstone-devel@list.scms.waikato.ac.nz
Subject: Re: Lucene

Hi Diego

As David has said, accent folding has not been implemented for Lucene, only
for mgpp.
Our incremental import is still very basic, but it should do what you have
described. I'll have a look at it. We were hoping to improve incremental
import considerably for this release, but have run out of time to do that.
At least it should still do what it used to do, which is exactly the
situation you are describing.


David Bainbridge wrote:
> Diego Spano wrote:
>> Hi you all,
>> I know that all of you are working to finish the next GS release, but
>> let me explain some things that I think are very important.
>> For the last 8 or 9 months I have been testing the Lucene indexer in
>> GS 2.80 because we need incremental building and fuzzy search, options
>> not present in MGPP and MG.
>> My first surprise was that someone told me that I was the first
>> person trying to use Lucene with IIS. So, after many tests I decided
>> to move to the Linux platform, where Lucene is supposed to work OK.
>> Incremental import is not working. Suppose I have 4 docs in the import
>> folder, then run import and get 4 docs in archives. If I add 3 more
>> docs to import (7 docs in total) and run import with the -incremental
>> flag, it should only process the last 3 docs and create 3 more folders
>> in archives, but it processes nothing.
>> Incremental building is working very well, only new docs are added to
>> the index.
>> Problems begin with indexing and searching accented words. Lucene is
>> supposed to find the words even when I search without accents, but it
>> does not. Some examples:
>> 1- Query string: remediacion
>> Search results = 0 documents
>> 2- Query string: remediación
>> Search results = 7 documents
>> 3- Query string: remediacion~ (for fuzzy search)
>> Search results = 24 documents, because the fuzzy search includes the
>> following terms:
>> presentacion: 1, remodelación: 1, renegociación: 1, reubicación: 1,
>> recomendación: 2, coordinacion: 6, realización: 7, remediación: 12,
>> presentación: 35
>> If I search for remediacion, Lucene should find both "remediacion" and
>> "remediación" (this is what the mgpp accentfold option does). But this
>> is not the case!
>> Searching Google for how Lucene manages accented words, I found some
>> things:
>> Lucene has an accent filter (ISOLatin1AccentFilter) that removes
>> accents from ISO-8859-1 text. But consider the import process (e.g.
>> with PDF docs): GS converts the original document into XML with UTF-8
>> encoding (the doc.xml file in the archives folders). The question is:
>> is the lucene module involved in the import process, or only in the
>> build process? I think it has nothing to do with import, so when it
>> has to process doc.xml files it can't apply a filter for Latin-1 chars
>> when the input is UTF-8.
>> What about a UTF8AccentFilter?
>> I assume that during indexing Lucene converts all these characters to
>> their unaccented versions (Ñ -> N, á -> a, etc.), and does the same
>> during query processing. So "ñandú" becomes "nandu" in the index, and
>> when the user searches for "ñandú" the query parser translates it to
>> "nandu" and finds it in the index. Am I right?
>> I think that perhaps the problem is with my Linux and GS
>> installation. I tested it with CentOS 5 and Fedora 8, with the same
>> results on both.
>> I did a search in two demo collections that David pointed me to:
>> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/spano-lucene&l=en&w=utf-8
>> and
>> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/lucenete&l=en&w=utf-8
>> Try searching the first one for "funcion" and then "función" in the
>> text index.
>> Try searching for "economia" and "economía" in the second one.
> Diego,
> Thanks for your message and the helpful pointers to accent folding
> support for Lucene.
> Could you please clarify precisely what happens for you when searching
> for "funcion" and so on in the above. For me they work as I would
> expect. Please also note that the two collections are built from the
> same source documents.
> Searching the mgpp collection for "funcion" (accent folding *off*)
> produces 0 hits. Searching for "función" produces 1 hit. Going to
> the preferences page and switching the setting to accent folding *on*
> ("ignore accents" in the English interface), then going to the search
> page results in 1 hit whether I use "funcion" or "función" as my
> search term.
> In the Lucene collection, searching for "función" produces 1 hit and
> "funcion" no hits. In the Lucene collection accents must always
> match. This is why there is no provision through the preferences page
> to switch accent folding on or off.
> Searching for "economia" or "economía" in *either* collection
> produces no hits for me. I'm not sure why you used a different query
> in the second collection than in the first (as I mentioned above, they
> are built from the same source documents). I think this might be the
> source of some confusion. To the best of my knowledge, checking the
> generated HTML pages by hand, there are no words at all starting with
> "econ" in the collection.
> David.
>> MGPP is a very good engine, but its main disadvantage is that it has
>> no incremental building support. I have very big collections with
>> thousands of documents, and a full rebuild is very hard.
>> Lucene is an extremely powerful engine, but the GS implementation is
>> very weak for Spanish users. I'm not an experienced programmer, but I
>> think the solution to the problems I mentioned is not so far away.
>> Thanks a lot for your time.
>> Diego Spano
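
The accent-folding behaviour discussed in this thread can be sketched in a
few lines. This is a minimal illustration, not Greenstone or Lucene code: it
assumes that stripping Unicode combining marks after NFD decomposition is an
acceptable stand-in for what Lucene's ISOLatin1AccentFilter does for
ISO-8859-1, applied here to arbitrary UTF-8 text as the UTF8AccentFilter
proposed above would be.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Strip accent marks from a Unicode string.

    Decompose each character (NFD) so that an accented letter becomes
    a base letter plus combining marks, then drop the combining marks.
    A stand-in for Lucene's ISOLatin1AccentFilter, but working on any
    UTF-8 input (the hypothetical UTF8AccentFilter discussed above).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Applying the same folding at index time and at query time is what
# makes accented and unaccented spellings match, as mgpp's accentfold
# option does.
print(fold_accents("remediación"))  # -> remediacion
print(fold_accents("ñandú"))        # -> nandu
```

The key point is that the filter must run in both the analysis chain used
for building the index and the one used for parsing queries; folding on only
one side would make matching worse, not better.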
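
The fuzzy-search results reported above (the query "remediacion~" matching
remediación, presentación, and so on) come from Lucene's fuzzy queries,
which compare terms by Levenshtein edit distance; older Lucene versions
accepted terms whose similarity (roughly 1 minus distance over term length)
exceeded a default threshold of 0.5. A minimal edit-distance sketch,
independent of Lucene:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "remediación" is one substitution (ó for o) away from "remediacion",
# so a fuzzy query picks it up even though exact matching does not.
print(edit_distance("remediacion", "remediación"))  # -> 1
```

This also explains why fuzzy search is only a workaround for missing accent
folding: it pulls in every term within the distance budget (presentacion,
coordinacion, ...), not just the accented variants of the query word.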