[greenstone-devel] Re: Lucene

From Katherine of Greenstone Team
Date Mon Sep 1 10:46:43 2008
Subject [greenstone-devel] Re: Lucene
In-Reply-To (48BA89E8-2040307-cs-waikato-ac-nz)
Hi Diego

As David has said, accent folding has not been implemented for Lucene,
only for mgpp.
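
For reference, the folding behaviour you describe can be sketched in a few
lines. This is only a minimal illustration in Python, assuming that Unicode
NFD decomposition is an acceptable way to strip accents. It is not how mgpp
or Lucene implement it, and the names fold_accents, build_index and search
are made up for the example.

```python
import unicodedata


def fold_accents(text):
    """Lowercase the text and drop combining marks (accents, tildes)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def build_index(docs):
    """Map each folded word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(fold_accents(word), set()).add(doc_id)
    return index


def search(index, query):
    """Fold the query the same way as the indexed words, then look it up."""
    return sorted(index.get(fold_accents(query), set()))


# Hypothetical sample documents, not taken from any real collection.
docs = {
    "doc1": "la remediación del suelo",
    "doc2": "remediacion pendiente",
    "doc3": "el ñandú corre",
}
index = build_index(docs)
```

Because the same folding is applied at indexing time and at query time,
search(index, "remediacion") and search(index, "remediación") return the
same documents in this sketch.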

Our incremental import is still very basic, but it should do what you have
described. I'll have a look at it. We were hoping to make the
incremental import much better for this release, but have run out of
time to do that. At least it should do what it used to do, which is
exactly the situation you are describing.
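
Just so we are talking about the same thing, here is a rough sketch of the
behaviour I understand you to expect from an incremental import. It is a
simplified illustration in Python, not Greenstone's actual import code; the
function name and the idea of comparing plain file names are assumptions
made for the example.

```python
def select_new_documents(import_files, already_archived):
    """Return the import files that have not been imported yet, in order."""
    archived = set(already_archived)
    return [name for name in import_files if name not in archived]


# The scenario from the report: 4 documents imported earlier, 3 added later.
# The file names are hypothetical.
first_batch = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf"]
second_batch = ["doc5.pdf", "doc6.pdf", "doc7.pdf"]

to_process = select_new_documents(first_batch + second_batch, first_batch)
print(to_process)
```

With 7 files in the import folder and 4 already in archives, only the 3 new
ones are selected for processing.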


David Bainbridge wrote:
> Diego Spano wrote:
>> Hi you all,
>> I know that all of you are working to finish the next GS release, but let me
>> explain some things that I think are very important.
>> For the last 8 or 9 months I have been testing the Lucene indexer in GS 2.80
>> because we need incremental building and fuzzy search, options not present in
>> MGPP and MG.
>> My first surprise was that someone told me I was the first person trying to
>> use Lucene with IIS. So, after many tests, I decided to move to the Linux
>> platform, where Lucene is supposed to work OK.
>> Incremental import is not working. Suppose I have 4 docs in the import folder,
>> then run import and get 4 docs in archives. If I add 3 more docs to import
>> (7 docs in total) and run import with the -incremental flag, it should only
>> process the last 3 docs and create 3 more folders in archives, but it
>> processes nothing.
>> Incremental building is working very well; only new docs are added to the
>> index.
>> The problems begin with indexing and searching accented words. Lucene is
>> also supposed to find the words even when I search without accents, but it
>> does not. Some examples:
>> 1- Query string: remediacion
>> Search results= 0 documents
>> 2- Query string: remediación
>> Search results= 7 documents
>> 3- Query string: remediacion~ (for fuzzy search)
>> Search results= 24 documents because the fuzzy search includes the following
>> terms:
>> presentacion: 1, remodelaci?n: 1, renegociaci?n: 1, reubicaci?n: 1,
>> recomendaci?n: 2, coordinacion: 6, realizaci?n: 7, remediaci?n: 12,
>> presentaci?n: 35
>> If I search for remediacion, Lucene should find both "remediacion" and
>> "remediación" (this is what the mgpp accentfold option does). But this is
>> not the case!
>> Searching Google for how Lucene handles accented words, I found some things:
>> Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents from
>> ISO-8859-1 text. But thinking about the import process (e.g. with PDF docs), GS
>> converts the original document into XML format with UTF-8 encoding (the
>> doc.xml file in the archives folders). The question is: is the Lucene module
>> involved in the import process or only in the build process? I think that it
>> has nothing to do with import, so when it has to process doc.xml files it
>> can't apply a filter for Latin characters when the input is UTF-8.
>> What about a UTF8AccentFilter?
>> I assume that during indexing Lucene converts all these characters to their
>> unaccented versions (Ñ -> N, á -> a, etc.), and does the same thing during
>> query processing. So "ñandú" becomes "nandu" in the index, and when the user
>> searches for "ñandú" the query parser translates it to "nandu" and finds it
>> in the index. Am I right?
>> I thought that perhaps the problem was with my Linux and GS installation, so
>> I tested it with CentOS 5 and Fedora 8, and got the same results in both.
>> I did a search in the two demo collections that David told me about:
>> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/spano-lucene&l=en&w=utf-8
>> and
>> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/lucenete&l=en&w=utf-8
>> Try to search for "funcion" and then "función" in the text index of the
>> first one.
>> Try to search for "economia" and "economía" in the second one.
> Diego,
> Thanks for your message and the helpful pointers to accent folding
> support for Lucene.
> Could you please clarify precisely what happens for you when searching
> for "funcion" and so on in the above? For me they work as I would
> expect. Please also note the two collections are built from the same
> source documents.
> Searching the mgpp collection for "funcion" (accent folding *off*)
> produces 0 hits. Searching for "función" produces 1 hit. Going to
> the preferences page and switching the setting to accent folding *on*
> ("ignore accents" in the English interface), then going to the search
> page results in 1 hit whether I use "funcion" or "función" as my
> search term.
> In the Lucene collection searching for "función" produces 1 hit and
> "funcion" no hits. In the Lucene collection accents must always
> match. This is why there is no provision through the preference pages
> to switch accent folding on or off.
> Searching for "economia" or "economía" in *either* collection
> produces no hits for me. I am not sure why you use a different query in
> the second collection to the first (as I mentioned above they are
> built from the same source documents). I think this might be the
> source of some confusion. To the best of my knowledge, checking by
> hand the HTML pages generated, there are no words at all starting with
> "econ" in the collection.
> David.
>> MGPP is a very good engine but its main disadvantage is that it has no
>> incremental building support. I have very big collections with thousands of
>> documents and a full rebuild is very hard.
>> Lucene is an extremely powerful engine but the GS implementation is very weak
>> for Spanish users. I'm not an experienced programmer but I think the
>> solution to the problems I mentioned is not so far away.
>> Thanks a lot for your time.
>> Diego Spano