[greenstone-devel] Re: Lucene

From David Bainbridge
Date Tue Sep 2 22:47:25 2008
Subject [greenstone-devel] Re: Lucene
In-Reply-To (!-!AAAAAAAAAAAYAAAAAAAAAMqkfLKn3i1CvY3Zg+pc6XzigAAAEAAAABG248j6gRJGjTG2pjJltzABAAAAAA==-orsna-gov-ar)
Diego Spano wrote:
> Hi you all,
> I know that all of you were working to finish next GS release, but let me
> explain some things that I think are very important.
> For the last 8 or 9 months I have been testing the Lucene indexer in GS 2.80
> because we need incremental building and fuzzy search, options not present
> in MGPP and MG.
> My first surprise was that someone told me that I was the first person
> trying to use Lucene with IIS. So, after many tests, I decided to move to
> the Linux platform, where Lucene is supposed to work OK.
> Incremental import is not working. Suppose I have 4 docs in the import folder,
> then run import and get 4 docs in archives. If I add 3 more docs to import
> (7 docs in total) and run import with the -incremental flag, it should only
> process the last 3 docs and create 3 more folders in archives, but it
> processes nothing.
> Incremental building is working very well, only new docs are added to the
> index.
> Problems begin with indexing and searching accented words. Lucene is also
> supposed to find the words even when I search without accents, but it
> does not. Some examples:
> 1- Query string: remediacion
> Search results= 0 documents
> 2- Query string: remediación
> Search results= 7 documents
> 3- Query string: remediacion~ (for fuzzy search)
> Search results= 24 documents because the fuzzy search includes the following
> terms:
> presentacion: 1, remodelaci?n: 1, renegociaci?n: 1, reubicaci?n: 1,
> recomendaci?n: 2, coordinacion: 6, realizaci?n: 7, remediaci?n: 12,
> presentaci?n: 35
> If I search for remediacion, Lucene should find both "remediacion" and
> "remediación" (this is what the mgpp accentfold option does). But this is
> not the case!
> Searching Google for how Lucene handles accented words, I found some
> things:
> Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents on
> ISO-8859-1 text. But consider the import process (e.g. with PDF docs): GS
> converts the original document into XML format with UTF-8 encoding (the
> doc.xml file in archive folders). The question is: is the Lucene module
> involved in the import process or only in the build process? I think that it
> has nothing to do with import, so when it has to process doc.xml files it
> can't apply a filter for Latin-1 chars when the input is UTF-8.
> What about a UTF8AccentFilter?
> I assume that during indexing Lucene converts all these characters to their
> unaccented versions (Ñ -> N, á -> a, etc.), and does the same thing during
> query processing. So "ñandú" becomes "nandu" in the index, and when the user
> searches for "ñandú" the query parser translates it to "nandu" and finds it
> in the index. Am I right?
> I think that perhaps I have the problem with my Linux and GS installation. I
> tested it with Centos 5 and Fedora 8, and the same results in both.
> I ran a search in the two demo collections that David told me about:
> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=
> p&p=about&c=testing/spano-lucene&l=en&w=utf-8
> and
> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=
> p&p=about&c=testing/lucenete&l=en&w=utf-8
> Try searching for "funcion" and then "función" in the text index of the first one.
> Try searching for "economia" and "economía" in the second one.


Thanks for your message and the helpful pointers to accent folding
support for Lucene.

Could you please clarify precisely what happens for you when searching
for "funcion" and so on in the above? For me they work as I would
expect. Please also note that the two collections are built from the same
source documents.

Searching the mgpp collection for "funcion" (accent folding *off*)
produces 0 hits. Searching for "función" produces 1 hit. Going to the
preferences page and switching the setting to accent folding *on*
("ignore accents" in the English interface), then going to the search
page results in 1 hit whether I use "funcion" or "función" as my search
term.

In the lucene collection searching for "función" produces 1 hit and
"funcion" no hits. In the Lucene collection accents must always match.
This is why there is no provision through the preference pages to switch
accent folding on or off.
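
As an illustration of the accent folding discussed above (what mgpp does
when "ignore accents" is on, and what a UTF-8 accent filter would have to
provide for Lucene), here is a minimal Python sketch. It is not Greenstone
or Lucene code: fold_accents, add_document and search are hypothetical
names, and the toy in-memory index only stands in for a real one. The point
it tries to show is that the same folding has to be applied both when terms
are indexed and when the query is parsed.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with accents removed (hypothetical helper)."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy inverted index: folded term -> set of document ids (illustration only).
index: dict[str, set[str]] = {}


def add_document(doc_id: str, text: str) -> None:
    """Index every whitespace-separated word under its folded form."""
    for word in text.split():
        index.setdefault(fold_accents(word), set()).add(doc_id)


def search(query: str) -> set[str]:
    """Fold the query the same way as the indexed terms, then look it up."""
    return index.get(fold_accents(query), set())


add_document("doc1", "la función principal")
add_document("doc2", "el ñandú corre")
```

With this folding in place, search("funcion") and search("función") return
the same document, which is the behaviour being asked for. Because the
folding works on Unicode characters rather than on ISO-8859-1 bytes, it does
not matter that the doc.xml files are stored as UTF-8.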

Searching for "economia" or "economía" in *either* collection produces
no hits for me. I am not sure why you use a different query in the second
collection from the first (as I mentioned above, they are built from the
same source documents). I think this might be the source of some
confusion. To the best of my knowledge, checking by hand the HTML pages
generated, there are no words at all starting with "econ" in the collection.


> MGPP is a very good engine but its main disadvantage is that it has no
> incremental building support. I have very big collections with thousands of
> documents and a full rebuild is very hard.
> Lucene is an extremely powerful engine but the GS implementation is very weak
> for Spanish users. I'm not an experienced programmer but I think the
> solution to the problems I mentioned is not so far away.
> Thanks a lot for your time.
> Diego Spano