Date: Tue Sep 2 22:47:25 2008
Subject: [greenstone-devel] Re: Lucene

Diego Spano wrote:
> Hi all,
> I know that all of you are working to finish the next GS release, but let me
> explain some things that I think are very important.
> For the last 8 or 9 months I have been testing the Lucene indexer in GS 2.80,
> because we need incremental building and fuzzy search, options not present in
> MGPP. My first surprise was that someone told me that I was the first person
> trying to use Lucene with IIS. So, after many tests, I decided to move to a
> Linux platform, where Lucene is supposed to work ok.
> Incremental import is not working. Suppose I have 4 docs in the import folder,
> then run import and get 4 docs in archives. If I add 3 more docs to import
> (7 docs in total) and run import with the -incremental flag, it should only
> process the last 3 docs and create 3 more folders in archives, but it
> processes all 7 again.
> Incremental building is working very well: only new docs are added to the
> index.
> Problems begin with indexing and searching accented words. Lucene is supposed
> to find the words even when I search without accents, but that is not what
> happens. Some examples:
> 1- Query string: remediacion
> Search results = 0 documents
> 2- Query string: remediación
> Search results = 7 documents
> 3- Query string: remediacion~ (for fuzzy search)
> Search results = 24 documents, because the fuzzy search includes the following
> terms (with hit counts):
> presentacion: 1, remodelación: 1, renegociación: 1, reubicación: 1,
> recomendación: 2, coordinacion: 6, realización: 7, remediación: 12,
> presentación: 35
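(Editor's note: Lucene's trailing `~` operator does Levenshtein-based fuzzy matching, which is why the unrelated but similarly spelled terms above are pulled in. A minimal plain-Python sketch of the edit-distance idea, not Lucene's actual implementation; the similarity formula shown is the commonly documented FuzzyQuery one, 1 - distance / shorter-term length, with a default threshold of 0.5 in Lucene 2.x.)

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

query = "remediacion"
for term in ["remediación", "coordinacion", "presentación", "reubicación"]:
    d = levenshtein(query, term)
    # FuzzyQuery-style similarity: 1 - distance / length of the shorter string;
    # terms scoring at or above the threshold (default 0.5) are accepted.
    sim = 1 - d / min(len(query), len(term))
    print(f"{term}: distance={d}, similarity={sim:.2f}")
```

Note that "remediación" is only 1 edit away from "remediacion", so any fuzzy threshold accepts it, while words like "presentación" scrape in only because the 0.5 default is so permissive.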
> If I search for remediacion, Lucene should find both "remediacion" and
> "remediación" (this is what mgpp's accentfold option does). But this is not
> the case. Searching on Google for the way Lucene manages accented words, I
> found the following:
> Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents from
> ISO-8859-1 text. But thinking about the import process (i.e. with PDF docs):
> GS converts the original document into XML format with UTF-8 encoding (the
> doc.xml file in the archives folders). The question is: is the Lucene module
> involved in the import process, or only in the build process? I think that it
> has nothing to do with import, so when it has to process doc.xml files it
> can't apply a filter for Latin chars when the input is UTF-8.
> What about a UTF8AccentFilter?
> I assume that during indexing Lucene converts all these characters to their
> unaccented version (Ñ -> N, á -> a, etc.), and does the same during query
> processing. So "ñandú" becomes "nandu" in the index, and when the user
> searches for "ñandú" the query parser translates it to "nandu" and finds it
> in the index. Am I right?
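(Editor's note: the symmetric fold-at-index-time / fold-at-query-time behaviour described above can be sketched with Unicode normalization. This is a plain-Python stand-in for the hypothetical UTF8AccentFilter, not Greenstone or Lucene code: decomposing to NFD and dropping combining marks works on any UTF-8 text, so ñandú and nandu fold to the same term.)

```python
import unicodedata

def fold(text: str) -> str:
    """Strip accents: decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Toy index: terms are folded when indexing...
index = {fold("ñandú"): ["doc1"], fold("remediación"): ["doc2"]}

# ...and the query is folded the same way, so accented and
# unaccented spellings reach the same posting list.
print(index[fold("ñandú")])        # same entry as index[fold("nandu")]
print(index[fold("remediacion")])  # matches the folded "remediación"
```

Because the same fold is applied on both sides, it does not matter whether the user types the accent or not; both spellings map to one indexed term.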
> I think that perhaps the problem is with my Linux and GS installation. I
> tested it with CentOS 5 and Fedora 8, with the same results on both.
> I did a search in the two demo collections that David told me about:
> Try to search in the first one for "funcion" and then "función" in the text
> index. Try to search for "economia" and "economía" in the second one.
Thanks for your message and the helpful pointers to accent folding.
Could you please clarify precisely what happens for you when searching?
Searching the mgpp collection for "funcion" (accent folding *off*)
In the lucene collection searching for "función" produces 1 hit and
Searching for "economia" or "economía" in *either* collection produces
> MGPP is a very good engine but the main disadvantage is that it has no
> incremental building.