| From | David Bainbridge |
| Date | Tue Sep 2 22:47:25 2008 |
| Subject | [greenstone-devel] Re: Lucene |
| In-Reply-To | (!-!AAAAAAAAAAAYAAAAAAAAAMqkfLKn3i1CvY3Zg+pc6XzigAAAEAAAABG248j6gRJGjTG2pjJltzABAAAAAA==-orsna-gov-ar) |
Diego Spano wrote:
> Hi you all,
>
> I know that all of you were working to finish next GS release, but let me
> explain some things that I think are very important.
>
> For the last 8 or 9 months I was testing Lucene indexer in GS 2.80 because we
> need incremental building and fuzzy search, options not present in MGPP and
> MG.
>
> My first surprise was that someone told me that I was the first person
> trying to use Lucene with IIS. So, after many tests I decided to move to
> Linux platform where it is supposed Lucene works ok.
>
> Incremental import is not working. Suppose I have 4 docs in import folder,
> then run import and get 4 docs in archives. If I add 3 more docs in import
> (7 docs in total) and run import with -incremental flag, it should only
> process the last 3 docs and create 3 more folders in archive, but it
> processes nothing.
>
> Incremental building is working very well, only new docs are added to the
> index.
>
> Problems begin with index and search and accented words. It is also supposed
> that Lucene should find the words even when I search without accents, but it
> is not right. Some examples:
>
> 1- Query string: remediacion
> Search results= 0 documents
>
> 2- Query string: remediación
> Search results= 7 documents
>
> 3- Query string: remediacion~ (for fuzzy search)
> Search results= 24 documents because the fuzzy search includes the following
> terms:
> presentacion: 1, remodelaci?n: 1, renegociaci?n: 1, reubicaci?n: 1,
> recomendaci?n: 2, coordinacion: 6, realizaci?n: 7, remediaci?n: 12,
> presentaci?n: 35
>
> If I search remediacion, Lucene should find both "remediacion and
> remediación" (this is what mgpp accentfold option does). But this is not the
> case!
>
> Searching in Google about the way Lucene manages accented words I find some
> things:
>
> Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents on
> ISO-8859-1 text. But thinking in the import process (i.e. with PDF docs), GS
> converts the original document into xml format with UTF-8 encoding (the
> doc.xml file in archive folders). The question is: lucene module is involved
> in the import process or only in the build process? I think that it has
> nothing to do with import, so when it has to process doc.xml files it can't
> apply a filter for latin chars when the input is UTF8.
>
> What about a UTF8AccentFilter?
>
> I assume that during indexing Lucene converts all these characters to their
> unaccented version (Ñ -> N, á -> a, etc.) and the same thing during query
> processing. So "ñandú" becomes "nandu" in the index, and when the user
> searches "ñandú" the query parser translates it to "nandu" and finds it in the
> index. I'm right?
>
> I think that perhaps I have the problem with my Linux and GS installation. I
> tested it with Centos 5 and Fedora 8, and the same results in both.
>
> I did a search in two demo collections that David told me:
>
> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/spano-lucene&l=en&w=utf-8
>
> and
>
> http://www.greenstone.org/greenstone-video/cgi-bin/library?site=localhost&a=p&p=about&c=testing/lucenete&l=en&w=utf-8
>
> Try to search in the first one "funcion" and then "función" in text index.
> Try to search "economia" and "economía" in the second one.
>
Diego,

Thanks for your message and the helpful pointers to accent folding
Could you please clarify precisely what happens for you when searching
Searching the mgpp collection for "funcion" (accent folding *off*)
In the lucene collection searching for "función" produces 1 hit and
Searching for "economia" or "economía" in *either* collection produces
David.

> MGPP is a very good engine but the main disadvantage is that it has no
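
To make the accent-folding behaviour discussed in this thread concrete, here is a minimal sketch in Python. It is not Greenstone's or Lucene's code; the function names (fold_accents, build_index, search) and the sample documents are hypothetical. It assumes the approach Diego describes: fold each term to its unaccented form both when indexing and when parsing the query, so that "remediacion" and "remediación" land on the same index entry. Folding is done here by Unicode canonical decomposition (NFD) followed by dropping combining marks, which works on any Unicode text rather than only on ISO-8859-1 characters.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with diacritics removed (illustrative helper, not a Lucene API)."""
    # NFD splits an accented letter into base letter + combining mark(s),
    # e.g. "ó" becomes "o" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs: dict) -> dict:
    """Build a toy inverted index whose keys are lowercased, accent-folded terms."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(fold_accents(word), set()).add(doc_id)
    return index


def search(index: dict, query: str) -> list:
    """Fold the query the same way as the indexed terms, then look it up."""
    return sorted(index.get(fold_accents(query.lower()), set()))


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    docs = {
        "d1": "plan de remediación",
        "d2": "remediacion del suelo",
        "d3": "el ñandú",
    }
    index = build_index(docs)
    print(search(index, "remediacion"))
    print(search(index, "remediación"))
    print(fold_accents("ñandú"))
```

The point of the sketch is that the same folding function has to be applied on both the indexing side and the query side; folding only one of them would give the kind of mismatch reported above, where the accented and unaccented queries return different hit counts.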