[greenstone-devel] Re: Lucene

From Katherine of Greenstone Team
Date Mon Sep 15 12:38:09 2008
Subject [greenstone-devel] Re: Lucene
In-Reply-To (!-!AAAAAAAAAAAYAAAAAAAAAMqkfLKn3i1CvY3Zg+pc6XzigAAAEAAAAAqp+bBpV35DgdjMjs4pU1cBAAAAAA==-orsna-gov-ar)
Hi Diego

Thanks for pointing out this bug. To fix it you need to look for the
following method in perllib/classify.pm.

sub add_section_content {
    my ($doc_obj, $cursection, $doc_db_hash) = @_;

    foreach my $key (keys %$doc_db_hash) {
        # don't need to store these metadata
        next if $key =~
            /(thistype|childtype|contains|docnum|doctype|classifytype)/i;
        # but do want things like hastxt and archivedir
        my @items = split /@/, $doc_db_hash->{$key};
        # metadata is all from gdbm so should already be in utf8
        map {$doc_obj->add_metadata ($cursection, $key, $_); } @items;
    }
}

Change the add_metadata call to add_utf8_metadata.
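
For reference, the method would look roughly like this after the change. This
is only a sketch: MockDoc is a hypothetical stand-in for the real document
object so that the snippet runs on its own, and the sample hash is made up. In
classify.pm itself only the one call inside the loop needs to change.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Greenstone document object, here only so the
# sketch is self-contained. It simply records whatever it is given.
package MockDoc;
sub new { return bless { metadata => [] }, shift; }
sub add_utf8_metadata {
    my ($self, $section, $key, $value) = @_;
    push @{ $self->{metadata} }, [$section, $key, $value];
}

package main;

sub add_section_content {
    my ($doc_obj, $cursection, $doc_db_hash) = @_;

    foreach my $key (keys %$doc_db_hash) {
        # don't need to store these metadata
        next if $key =~
            /(thistype|childtype|contains|docnum|doctype|classifytype)/i;
        # but do want things like hastxt and archivedir
        my @items = split /@/, $doc_db_hash->{$key};
        # metadata from gdbm is already utf8, so store it without converting
        $doc_obj->add_utf8_metadata($cursection, $key, $_) foreach @items;
    }
}

# Made-up record: Title holds the utf8 bytes for "Boletín Diario 12/08/2007"
my $doc = MockDoc->new;
add_section_content($doc, '', {
    Title   => "Bolet\xc3\xadn Diario 12/08/2007",
    doctype => 'doc',
});
print scalar(@{ $doc->{metadata} }), " value(s) stored\n";
```

The doctype entry is skipped by the filter, and the Title bytes are passed
through untouched, which is the point of the change.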

Then rebuild from scratch to reset the gdbm database. After that, the next
incremental build should hopefully work properly. Please let me know if it
doesn't.

Cheers,
Katherine


Diego Spano wrote:
> Hi you all:
>
> Just a little big thing about Lucene. When I first index my collection, I have
> documents listed like this:
>
> Boletín Diario 12/08/2007
> Boletín Diario 13/08/2007
> Boletín Diario 14/08/2007
>
> All ok with accents. But when I run incremental index (perl -S buildcol.pl
> -incremental -buildir /gsdl/collect/diario/index diario) to add a new
> document, I get this result the first time:
>
> BoletÃ­n Diario 12/08/2007
> BoletÃ­n Diario 13/08/2007
> BoletÃ­n Diario 14/08/2007
> Boletín Diario 15/08/2007
>
> and this result the second time:
>
> BoletÃ?Â­n Diario 12/08/2007
> BoletÃ?Â­n Diario 13/08/2007
> BoletÃ?Â­n Diario 14/08/2007
> BoletÃ­n Diario 15/08/2007
> Boletín Diario 16/08/2007
>
> and this result the third time:
>
> BoletÃ?Â?Ã?Â­n Diario 12/08/2007
> BoletÃ?Â?Ã?Â­n Diario 13/08/2007
> BoletÃ?Â?Ã?Â­n Diario 14/08/2007
> BoletÃ?Â­n Diario 15/08/2007
> BoletÃ­n Diario 16/08/2007
> Boletín Diario 17/08/2007
>
> and so on. I think that in every pass, Lucene assumes that it has to
> convert to utf-8 even when the metadata is already in utf-8 format from the
> previous import process. It does convert-to-utf8(utf8-value)...
>
> Am I wrong?!?!?!?!?
>
> Diego
>
>
>
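
The repeated conversion Diego describes can be reproduced outside Greenstone
with a few lines of Perl. This is only an illustrative sketch using the core
Encode module, not the Greenstone code path, and it assumes each pass treats
the stored bytes as Latin-1 characters before encoding them to utf8 again.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "í" (U+00ED) as stored after a correct import: the utf8 bytes C3 AD
my @passes = (encode('UTF-8', "\x{ED}"));

# Simulate three passes that each mistake the stored utf8 bytes for
# Latin-1 characters and convert them to utf8 once more.
# (Assumption: Latin-1 is the encoding wrongly assumed on each pass.)
for my $pass (1 .. 3) {
    my $chars = decode('ISO-8859-1', $passes[-1]);
    push @passes, encode('UTF-8', $chars);
}

# Print the byte length and a hex dump of the value after each pass
for my $i (0 .. $#passes) {
    printf "pass %d: %2d bytes: %s\n", $i, length($passes[$i]),
        join(' ', map { sprintf '%02X', ord } split //, $passes[$i]);
}
```

Under that assumption the value doubles in length on every pass, because each
byte above 0x7F turns into two bytes. This matches the way the older titles get
longer and more garbled with each incremental build while the newest one is
still correct.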