Re: [greenstone-users] Indexing bug in plain list classifier?

From Michael Dewsnip
Date Wed, 15 Dec 2004 15:36:56 +1300
Subject Re: [greenstone-users] Indexing bug in plain list classifier?
In-Reply-To (41BDBB32-7090004-reltech-org)
Hi Tim,

Thanks for reporting this, it's an interesting one! When BasPlug uses
textcat to guess the language of a document, it first removes any
<title> tags -- justification (according to the comments) being that
many foreign documents have English titles. Trouble is, BasPlug assumes
that the entire <title> element is on one line; if it isn't, as in your
<title>blah\n</title> case (\n being the line break), it won't be removed.

Amusingly, with <title>Hello\n</title> left in my test document, textcat
is convinced the document is Italian! And /this/ is why the sort is
messed up -- it seems that the list classifiers look at the document
language when sorting.

My fix for this is to change the line in BasPlug::get_language_encoding from

$text =~ s/<title>.*?<\/title>//i;

to

$text =~ s/<title>(.|\n)*?<\/title>//i;

With this change the <title> element is removed even when it spans
several lines, textcat no longer thinks the document is Italian, and the
classifier sorts the document correctly.
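
If you want to check the behaviour outside Greenstone, here is a minimal
standalone sketch. The sample HTML and the variable names are made up for
illustration and are not taken from BasPlug; only the two substitution
patterns come from the fix described above. The third substitution, using
the /s modifier, is an alternative I would expect to behave the same way,
not something that is in the plugin.

```perl
use strict;
use warnings;

# Hypothetical sample: a <title> element whose closing tag sits on the next line.
my $html = "<html><head><title>Hello\n</title></head><body>Body text</body></html>";

# Original pattern: '.' does not match a newline, so nothing is removed.
(my $old = $html) =~ s/<title>.*?<\/title>//i;

# Fixed pattern: (.|\n) lets the match cross line breaks.
(my $new = $html) =~ s/<title>(.|\n)*?<\/title>//i;

# Alternative (an assumption, not in BasPlug): /s makes '.' match newlines too.
(my $alt = $html) =~ s/<title>.*?<\/title>//is;

print "old pattern removed title: ", ($old eq $html ? "no" : "yes"), "\n";
print "new pattern removed title: ", ($new =~ /<title>/i ? "no" : "yes"), "\n";
print "/s pattern agrees with fix: ", ($alt eq $new ? "yes" : "no"), "\n";
```

The first line should report "no" and the other two "yes", which matches
what happens to the title in the multi-line case.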

All the best,

Michael

Tim Finney wrote:

> I discovered what seems to be a bug with the plain list classifier. If
> some of my web pages have <title>blah </title> (i.e. a line break
> before the closing tag), then they end up appearing at the end of the
> list rather than in their correct alphabetical place. The cure is
> simple -- remove the line break which should never have been there in
> the first place.
>
> Best
>
> Tim Finney