Thanks for reporting this, it's an interesting one! When BasPlug uses
textcat to guess the language of a document, it first removes any
<title> tags -- justification (according to the comments) being that
many foreign documents have English titles. Trouble is, BasPlug assumes
that the entire <title> tag is on one line; if it isn't, as in your
case, it won't be removed.
Amusingly, with <title>Hello
</title> left in my test document textcat
is convinced the document is Italian! And /this/ is why the sort is
messed up -- it seems that the list classifiers look at the document
language when sorting.
My fix to this is to change the line in BasPlug::get_language_encoding from
$text =~ s/<title>.*?<\/title>//i;
to
$text =~ s/<title>(.|
This will cause the <title> tag to be removed correctly, textcat won't
think it is Italian, and it will be sorted correctly by the classifier.
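For anyone who wants to see the behaviour in isolation, here is a minimal sketch. It is not the actual BasPlug code: strip_title is a hypothetical helper, the sample document is made up, and the /s modifier is just one way (an assumption on my part) to let the pattern cross a line break. It relies only on the fact that "." does not match a newline in Perl unless /s is given.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BasPlug: removes the first <title>
# element whether or not it spans several lines.
sub strip_title {
    my ($text) = @_;
    # /s lets "." match newlines as well; /i ignores case.
    $text =~ s/<title>.*?<\/title>//is;
    return $text;
}

# Made-up test document with a line break before the closing tag.
my $doc = "<html><head><title>Hello\n</title></head><body>Ciao</body></html>";

# The single-line pattern cannot cross the newline, so nothing is removed.
(my $single_line = $doc) =~ s/<title>.*?<\/title>//i;
print "single-line pattern removed title: ",
    ($single_line eq $doc ? "no" : "yes"), "\n";

# With /s the whole element is removed.
my $stripped = strip_title($doc);
print "pattern with /s removed title: ",
    ($stripped =~ /<title>/i ? "no" : "yes"), "\n";
```

Running it should show the first substitution leaving the document untouched and the second one removing the title, which is the difference that matters before the text is handed to textcat.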
All the best,
Tim Finney wrote:
> I discovered what seems to be a bug with the plain list classifier. If
> some of my web pages have <title>blah
> </title> (i.e. a line break
> before the closing tag), then they end up appearing at the end of the
> list rather than in their correct alphabetical place. The cure is
> simple -- remove the line break which should never have been there in
> the first place.
> Tim Finney
> greenstone-users mailing list