Re: HTML entities

From John R. McPherson
Date Sat, 8 Feb 2003 17:12:08 +1300
Subject Re: HTML entities
In-Reply-To (20030207175243A-tao-lib-uchicago-edu)
On Fri, Feb 07, 2003 at 05:52:43PM -0600, Tod Olson wrote:
> >>>>> "J" == James R Adair <jadair@reltech.org> writes:
>
> J> I want to display a variety of standard HTML entities in
> J> Greenstone. The entity refs (e.g., &uuml;) are in the HTML files
> J> in the import directory, but after I run import.pl and buildcol.pl,
> J> the resulting files display garbage (in the case of &uuml;, Ã¼). I
> J> assume this is some sort of attempt to generate Unicode, but what I
> J> really want is just to have the entity pass straight through so
> J> that the Web browser can interpret it properly. Any suggestions?
>

> Ã¼ seems likely to be the UTF-8 value for ü.
> Maybe it's being converted to UTF-8 twice (I've had that), or maybe
> the browser is set to use a specific character set.

This can happen if you are using one of the third-party converters that
spit out UTF-8, but you have specified a different encoding such as
ISO-8859-1. Or perhaps you need to explicitly tell it that the input is
UTF-8, although I think this is unlikely.
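
To illustrate the mismatch described above, here is a minimal Python sketch
(not Greenstone code, just a stand-alone demonstration). It shows how the two
UTF-8 bytes for u-umlaut turn into two separate characters when they are read
as ISO-8859-1, and what the bytes look like if that misread text is then
encoded to UTF-8 a second time. The variable names are made up for the example.

```python
# Illustration only: UTF-8 bytes misread as ISO-8859-1.
original = "\u00fc"                          # u-umlaut, a single character

utf8_bytes = original.encode("utf-8")        # two bytes: 0xC3 0xBC
misread = utf8_bytes.decode("iso-8859-1")    # each byte becomes its own character

# Encoding the misread text again gives the "converted twice" byte sequence.
double_encoded = misread.encode("utf-8")

# Reversing the wrong decoding step recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(ascii(misread), len(misread))
print(double_encoded)
print(repaired == original)
```

If the page shows two characters where one accented letter was expected, this
kind of mismatch between the real and the declared encoding is a likely cause.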

One problem with passing HTML entities through is that you can
no longer search for words containing them, e.g.:
xxxüzzz vs xxx&uuml;zzz
Users will type the first into the search box, but the indexed text
will contain the second.
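
A small Python sketch of the search problem, assuming a plain substring match
stands in for the search engine (Greenstone's real indexer works differently;
this only shows why the two forms do not match). It uses the standard
library's html.unescape, and the variable names are invented for the example.

```python
import html

# Text as it would be indexed if the entity were passed straight through.
indexed_raw = "xxx&uuml;zzz"

# The same text with the entity converted to the real character.
indexed_converted = html.unescape(indexed_raw)

# What a user would actually type into the search box (u-umlaut).
query = "xxx\u00fczzz"

print(query in indexed_raw)        # False: the entity text does not match
print(query in indexed_converted)  # True: the converted text matches
```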

Having said that, if you are sure you don't want to convert HTML
entities such as &eacute; etc, edit perllib/plugins/HTMLPlug.pm
and comment out (by inserting a '#' at the start of the line)
the line near the bottom that says:
$$textref =~ s/&([^;]+);/&ghtml::getcharequiv($1,1)/gseo;

Incidentally, this function (ghtml::getcharequiv) converts entities to
their UTF-8 encoding.
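
For anyone wondering what that kind of substitution does, here is a rough
Python analogue. It is a sketch only, not a port of ghtml::getcharequiv, and
I am not reproducing the meaning of that function's second argument. The
helper names entity_to_char and convert_entities are hypothetical. The
pattern is the same one as in the Perl line, and the lookup table is Python's
standard html.entities.name2codepoint.

```python
import re
from html.entities import name2codepoint

def entity_to_char(match):
    """Return the character for a named entity, or the entity unchanged."""
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    return match.group(0)  # unknown names are left as they were

def convert_entities(text):
    # Same pattern as the Perl substitution: '&', anything up to ';', then ';'.
    return re.sub(r"&([^;]+);", entity_to_char, text)

converted = convert_entities("caf&eacute; &uuml;ber &bogus;")
utf8 = converted.encode("utf-8")   # the bytes that would be written out

print(ascii(converted))
print(utf8)
```

Commenting out the substitution amounts to skipping the convert_entities step,
so the entity text reaches the output (and the index) unchanged.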

Hope this helps
John McPherson