Re: [greenstone-users] searches with special characters

From James R. Adair
Date Tue, 10 Jun 2003 16:12:16 -0400
Subject Re: [greenstone-users] searches with special characters
In-Reply-To (3EE5100B-30307-cs-waikato-ac-nz)
I finally figured out how to make the special characters display
properly and be searchable. In gsdl/perllib/doc.pm, in the subroutine
buffer_section_xml, after the line:

my $escaped_value = &_escape_text($data->[1]);

add the line:

my $junk = substr $escaped_value,0;

then change the next line,

$all_text .= ' <Metadata name="' . $data->[0] . '">' .
$escaped_value . "</Metadata> ";

to:

$all_text .= ' <Metadata name="' . $data->[0] . '">' . $junk .
"</Metadata> ";

For some reason I haven't yet determined, whenever a metadata.xml file
causes _non-default_ metadata to be written to the doc.xml file,
$data->[1] seems to be treated as something other than a plain string.
The line "my $junk = ..." forces it back into string form, and
everything is fine. Since $data->[1] should contain only string data
(when I print it, that's all I see), I wonder whether this is some
sort of Perl bug. I'm using Perl 5.6.1 on Linux.
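
For reference, the patched lines can be sketched as a standalone script.
This is only an illustration under assumptions: escape_text is a
hypothetical stand-in for doc.pm's _escape_text (the real routine may
escape more characters), and metadata_element is a made-up wrapper so
the snippet runs outside Greenstone.

```perl
use strict;
use warnings;

# Hypothetical stand-in for doc.pm's _escape_text; the real one may differ.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical wrapper around the patched lines from buffer_section_xml.
# $data is an array ref of the form [ name, value ].
sub metadata_element {
    my ($data) = @_;
    my $escaped_value = escape_text($data->[1]);
    # The workaround: take a fresh copy of the value through substr.
    my $junk = substr $escaped_value, 0;
    return ' <Metadata name="' . $data->[0] . '">' . $junk . "</Metadata> ";
}

print metadata_element(['Title', 'Text & Context']), "\n";
```

On a current Perl the substr copy makes no visible difference to the
output; the sketch only shows where the extra line sits relative to the
escaping and the concatenation.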

I still have the problem with one of my files (of the three I've tried)
where the first instance of the entity &uuml; is passed through to
doc.xml as &amp;uuml;. The resulting HTML file, generated on the fly,
has &uuml;, which displays properly but can't be searched. However, I
can live with the occasional anomaly at this stage of the game, so I
probably won't try to track this problem down.
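
One plausible reading of this symptom, not verified against the
Greenstone source, is that the ampersand of the entity is escaped once
on the way into doc.xml and unescaped once on the way out, so the entity
survives as literal text instead of becoming the character. The sketch
below uses a hypothetical escape_amp helper and a made-up sample string
to show that round trip.

```perl
use strict;
use warnings;

# Hypothetical helper: escape bare ampersands, as an XML writer would.
sub escape_amp {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $source     = 'M&uuml;ller';          # entity as found in the source file
my $in_doc_xml = escape_amp($source);    # what would land in doc.xml

# Unescaping once restores the entity text, not the character itself.
(my $in_html = $in_doc_xml) =~ s/&amp;/&/g;

print "$in_doc_xml\n";    # M&amp;uuml;ller
print "$in_html\n";       # M&uuml;ller
```

If that is what happens, a browser would still render the entity as the
umlaut, while an index built from the stored text would hold the
characters "&uuml;" rather than the letter, which would match the
"displays properly but can't be searched" behaviour.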

Thanks for the help.

Jimmy

James R. Adair
Director, Religion and Technology Center
5385 Five Forks Trickum Rd., Suite 202
Stone Mountain, GA 30087
(770) 806-8747 - phone
(770) 925-3835 - fax
http://www.reltech.org
jadair@emory.edu


On Monday, June 9, 2003, at 06:54 PM, John R. McPherson wrote:

> James R. Adair wrote:
>
>> I imported another document and put it in a directory without a
>> metadata.xml file, and it worked correctly, umlauts and all. I went
>> back and removed the metadata.xml file from the directory of one of the
>> two previous documents, and now it works fine, too. I don't know why
>> the metadata.xml file affects the rendering of the entities, but it
>> apparently does. Any ideas? I'll play around with it to see if I
>> can figure it out, and if I do, I'll report back to the list.
>
> It's possible that the data in the metadata.xml files is
> accidentally being "converted" again to utf-8, although the file is
> supposed to be utf-8 in the first place and so shouldn't get
> converted. Did you have html entities in the metadata.xml file? If so,
> you could try replacing them with the proper utf-8 codes for the
> accent characters.
>
> I don't know that much about the metadata.xml stuff, so maybe someone
> else can look into this...
>
> John McPherson