Re: [greenstone-users] searches with special characters

From James R. Adair
Date Fri, 06 Jun 2003 17:03:24 -0400
Subject Re: [greenstone-users] searches with special characters
In-Reply-To (3EDFCBFA-6060704-cs-waikato-ac-nz)
On Thursday, June 5, 2003, at 07:02 PM, John R. McPherson wrote:

> James R. Adair wrote:
>> My HTML file has this: Wörtern für
>> After Greenstone processes it, it looks like this: Wörtern fÃ¼r
> That is what the 2 byte utf-8 character looks like if it is displayed
> as Western instead (it is displayed as 2 single byte characters). If
> you want to see if greenstone imported it properly, you can look
> at the doc.xml files in the archives directory (these files are all
> encoded in utf-8). If it's fine in the file, then it might be a browser
> issue. If it's not fine in the file, you might have a too-old
> version of greenstone. You could also try explicitly telling the
> plugin which encoding your files are in, although I'm pretty sure
> that this shouldn't make a difference for entity (&....;) conversions.
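To make John's explanation concrete, here is a minimal Python sketch of
what happens when the two UTF-8 bytes of a u-umlaut are displayed as
Western (ISO-8859-1). The helper name is made up for illustration and
has nothing to do with Greenstone's own code.

```python
# Illustration only: misread_utf8_as_latin1 is a hypothetical helper,
# not part of Greenstone.

def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    This mimics a browser that receives UTF-8 bytes but displays them
    with a Western (single-byte) encoding.
    """
    return text.encode("utf-8").decode("latin-1")


# A u-umlaut (U+00FC) takes two bytes in UTF-8.
UMLAUT_BYTES = "ü".encode("utf-8")        # b'\xc3\xbc'

# Shown as Western, each of those bytes becomes its own character.
GARBLED = misread_utf8_as_latin1("für")   # 'fÃ¼r'
```

If doc.xml contains exactly the two bytes C3 BC for the u-umlaut, the
file itself is fine and only the display encoding is wrong.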

I was running version 2.38, so I upgraded to 2.39, but I'm still
getting the same results. The doc.xml file for the words above has
W�~C¶rtern f�~C¼r, and when my browser displays it, it still looks like
Wörtern fÃ¼r. It almost looks like Greenstone is converting to
UTF-8 twice--at any rate, I end up with four bytes apiece for an
o-umlaut and a u-umlaut, when their UTF-8 encoding should be two bytes,
right? Interestingly, the very first character with an umlaut in this
file appears in doc.xml as künstliche, which ends up displaying
properly (View Source shows künstliche), but of course I can't search
on it (the original file has künstliche and für, but the two u-umlauts
are rendered differently!). I've tried viewing the files on a variety
of browsers on different platforms (Windows & Mac), all set to use
UTF-8 encoding, all with the same results.
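The "converted to UTF-8 twice" guess can be checked with a short Python
sketch. It assumes the second pass treated the UTF-8 bytes as
ISO-8859-1 text; that is only my assumption about what might have
happened, not a statement about Greenstone's import code, and the
function names are made up for illustration.

```python
# Illustration only: both function names are hypothetical, and the
# ISO-8859-1 misreading step is an assumption about the cause.

def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread the bytes as ISO-8859-1, encode again."""
    once = text.encode("utf-8")
    misread = once.decode("latin-1")
    return misread.encode("utf-8")


def repair_double_encoded(data: bytes) -> str:
    """Undo one extra round of UTF-8 encoding."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


# One u-umlaut becomes four bytes instead of two.
TWICE = double_encode("ü")   # b'\xc3\x83\xc2\xbc'
```

Four bytes per umlaut is what this produces, which matches what I see
in doc.xml. Running repair_double_encoded over such bytes returns the
original text, but only if every character in the data was
double-encoded in the same way.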

It looks to me like the doc.xml file is the problem. You said to try
explicitly telling the plugin the encoding of the files. Which plugin? The file is at select PBP-test, title
a-z, 62... (the first item).

>> that I can eventually search them? Do I need to upgrade to a more
>> recent version of Greenstone?
> I don't know - you didn't say which version you are using :) It works
> fine with gsdl version 2.39.
> John McPherson


James R. Adair
Director, Religion and Technology Center
5385 Five Forks Trickum Rd., Suite 202
Stone Mountain, GA 30087
(770) 806-8747 - phone
(770) 925-3835 - fax