Sorry for taking a while to get back to this. I had a play around
with the files you sent, without any great success. Here's what I
found.
The mok.htm file does indeed appear to be in one of the Chinese GB
encodings. It displays ok in Firefox with the encoding set to GB2312.
In IE7 it doesn't display correctly as GB2312 however, even though
the browser auto-selects that encoding. If I change the encoding to
GB18030 it displays ok in both IE7 and Firefox though, so I'm
guessing it's GB18030 and Firefox is just smart enough to display it
correctly even if GB2312 is selected.
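If you'd rather check this than rely on what the browsers show, one option is a strict trial decode against each candidate encoding. The following is only a rough Python 3 sketch: the function name decodable_as and the list of candidates are made up for illustration, and a file passing a strict decode only shows the bytes are valid in that encoding, not that the text is right.

```python
def decodable_as(data, candidates=("gb2312", "gbk", "gb18030")):
    """Return the candidate encodings that decode `data` without any error."""
    ok = []
    for name in candidates:
        try:
            # Strict decoding (the default) raises on any invalid byte sequence.
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Hypothetical usage on the file discussed above:
#   with open("mok.htm", "rb") as f:
#       print(decodable_as(f.read()))
```

If only gb18030 comes back, that would back up the guess that the file uses sequences that plain GB2312 doesn't have.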
The second file you sent appears to be slightly broken UTF-8, and
creates an XML validation problem once imported into Greenstone.
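To see exactly where a file stops being valid UTF-8, something along these lines might help. It is a minimal Python 3 sketch rather than a tested tool, and the function name utf8_errors is my own invention: it decodes strictly, records the byte offset of each failure, then carries on after the offending bytes.

```python
def utf8_errors(data):
    """Return a list of (byte_offset, reason) for each invalid UTF-8 sequence."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            errors.append((pos + exc.start, exc.reason))
            pos += exc.end  # skip past the bad bytes and keep scanning
    return errors


# Hypothetical usage:
#   with open("broken.htm", "rb") as f:
#       for offset, reason in utf8_errors(f.read()):
#           print(offset, reason)
```

The offsets should at least show whether the damage is confined to a few places or spread through the file.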
In any case, I had a go at converting the GB file to UTF-8 using
iconv (I used a Cygwin installation under Windows, but someone else
on the list might be able to advise you on an easier way to get
iconv). I used a command like the following:

    iconv -f GB18030 -t UTF-8 mok.htm > new.htm

The resulting file looked fine when opened as UTF-8 under Firefox
(even though Firefox auto-detected the encoding as GB2312 again!). It
also imported into a Greenstone collection ok, but when displayed in
Greenstone there were a few errors (it looked very similar to the
broken UTF-8 file you sent in fact).
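If getting hold of iconv is a nuisance, the same GB18030 to UTF-8 conversion can be sketched in Python 3, assuming that is easier for you to install. This is an untested-on-your-files sketch, and the function name gb18030_to_utf8 is just for illustration. It decodes strictly, so it fails loudly on bad input instead of quietly writing out damaged characters.

```python
def gb18030_to_utf8(src_path, dst_path):
    """Re-encode a GB18030 file as UTF-8; return the number of characters."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # Strict decode: raises UnicodeDecodeError if the bytes are not valid GB18030.
    text = raw.decode("gb18030")
    # Write bytes directly so line endings are left exactly as they were.
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
    return len(text)


# Hypothetical usage, mirroring the iconv command:
#   gb18030_to_utf8("mok.htm", "new.htm")
```

Like iconv, this only changes the bytes. If the HTML carries its own charset declaration, that declaration is left as it was, which may be worth checking given that Firefox still picked GB2312.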
So I don't have a perfect solution for you unfortunately. One more
thing to try might be to convert from GB18030 to UHC (Unified Hangul
Code, the Korean Windows code page 949) and see if that
works any better, though I don't see why it should. If there's anyone
on the list with experience of these specific CJK encodings they
might be able to help further. Having said that, the fact that the
UTF-8 file I created looked fine before importing into Greenstone,
then had issues once imported, suggests there might be a problem with
the Greenstone import process.
I don't have time to look into this further unfortunately, but
please let me know if you make any further discoveries.

On 4/04/2007, at 6:17 PM, Julian Fox wrote:
> No problems. You'll see it's a mixture of Korean and Italian.
> I'm adding a version I HAVE managed to get into UTF-8 (or rather
> that someone else got it into, from Korea itself) and I am having
> about 95% success with that when Greenstone converts it, but get
> groups of hieroglyphics still which Koreans will obviously be
> annoyed by. I note that nobody has tackled translation for Korean
> in Greenstone yet and maybe it's just one of those very difficult
> character sets to deal with - not an exact science!