I don't know much about Korean, but I have built a few collections in
Chinese and Japanese. The first thing I'd suggest is that you open the
HTML file in your web browser again, and make sure it's readable. You
should then be able to check what character encoding the browser is
using to display it (in Firefox, go to the "View" menu and select
"Character Encoding"). Once you're sure of the character encoding you'll
be better placed to solve the problem. If it's an encoding that
Greenstone already supports (e.g. UHC or UTF-8) you can set the
input_encoding option appropriately and it should work. If it's one of
the several Korean encodings that Greenstone doesn't support we'll need
to come up with some mapping files to make it work, but that's not too
hard. Alternatively, you could use iconv to convert the file to UTF-8
before importing it into Greenstone.
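
If you'd rather script the check and the conversion, here is a rough Python sketch of the same idea: try a few candidate encodings, keep the one whose decoded text contains the most Hangul, and write the result out as UTF-8. The function names, the candidate list and the Hangul-counting heuristic are my own assumptions for illustration, not anything Greenstone provides, and a clean decode is only a hint, so you should still eyeball the output.

```python
from typing import Optional, Sequence, Tuple

# Candidate codecs from Python's standard library. The list and its order are
# assumptions for illustration; cp949 is Python's name for UHC.
CANDIDATES = ("utf-8", "cp949", "euc_kr", "johab", "iso2022_kr")


def hangul_ratio(text: str) -> float:
    """Fraction of non-ASCII characters that are Hangul syllables (U+AC00..U+D7A3)."""
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0
    hangul = sum(1 for ch in non_ascii if 0xAC00 <= ord(ch) <= 0xD7A3)
    return hangul / len(non_ascii)


def guess_korean_encoding(
    raw: bytes, candidates: Sequence[str] = CANDIDATES
) -> Optional[Tuple[str, str]]:
    """Return (codec name, decoded text) for the best-scoring candidate, or None."""
    best = None
    best_score = 0.0
    for name in candidates:
        try:
            text = raw.decode(name)  # strict decoding: invalid bytes raise an error
        except UnicodeDecodeError:
            continue
        score = hangul_ratio(text)
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = (name, text), score
    return best


def convert_to_utf8(src_path: str, dst_path: str) -> Optional[str]:
    """Re-encode src_path as UTF-8 at dst_path; return the codec that was guessed."""
    with open(src_path, "rb") as src:
        raw = src.read()
    guess = guess_korean_encoding(raw)
    if guess is None:
        return None  # nothing decoded to Hangul; leave the file alone
    name, text = guess
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return name
```

Once you know the source encoding, the iconv equivalent would be something like `iconv -f CP949 -t UTF-8 in.html > out.html` (file names made up). Either way, if the HTML still carries a meta tag declaring gb2312, it is probably worth correcting that to match the converted file.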
Greenstone Digital Library and Digitisation Specialists
Julian Fox wrote:
> If anyone has experience dealing with CJK languages they might be able
> to help me with this one. I have a Korean text already in HTML. When
> I check its character set it seems to be using gb2312 which is
> simplified Chinese. I can open it up in my browser and see that it's
> pretty clearly Korean however - and a Korean could read it!
> When I adjusted the html plug for simplified Chinese, at least I got a
> result which indicates there is text - but it is just oblongs. With
> other encoding I couldn't even get that far - I tried Korean Hangul
> and just got blanks.
> As I say, if someone has had to deal with this in Greenstone they
> almost certainly have a way to deal with it. At the moment my only
> solution is to convert the text to PDF and then use pagedimage - that way
> users can read the text but they can't search it.
> Solution(s) would be much appreciated - if such exist(s).