From | graeme |
Date | Sun, 18 Feb 2007 09:40:21 +0430 |
Subject | Re: [greenstone-users] UTF-8? |
In-Reply-To | (45D75791-6090502-sdb-org) |
As I understand it, UTF-8 is not a superset of ISO-8859-1. UTF-8 maps directly onto 7-bit ASCII, but not onto the bytes with the 8th bit set, which is where ISO-8859-1 keeps the accented characters of the Latin alphabet. This is because of the design of UTF-8: the leading bit(s) of the first byte of a UTF-8 character always describe how many bytes are required to fully encode the Unicode character. So for a one-byte character the first bit is always 0; for a two-byte character the first three bits are 110 and the first two bits of the subsequent byte are 10. That means 5 of the 16 available bits are used to identify the byte type and to give some simple error checking, leaving 11 bits for the character itself. Excluding the 128 values already covered by the one-byte encoding, that allows 1920 characters to be encoded using the two-byte scheme. UTF-8 continues into three- and four-byte encodings.
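For anyone who wants to check the two-byte layout described above, here is a minimal Python sketch. The function name encode_two_byte is made up for illustration and is not part of Greenstone or any library; it only handles the two-byte range and compares its result with Python's built-in encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (hypothetical helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers only U+0080 to U+07FF")
    lead = 0b11000000 | (code_point >> 6)             # 110yyyyy: top 5 of the 11 bits
    continuation = 0b10000000 | (code_point & 0x3F)   # 10zzzzzz: low 6 bits
    return bytes([lead, continuation])


# a-grave is the single byte E0 in ISO-8859-1 and code point U+00E0 in Unicode.
a_grave = encode_two_byte(0xE0)
print(a_grave.hex())                          # c3a0
print(a_grave == "\u00e0".encode("utf-8"))    # True
print("\u00e0".encode("latin-1").hex())       # e0

# Number of code points in the two-byte range.
print(0x7FF - 0x80 + 1)                       # 1920
```

The last line is just the count of values from 0x80 to 0x7FF, which matches the 1920 figure above.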
In ISO-8859-1, à is encoded as the single byte E0, whereas in UTF-8 it becomes the two bytes C3 A0:

E0 = 11100000
E0 = 00011 100000 (as 11 bits)
E0 = yyyyy zzzzzz

The two-byte UTF-8 form is given by:

110yyyyy 10zzzzzz

Thus it equals:

11000011 10100000 = C3 A0
110yyyyy 10zzzzzz

Sorry if my explanation is too much, but in short UTF-8 and ISO-8859-1 are different animals (once you get to the high bit, which is where the accented characters reside).

Graeme.

On 2/17/07, Julian Fox <jbfox@sdb.org> wrote:
Dear List, |