Latin-1 characters in the range 0x80 - 0xFF are encoded by UTF-8 as two bytes each.

UTF-8                                      Latin-1
HEX      BIN                    DEC        DEC
C2 A0    1100-0010 1010-0000    194 160    160
C3 80    1100-0011 1000-0000    195 128    192
C3 81    1100-0011 1000-0001    195 129    193

In some cases, we might need to translate UTF-8 encoded Latin-1 characters back to one byte.

First we need to know how UTF-8 expands a Latin-1 character from one byte to two. All two-byte UTF-8 characters have the following fixed encoding format. The value of each 'x' depends on the character being encoded.

110x,xxxx 10xx,xxxx 

In the first byte, the leading '11' means that this is a two-byte character. The '0' that immediately follows is a fixed flag that separates the two flag bits from the remaining bits.

In the second byte, the leading '10' is a fixed flag that distinguishes a continuation byte from a leading byte such as 110x,xxxx. All ASCII characters have the form 0xxx,xxxx, so if an application reads a byte like 10xx,xxxx and cannot read the byte before it, it can discard this byte and move on to the next one.
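The flag bits described above are enough to tell the kind of a byte without looking at its neighbours. The following is a minimal sketch in C; the enum and the function name classify_byte are made up for this illustration, and anything longer than a two-byte sequence is simply reported as "other".

```c
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
enum byte_kind { BYTE_ASCII, BYTE_LEAD2, BYTE_CONT, BYTE_OTHER };

/* Classify a single byte by its high-order flag bits. */
static enum byte_kind classify_byte(uint8_t b)
{
    if ((b & 0x80) == 0x00) return BYTE_ASCII;  /* 0xxx,xxxx */
    if ((b & 0xC0) == 0x80) return BYTE_CONT;   /* 10xx,xxxx */
    if ((b & 0xE0) == 0xC0) return BYTE_LEAD2;  /* 110x,xxxx */
    return BYTE_OTHER;                          /* longer sequences or invalid bytes */
}
```

A reader that lands in the middle of a stream can skip bytes while classify_byte reports BYTE_CONT and resume at the next ASCII or leading byte.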

For the one-byte Latin-1 character 0xC1, the binary encoding is:

11000001

The following shows how it maps to two bytes in UTF-8 encoding.

11000011 10000001 

110 and 10 are the fixed flags.

000 is filler.
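The mapping for 0xC1 can be reproduced with two shifts and masks. This is only a sketch of the encoding direction, assuming the input is a Latin-1 byte in the range 0x80 - 0xFF; the function name latin1_to_utf8 is made up for this example.

```c
#include <stdint.h>

/* Encode one Latin-1 byte (0x80 - 0xFF) as two UTF-8 bytes.
   latin1_to_utf8 is a hypothetical name used only in this sketch. */
static void latin1_to_utf8(uint8_t c, uint8_t out[2])
{
    /* 110 flag + 000 filler + the top two bits of c */
    out[0] = (uint8_t)(0xC0 | (c >> 6));
    /* 10 flag + the low six bits of c */
    out[1] = (uint8_t)(0x80 | (c & 0x3F));
}
```

For 0xC1 (11000001) the top two bits are 11 and the low six bits are 000001, which gives 11000011 10000001, that is, C3 81.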

Way one 

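The following is an implementation in DB2 SQL PL.

SET h_s = SUBSTR( str, i, 1 ) ;      -- leading byte
SET l_s = SUBSTR( str, i + 1, 1 ) ;  -- continuation byte

-- move the low two bits of the leading byte to the top (same as << 6)
SET h_s = CHR( MOD( ASCII( h_s ) * 64, 256 ) ) ;
-- clear the two flag bits of the continuation byte (same as << 2, then >> 2)
SET l_s = CHR( MOD( ASCII( l_s ) * 4, 256 ) / 4 ) ;

RETURN CHR( ASCII( h_s ) + ASCII( l_s ) ) ;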

In the code above, we use the operators * (multiply), / (divide), + (plus) and 'mod' instead of the C bit operators <<, >> and | (or). It is equivalent to the following C code.

/* h_s and l_s are unsigned char, so bits shifted past bit 7 are dropped */
h_s = h_s << 6 ;
l_s = l_s << 2 ;
l_s = l_s >> 2 ;

return h_s | l_s ;
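To check that the arithmetic form and the bit-operator form really agree, both can be written as small C functions and compared. This is a sketch only; the names decode_bits and decode_arith are made up here, and the arithmetic version mirrors the SQL expressions rather than being the SQL itself.

```c
#include <stdint.h>

/* Bit-operator form: drop the flag bits and join the payload bits. */
static uint8_t decode_bits(uint8_t h_s, uint8_t l_s)
{
    h_s = (uint8_t)(h_s << 6);  /* low two bits of the leading byte move to the top */
    l_s = (uint8_t)(l_s << 2);  /* push the 10 flag out of the byte ... */
    l_s = (uint8_t)(l_s >> 2);  /* ... and bring the six payload bits back */
    return (uint8_t)(h_s | l_s);
}

/* Arithmetic form: the same steps with *, /, + and modulo 256. */
static uint8_t decode_arith(unsigned h_s, unsigned l_s)
{
    h_s = (h_s * 64) % 256;
    l_s = ((l_s * 4) % 256) / 4;
    return (uint8_t)(h_s + l_s);
}
```

Plus can stand in for 'or' here because after the shifts the two values never have a set bit in the same position.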

Way two 

SET h_s = SUBSTR( str, i, 1 ) ;
SET l_s = SUBSTR( str, i + 1, 1 ) ;

IF ( ASCII( h_s ) = 194 ) THEN -- Latin-1, leading byte 0xC2
    -- 0xC2 0x80 - 0xC2 0xBF ==> 0x80 - 0xBF
    RETURN l_s ;

ELSEIF ( ASCII( h_s ) = 195 ) THEN -- Latin-1, leading byte 0xC3
    -- 0xC3 0x80 - 0xC3 0xBF ==> 0xC0 - 0xFF, 0xC0 - 0x80 = 0x40 (64)
    RETURN CHR( ASCII( l_s ) + 64 ) ;

END IF ;
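The same lookup on the leading byte can be applied to a whole buffer. The following C sketch assumes the input contains only ASCII and two-byte Latin-1 sequences and gives up on anything else; the function name utf8_to_latin1 and the (size_t)-1 error value are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n bytes of UTF-8 in src to one-byte Latin-1 in dst.
   dst must have room for at least n bytes.
   Returns the number of bytes written, or (size_t)-1 if a byte
   sequence outside ASCII and Latin-1 is found. */
static size_t utf8_to_latin1(const uint8_t *src, size_t n, uint8_t *dst)
{
    size_t i = 0, j = 0;

    while (i < n) {
        uint8_t h = src[i];

        if (h < 0x80) {
            /* ASCII: copy unchanged */
            dst[j++] = h;
            i += 1;
        } else if ((h == 0xC2 || h == 0xC3) && i + 1 < n
                   && (src[i + 1] & 0xC0) == 0x80) {
            uint8_t l = src[i + 1];
            /* 0xC2: 0x80 - 0xBF stays as is; 0xC3: add 64 to reach 0xC0 - 0xFF */
            dst[j++] = (h == 0xC2) ? l : (uint8_t)(l + 64);
            i += 2;
        } else {
            return (size_t)-1;
        }
    }
    return j;
}
```

For example, the five bytes 41 C3 81 C2 A0 come out as the three bytes 41 C1 A0.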
