Monday, 15 February 2010

Behavior of extended bytes/characters in C/POSIX locale


Both C and POSIX require only a limited set of characters to be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for example, supporting all of Unicode (in the form of UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or an "abstract 8-bit character set" in which the non-ASCII bytes are abstract characters without any particular identity. (In the latter case, if the compiler defines __STDC_ISO_10646__, then they must correspond to Unicode characters, usually in the Latin-1 range.)

Another conforming option, which seems to be much less popular, is to treat all non-ASCII bytes as non-characters, i.e. to respond to them with an EILSEQ error.

I am interested in knowing whether any implementations take this or any other unusual option in implementing the C locale. Are there any implementations where attempting to convert "high bytes" in the C locale results in EILSEQ, or in anything other than treatment as single-byte characters (abstract or Latin-1) or as UTF-8?
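One way to check a given implementation is to feed each high byte to mbrtowc in the C locale and look at the result. The following is only a minimal sketch: the names byte_kind, classify_byte and count_high_bytes are made up for this illustration, and the outcome depends entirely on the C library being tested.

```c
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical classification of how the current locale treats one byte. */
enum byte_kind {
    BYTE_SINGLE,     /* converted as a complete one-byte character */
    BYTE_ILLEGAL,    /* mbrtowc failed with EILSEQ */
    BYTE_INCOMPLETE, /* byte looks like the start of a longer sequence */
    BYTE_OTHER       /* any other result */
};

/* Convert a single byte with a fresh conversion state and classify it. */
enum byte_kind classify_byte(unsigned char b)
{
    mbstate_t st;
    wchar_t wc;
    char c = (char)b;
    size_t r;

    memset(&st, 0, sizeof st);  /* initial conversion state */
    errno = 0;
    r = mbrtowc(&wc, &c, 1, &st);

    if (r == (size_t)-2)
        return BYTE_INCOMPLETE;
    if (r == (size_t)-1)
        return errno == EILSEQ ? BYTE_ILLEGAL : BYTE_OTHER;
    if (r == 1 || (r == 0 && b == 0))  /* 0 is returned for the null byte */
        return BYTE_SINGLE;
    return BYTE_OTHER;
}

/* Tally the bytes 0x80..0xFF by kind; counts is indexed by enum byte_kind. */
void count_high_bytes(int counts[4])
{
    assert(counts != NULL);
    memset(counts, 0, 4 * sizeof counts[0]);
    setlocale(LC_ALL, "C");  /* make the locale under test explicit */
    for (int b = 0x80; b <= 0xFF; b++)
        counts[classify_byte((unsigned char)b)]++;
}
```

Calling count_high_bytes from main and printing the four counters gives a quick picture: all 128 bytes under BYTE_SINGLE suggests an 8-bit-clean C locale, all under BYTE_ILLEGAL suggests the EILSEQ option, and a mix including BYTE_INCOMPLETE hints at a multibyte encoding such as UTF-8.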

In a late reply to your comment:

Basically, it is possible that bytes outside the portable character set are either illegal non-character bytes (EILSEQ) or part of some multibyte encoding (UTF-8 or a stateless legacy CJK encoding).

You can find an example

Plan 9 supports only the "C" locale, as you can see, and when a character falls outside the portable character set it is treated as a rune, so it is handled as a character from a different encoding.

There may be another candidate as well (as far as they use it). In the Minix source code, I also looked for any encoding in which the size of a character is not 8 bits.
