Perl Documentation

NAME

Encode::Supported -- Encodings supported by Encode

DESCRIPTION

Encoding Names

Encoding names are case insensitive. White space in names is ignored. In addition, an encoding may have aliases. Each encoding has one "canonical" name. The "canonical" name is chosen from the names of the encoding by picking the first in the following sequence (with a few exceptions).

Where a de jure canonical name differs from the one the Encode module uses, it is always provided as an alias, as long as the encoding is implemented at all. You can therefore tell whether a given encoding is implemented just by passing its canonical name.
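As a minimal sketch of that check, resolve_alias() from Encode returns the canonical name for any known name or alias and a false value otherwise. The name 'no-such-encoding' below is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(resolve_alias);

# resolve_alias() returns the canonical name of an encoding,
# or a false value when Encode does not know the name.
# 'no-such-encoding' is a hypothetical name used only for this demo.
for my $name ('latin1', 'US-ascii', 'ISO-646-US', 'no-such-encoding') {
    my $canonical = resolve_alias($name);
    printf "%-18s => %s\n", $name, $canonical || '(not implemented)';
}
```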

Because of all the alias issues, and because in the general case encodings have state, "Encode" uses an encoding object internally once an operation is in progress.
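The same encoding object can also be obtained and reused directly. The following is a small sketch, assuming only find_encoding() and the object's encode(), decode() and name() methods; the sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look the encoding up once; the alias "latin1" resolves to iso-8859-1.
my $latin1 = find_encoding('latin1')
    or die "latin1 is not available\n";

my $text  = "caf\x{e9}";               # four characters
my $bytes = $latin1->encode($text);    # four bytes, one per character
my $back  = $latin1->decode($bytes);   # characters again

printf "%s: %d bytes for %d characters\n",
    $latin1->name, length $bytes, length $back;
```

Holding on to the object avoids repeating the alias lookup for every call.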

Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized. Note that unless otherwise specified, they are all case insensitive (via alias) and all occurrences of spaces are replaced with '-'. In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules, but in most cases you don't have to use Encode::XX to make them available. Encode.pm will automatically load those modules on demand.
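On-demand loading can be observed with a short sketch like the one below. It assumes a standard Perl installation where Encode::JP is present, and uses Encode->encodings() to list what is loaded and Encode->encodings(':all') to list what is available.

```perl
use strict;
use warnings;
use Encode qw(encode);

my @loaded_before = Encode->encodings();        # encodings already loaded
my @available     = Encode->encodings(':all');  # everything Encode can load

# No "use Encode::JP" here: the first use of euc-jp loads the module.
my $bytes = encode('euc-jp', "\x{3042}");       # HIRAGANA LETTER A

printf "loaded at start: %d, available: %d\n",
    scalar @loaded_before, scalar @available;
printf "Encode::JP loaded: %s\n", $INC{'Encode/JP.pm'} ? 'yes' : 'no';
```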

Built-in Encodings

The following encodings are always available.

Canonical     Aliases                      Comments & References
----------------------------------------------------------------
ascii         US-ascii ISO-646-US                         [ECMA]
ascii-ctrl                                      Special Encoding
iso-8859-1    latin1                                       [ISO]
null                                            Special Encoding
utf8          UTF-8                                    [RFC2279]
----------------------------------------------------------------

null and ascii-ctrl are special. "null" fails for all characters, so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL CHARACTERS will fall back to character references. Ditto for "ascii-ctrl", except for control characters. For fallback modes, see Encode.
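A hedged sketch of that behaviour: with FB_HTMLCREF, "ascii" only escapes what it cannot represent, while "null" escapes every character. Because encode() may modify its source when a CHECK value is passed, each call works on its own copy.

```perl
use strict;
use warnings;
use Encode qw(encode FB_HTMLCREF);

my $src = "caf\x{e9}";

# Separate copies, because encode() may alter its source under CHECK.
my $for_ascii = $src;
my $for_null  = $src;

my $ascii = encode('ascii', $for_ascii, FB_HTMLCREF);  # only U+00E9 falls back
my $null  = encode('null',  $for_null,  FB_HTMLCREF);  # every character falls back

print "ascii: $ascii\n";
print "null:  $null\n";
```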

Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by Encode::Unicode, which will be autoloaded on demand.

----------------------------------------------------------------
UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
UCS-2LE                                                     [UC]
UTF-16                                                      [UC]
UTF-16BE                                                    [UC]
UTF-16LE                                                    [UC]
UTF-32                                                      [UC]
UTF-32BE      UCS-4                                         [UC]
UTF-32LE                                                    [UC]
UTF-7                                                  [RFC2152]
----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see Encode::Unicode.
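As a quick illustration of the differences, the sketch below dumps the bytes each scheme produces for the single character "A" (U+0041). The variants without an explicit byte order prepend a BOM when encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A";   # U+0041

for my $name (qw(UCS-2BE UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE UTF-32)) {
    # Show the encoded bytes as space-separated hex.
    my $hex = join ' ', map { sprintf '%02X', $_ }
                        unpack 'C*', encode($name, $text);
    printf "%-9s %s\n", $name, $hex;
}
```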

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit encoding. It is implemented separately by Encode::Unicode::UTF7.

Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for Symbols and EBCDIC. The following encodings are based on single-byte encodings implemented as extended ASCII. Most of them map \x80-\xff (upper half) to non-ASCII characters.

gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII, control character ranges and other parts are mapped very differently, mainly to store Greek characters. There are also escape sequences (starting with 0x1B) to cover e.g. the Euro sign.

This was once handled by Encode::Byte but because of all those unusual specifications, Encode 2.20 has relocated the support to Encode::GSM0338. See Encode::GSM0338 for details.
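A small sketch of how far this mapping departs from ASCII, assuming Encode 2.20 or later so that Encode::GSM0338 is loaded on demand: the commercial at sign lands in the control range, and the Euro sign needs an escape sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

my $letter = encode('gsm0338', 'A');          # letters match ASCII
my $at     = encode('gsm0338', '@');          # mapped into the control range
my $euro   = encode('gsm0338', "\x{20ac}");   # escape sequence starting with 0x1B

printf "A      => %s\n", hex_bytes($letter);
printf "\@      => %s\n", hex_bytes($at);
printf "U+20AC => %s\n", hex_bytes($euro);
```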

CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset" below. Also note that these are implemented in separate modules per country because of size concerns (simplified Chinese is mapped to 'CN', mainland China, while traditional Chinese is mapped to 'TW', Taiwan). Please refer to their respective documentation pages.

Miscellaneous encodings

Unsupported encodings

The following encodings are not supported as yet; some because they are rarely used, some because of technical difficulties. They may be supported by external modules via CPAN in the future, however.

Encoding vs. Charset -- terminology

We are used to using the term (character) encoding and character set interchangeably. But just as confusing the terms byte and character is dangerous and the terms should be differentiated when needed, we need to differentiate encoding and character set.

To understand that, here is a description of how we make computers grok our characters.

Technically, or mathematically, speaking, a character set encoded in such a CES that maps character by character may form a CCS. EUC is such an example. The CES of EUC is as follows:

By carefully looking at the encoded byte sequence, you can find that the byte sequence corresponds to a unique number. In that sense, EUC is a CCS generated by the CES above from up to four CCSs (complicated?). UTF-8 falls into this category. See "UTF-8" in perlunicode to find out how UTF-8 maps Unicode to a byte sequence.
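For UTF-8, that mapping from one character to a unique byte sequence can be seen in a short sketch. The character chosen here, HIRAGANA LETTER A (U+3042), is only an example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{3042}";              # one character: HIRAGANA LETTER A
my $bytes = encode('UTF-8', $char);  # its UTF-8 byte sequence
my $back  = decode('UTF-8', $bytes); # the same character again

printf "characters: %d, bytes: %d (%s)\n",
    length $char, length $bytes,
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
```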

You may also have found out by now why 7bit ISO-2022 cannot comprise a CCS. If you look at a byte sequence \x21\x21, you can't tell if it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no trouble differentiating between "!!" and IDEOGRAPHIC SPACE (U+3000).
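The ambiguity can be made visible with a sketch that encodes both strings in euc-jp and iso-2022-jp and prints the bytes. In the 7-bit encoding the bytes 21 21 appear for both inputs, and only the surrounding escape sequences tell them apart.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

my $bangs = "!!";          # two EXCLAMATION MARKs
my $space = "\x{3000}";    # one IDEOGRAPHIC SPACE

for my $name (qw(euc-jp iso-2022-jp)) {
    printf "%-12s '!!' => %s\n", $name, hex_bytes(encode($name, $bangs));
    printf "%-12s U+3000 => %s\n", $name, hex_bytes(encode($name, $space));
}
```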

Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their applicability for information exchange over the Internet and to choose the most suitable aliases to name them in the context of such communication.

Encoding names

US-ASCII    UTF-8    ISO-8859-*  KOI8-R
Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may be used over the Internet.

Shift_JIS has been made official by JIS X 0208:1997. "Microsoft-related naming mess" gives details.

GB2312 is the IANA name for EUC-CN. See "Microsoft-related naming mess" for details.

GB_2312-80 raw encoding is available as gb2312-raw with Encode. See Encode::CN for details.
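The relation between the two names can be sketched as follows, using the character U+4E2D as an arbitrary example. The raw encoding yields the 7-bit code point, while EUC-CN (reached here through the GB2312 alias) sets the high bit on both bytes.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);

# Print a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

my $zhong = "\x{4e2d}";                       # example character only

my $euc = encode('GB2312',     $zhong);       # alias for euc-cn
my $raw = encode('gb2312-raw', $zhong);       # raw GB_2312-80 code point

printf "GB2312 resolves to %s\n", resolve_alias('GB2312');
printf "euc-cn:     %s\n", hex_bytes($euc);
printf "gb2312-raw: %s\n", hex_bytes($raw);
```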

EUC-CN
KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but seem to be supported by major web browsers. The IANA name for EUC-CN is GB2312.

KS_C_5601-1987

is heavily misused. See "Microsoft-related naming mess" for details.

KS_C_5601-1987 raw encoding is available as kcs5601-raw with Encode. See Encode::KR for details.

UTF-16 UTF-16BE UTF-16LE

are IANA-registered charsets. See [RFC 2781] for details. Jungshik Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and NS 4/6. Beware however that

The rule of thumb is to use UTF-8 unless you know what you're doing and unless you really benefit from using UTF-16.
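One practical consequence of choosing plain UTF-16 is that decoding depends on the BOM. The sketch below assumes the behaviour documented in Encode::Unicode: either BOM is honoured, and input without a BOM makes decode() die.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $be = decode('UTF-16', "\xfe\xff\x00\x41");   # big-endian BOM, then "A"
my $le = decode('UTF-16', "\xff\xfe\x41\x00");   # little-endian BOM, then "A"

# Without a BOM the byte order is unknown, so decode() is expected to die.
my $ok  = eval { decode('UTF-16', "\x00\x41"); 1 };
my $err = $ok ? '' : $@;

print "with BOM: $be $le\n";
print "without BOM: ", ($ok ? "decoded" : "failed: $err"), "\n";
```

When the byte order is known in advance, naming UTF-16BE or UTF-16LE explicitly avoids the dependency on a BOM.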

ISO-IR-165    [RFC1345]
VISCII
GB 12345
GB 18030 (**)  (see links below)
EUC-TW   (**)

are totally valid encodings but not registered at IANA. The names under which they are listed here are probably the most widely-known names for these encodings and are recommended names.

BIG5PLUS (**)

is a proprietary name.

Microsoft-related naming mess

Microsoft products misuse the following names:

Glossary

See Also

Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW, Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess

References

Other Notable Sites

Offline sources