UTF-16
UTF-16 is a character encoding form used within the Unicode system. It stores text as sequences of 16-bit code units and is capable of representing all Unicode code points. In practice, most common characters fit into a single 16-bit unit, while characters outside the Basic Multilingual Plane (BMP) require a pair of 16-bit units called a surrogate pair. This design makes UTF-16 a middle ground between the compactness of variable-width encodings for some scripts and the simplicity of fixed-width encodings. It is widely used in several major software ecosystems, including the Windows API and the core string types of some popular programming languages. For broader context, see Unicode and the related encoding forms UTF-8 and UTF-32.
UTF-16 is defined as an encoding form of the Unicode character set. In memory and in many file formats, text is represented as a sequence of 16-bit code units. A code point in the Basic Multilingual Plane (U+0000 to U+FFFF, excluding the surrogate range) maps directly to one 16-bit unit. Code points above U+FFFF (up to U+10FFFF) are encoded as two 16-bit units, a lead surrogate followed by a trail surrogate. The ranges reserved for surrogates are standardized (lead: 0xD800–0xDBFF; trail: 0xDC00–0xDFFF), and the original code point can be recovered from the pair arithmetically. See surrogate pair for details on this mechanism.
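The surrogate-pair arithmetic can be sketched as follows (a minimal illustration in Python; the helper names are our own, not from any standard library):

```python
def encode_supplementary(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a lead/trail surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # remaining 20-bit value
    lead = 0xD800 + (v >> 10)        # top 10 bits -> 0xD800..0xDBFF
    trail = 0xDC00 + (v & 0x3FF)     # low 10 bits -> 0xDC00..0xDFFF
    return lead, trail

def decode_surrogates(lead: int, trail: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

# U+1F600 (an emoji outside the BMP) encodes as the pair 0xD83D, 0xDE00.
```

Because the lead and trail ranges are disjoint, a decoder can always tell from a single code unit whether it is a complete BMP character, the start of a pair, or the continuation of one.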
Endianness is a key practical concern with UTF-16. The two common byte orders are UTF-16BE (big-endian) and UTF-16LE (little-endian). To disambiguate byte order when files or streams are exchanged, a Byte Order Mark (BOM) may be placed at the start of the text; the BOM is the encoded form of the character U+FEFF and appears as the byte sequence FE FF in big-endian data or FF FE in little-endian data. Some contexts omit the BOM and rely on explicit, out-of-band specification of endianness, which can lead to misinterpretation if the sender and receiver disagree. See Byte Order Mark and Endianness for background.
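A decoder that honors the BOM might look like this sketch (Python for illustration; when no BOM is present it assumes big-endian, the Unicode default for the plain "UTF-16" label, though real systems often assume the local platform's order instead):

```python
def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(b"\xff\xfe"):          # FF FE: little-endian BOM
        return data[2:].decode("utf-16-le")
    if data.startswith(b"\xfe\xff"):          # FE FF: big-endian BOM
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")           # no BOM: assume big-endian

# "A" (U+0041) serialized little-endian with a BOM is FF FE 41 00.
```

Note that the same four bytes decode to different text if the wrong order is assumed, which is exactly the misinterpretation risk described above.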
Technical notes
- Code units and code points: UTF-16 uses 16-bit code units. A code point in the BMP fits in one unit, while supplementary code points require a surrogate pair. See code point and surrogate pair for more on the relationship between code points and code units.
- Endianness forms: The standard forms are UTF-16BE and UTF-16LE. Some systems internally store text in one form and convert as needed for external interfaces. See Endianness.
- Comparison with other Unicode encodings: UTF-16 often provides a compact representation for East Asian and other non-Latin scripts when many characters reside in the BMP, but UTF-8 can be more space-efficient for ASCII-heavy text and is more common on the web. See UTF-8 and UTF-32 for comparison.
- Language and platform support: Several major platforms rely on UTF-16 for internal string representation. For example, Java uses UTF-16 code units for its String type, and the Windows API historically uses UTF-16 for wide-character strings. In the .NET framework, System.String is encoded in UTF-16. See the respective articles for implementation details and API considerations: Java, Windows API, .NET.
- Practical pitfalls: Mixed environments can lead to confusion about string length, indexing, and iteration, since counting code units differs from counting code points. Libraries and language runtimes typically offer utilities to work in terms of code points or to handle surrogate pairs correctly. See code point for more on this distinction.
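The code-unit versus code-point distinction above can be made concrete with a short sketch (Python, whose len() counts code points; encoding to UTF-16 without a BOM and halving the byte count gives the code-unit length a Java or JavaScript string would report):

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2    # no BOM; 2 bytes per unit

s = "A\U0001F600"        # U+0041 plus U+1F600, which lies outside the BMP
# len(s) counts 2 code points, but the emoji needs a surrogate pair,
# so the string occupies 3 UTF-16 code units.
```

Indexing by code unit can therefore land in the middle of a surrogate pair, which is why many APIs provide code-point-aware iteration.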
Adoption and usage
- Windows ecosystem: The core string types and APIs in Windows environments are built around UTF-16 at the level of code units, and this influences file formats, inter-process communication, and many developer tools. See Windows API.
- Programming languages and runtimes: Java uses UTF-16 as its internal string representation, while C and C++ expose UTF-16 through the char16_t type and, on some platforms, wide-character strings. See Java and char16_t.
- Cross-platform and files: While desktop and enterprise software often rely on UTF-16 for internal text handling, data interchange on the Internet and in many open formats frequently uses UTF-8 for its ASCII compatibility and freedom from byte-order issues. See UTF-8.
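The size trade-off between the two encodings can be checked directly (a Python sketch; the byte counts follow from the encoding rules, since BMP CJK characters take 3 bytes in UTF-8 but only 2 in UTF-16):

```python
ascii_text = "hello world"                    # 11 ASCII characters
cjk_text = "\u4f60\u597d\u4e16\u754c"         # 你好世界, four BMP code points

ascii_utf8 = len(ascii_text.encode("utf-8"))       # 1 byte per character
ascii_utf16 = len(ascii_text.encode("utf-16-le"))  # 2 bytes per character
cjk_utf8 = len(cjk_text.encode("utf-8"))           # 3 bytes per character
cjk_utf16 = len(cjk_text.encode("utf-16-le"))      # 2 bytes per character
```

For ASCII-heavy text UTF-8 wins by a factor of two, while for BMP-heavy East Asian text UTF-16 comes out ahead, matching the comparison made earlier.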
Characteristics and design considerations
- Pros: UTF-16 offers relatively compact storage for many non-Latin scripts, and it aligns well with the needs of environments that already operate in 16-bit units. It simplifies certain kinds of string processing for languages with many BMP characters and for APIs designed around 16-bit code units.
- Cons: For text that is predominantly ASCII, UTF-8 is typically more space-efficient. Endianness and the handling of surrogate pairs add complexity to string processing, indexing, and I/O. Data interchange requires careful specification of the encoding form (and endianness) to avoid misinterpretation.
- Compatibility: Because UTF-16 is a standard Unicode encoding form, it remains compatible with the broader Unicode ecosystem, but practical interoperability depends on clearly defined encoding parameters and correct handling of surrogate pairs, endianness, and BOM usage.