Unicode module

Copyright © 1995-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.

Introduction

The Unicode module gives access to the information available from the Unicode database.

Current information about the Unicode module.

For a definition of terms used in this document, please refer to the Unicode glossary.

Interface overview and API documentation

API documentation

The API documentation is extracted automatically by Doxygen.

Overview

Most of the public API of this module is exported through the Unicode class. This class is never instantiated, but is instead used as a namespace. The Unicode class has six types of functionality:

  1. Normalization
  2. Line breaking
  3. Case conversion
  4. Property lookup
  5. Directionality lookup
  6. Script-type lookup

Please read the API documentation and the sections below for details.

The UTF8Decoder and UTF8Encoder classes implement stand-alone support for converting from and to the UTF-8 encoding form. The encodings module uses classes derived from these when decoding UTF-8 data through the CharConverter framework.

Unicode defines the type UnicodePoint, used to represent a single code point. Functions operating on single code points typically take this as their argument. To use them correctly, the calling code needs to decode surrogate pairs into UnicodePoint values. See Handling surrogate pairs for more information.

The UnicodeSegmenter class supports segmentation of text; see below for more information.

The UnicodeStringIterator is a utility class for iterating through strings.

The BidiStatus, ParagraphBidiSegment and BidiCalculation classes implement the Unicode bidirectional algorithm; see below for more information.

Some algorithms have been tailored to suit Opera and the web.

Handling surrogate pairs

When dealing with single Unicode code points you need to be aware of surrogate pairs. Functions operating on UnicodePoint values need the full code point, so any surrogate pairs need to be decoded before being passed. Functions that operate on strings (uni_char*) do this internally, and no special handling is required by the caller.

To help with this, Unicode provides a couple of helper functions: Unicode::DecodeSurrogate() and Unicode::MakeSurrogate() decode and create surrogate pairs, respectively. To iterate through a string, Unicode::GetUnicodePoint() can be used.
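
As an illustration of what this decoding involves, the minimal standalone sketch below walks a UTF-16 buffer and combines surrogate pairs by hand. It is not the module's code: in Opera code Unicode::GetUnicodePoint() should be used instead, and the uni_char/UnicodePoint typedefs shown here are assumed stand-ins for the real types.

  #include <stddef.h>
  #include <stdint.h>

  typedef uint16_t uni_char;      // assumed stand-in: UTF-16 code unit
  typedef uint32_t UnicodePoint;  // assumed stand-in: full code point

  // Return the code point starting at s[i] and advance i past it. In module
  // code, Unicode::GetUnicodePoint() serves this purpose; this sketch only
  // shows the arithmetic involved.
  UnicodePoint NextCodePoint(const uni_char *s, size_t len, size_t &i)
  {
      UnicodePoint high = s[i++];
      if (high >= 0xD800 && high <= 0xDBFF && i < len)
      {
          UnicodePoint low = s[i];
          if (low >= 0xDC00 && low <= 0xDFFF)
          {
              ++i;  // consume the low surrogate as well
              // Combine the pair; Unicode::DecodeSurrogate() performs this step.
              return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
          }
      }
      return high;  // BMP code unit (unpaired surrogates pass through unchanged)
  }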

Use-cases

Converting to and from UTF-8 and UTF-32

Opera internally stores all text in the UTF-16 format, but this is just one of the several transformation formats that can be used for Unicode text. For byte-oriented storage and network protocols, the UTF-8 encoding is well suited.

Use UTF8Encoder::Measure() to determine the space requirements for storing a piece of UTF-16 text in UTF-8 format, and UTF8Encoder::Convert() to convert.

Conversely, use UTF8Decoder::Measure() to determine the space requirements for storing a piece of UTF-8 text in UTF-16 format (note that the return value is in bytes, not UTF-16 code units), and UTF8Decoder::Convert() to convert.
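
To make the measure-then-convert pattern concrete, here is a standalone sketch of the per-code-point arithmetic such passes perform (reusing the typedefs from the sketch in Handling surrogate pairs above). This is the standard UTF-8 byte layout, not the module's actual implementation.

  // Bytes needed to store one code point in UTF-8 -- the kind of calculation
  // a Measure() pass sums over the whole input.
  int Utf8Length(UnicodePoint cp)
  {
      if (cp < 0x80)    return 1;
      if (cp < 0x800)   return 2;
      if (cp < 0x10000) return 3;
      return 4;
  }

  // Encode one code point into out[]; returns the number of bytes written --
  // what a Convert() pass does repeatedly.
  int Utf8Encode(UnicodePoint cp, unsigned char *out)
  {
      if (cp < 0x80)
      {
          out[0] = (unsigned char)cp;
          return 1;
      }
      if (cp < 0x800)
      {
          out[0] = (unsigned char)(0xC0 | (cp >> 6));
          out[1] = (unsigned char)(0x80 | (cp & 0x3F));
          return 2;
      }
      if (cp < 0x10000)
      {
          out[0] = (unsigned char)(0xE0 | (cp >> 12));
          out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
          out[2] = (unsigned char)(0x80 | (cp & 0x3F));
          return 3;
      }
      out[0] = (unsigned char)(0xF0 | (cp >> 18));
      out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char)(0x80 | (cp & 0x3F));
      return 4;
  }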

Some systems (notably those using the GNU C library, glibc) internally make use of the UTF-32 encoding for its one-to-one mapping between Unicode code points and 32-bit integer values. If API_UC_UTF32_CODECS is imported from the Unicode module, the classes UTF32Encoder and UTF32Decoder are made available for conversion from and to the internally used UTF-16 encoding. Their APIs mirror those of the UTF-8 classes described above.

UTF-32 data is only supported when encoded in the native byte order of the running binary.

String normalization

Use Unicode::Normalize() to convert any string of UTF-16 code units into one of the four Unicode normalization forms NFC, NFKC, NFD and NFKD.
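
For example, NFC represents "é" as the single codepoint U+00E9, while NFD represents it as U+0065 followed by the combining mark U+0301. A hypothetical usage sketch follows; the parameter list shown for Normalize() is an assumption for illustration, not the documented signature.

  // Hypothetical sketch: the parameters of Normalize() are assumed here.
  // The returned buffer is heap-allocated and owned by the caller (see the
  // memory management section below).
  uni_char *nfc = Unicode::Normalize(input, uni_strlen(input),
                                     TRUE  /* compose => NFC/NFKC */,
                                     FALSE /* canonical, not compatibility */);
  if (nfc)
  {
      /* ... use the normalized string ... */
      delete[] nfc;  // deallocation is the caller's responsibility
  }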

Narrow/Wide decomposition

Use Unicode::DecomposeNarrowWide() to replace all fullwidth and halfwidth characters (those defined with decomposition types <wide> and <narrow>) with their decompositions.

Case conversion

Use Unicode::ToUpper() and Unicode::ToLower() to convert a single character to a specific case.

Use Unicode::Upper() and Unicode::Lower() to convert a nul-terminated string, either in-place or out-of-place.
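
A brief usage sketch; the in-place string call is shown under the assumption that a writable buffer is passed, and the API documentation has the actual contract.

  // Single code point -> single code point.
  UnicodePoint upper = Unicode::ToUpper(0x00E5);  // U+00E5 'å' -> U+00C5 'Å'

  // Whole nul-terminated string; in-place conversion on a writable buffer
  // is assumed in this sketch.
  uni_char buffer[] = { 'm', 'i', 'x', 'e', 'd', 0 };
  Unicode::Upper(buffer);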

Case folding

Case folding can be an important part of normalizing text for comparison purposes. The main difference between simple case folding and full case folding is that the former doesn't change the string length, while the latter can. Note that where they can be supported, the full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.

For more information about case folding please read section 3.13 "Default Case Algorithms" in The Unicode Standard 5.2.

Use Unicode::ToCaseFoldFull(const uni_char *) to perform full case folding of every character in the specified UTF-16 encoded string.

Use Unicode::ToCaseFoldFull(UnicodePoint, UnicodePoint *) to perform full case folding for the specified single Unicode codepoint.

Use Unicode::ToCaseFoldSimple() to perform simple case folding on the specified Unicode codepoint.
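
The typical use is caseless comparison: fold both operands, then compare the results. A sketch follows, where FoldFull() is a hypothetical wrapper around Unicode::ToCaseFoldFull(const uni_char *) that returns a newly allocated folded copy.

  // Sketch of caseless comparison via full case folding. FoldFull() is a
  // hypothetical helper; uni_strcmp() compares uni_char strings.
  bool CaselessEquals(const uni_char *a, const uni_char *b)
  {
      uni_char *fa = FoldFull(a);
      uni_char *fb = FoldFull(b);
      bool equal = fa && fb && uni_strcmp(fa, fb) == 0;
      delete[] fa;
      delete[] fb;
      return equal;
  }
  // With full folding, "MASSE" and "Maße" compare equal: "ß" folds to "ss",
  // a length change that simple folding cannot express.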

Class lookup

Use Unicode::GetCharacterClass() to retrieve information about the Unicode character class for a specific UTF-16 code unit.

There are also several convenience APIs, such as Unicode::IsUpper() to check for one or more specific classes.

Extended property lookup

Use Unicode::CheckPropertyValue() to retrieve the value of one of the Unicode extended character properties for a specific Unicode codepoint.

The list of available properties to look up is defined by the enum UnicodeProperties.

Derived property lookup

Use Unicode::CheckDerivedPropertyValue() to retrieve the value of one of the Unicode derived character properties for a specific Unicode codepoint.

The list of available derived properties to look up is defined by the enum UnicodeDerivedProperties.

Block type lookup

Use Unicode::GetUnicodeBlockType() to retrieve the Unicode block type for a specific Unicode codepoint.

Possible blocks are defined by enum UnicodeBlockType (generated from Blocks.txt).

Joining type lookup

Use Unicode::GetUnicodeJoiningType() to retrieve the Unicode joining type for a specific Unicode codepoint.

Possible joining types are defined by enum UnicodeJoiningType and are:

JT_R - Right-Joining
For example Arabic letters: ALEF, DAL, THAL, REH, ZAIN
JT_L - Left-Joining
Currently there is no character with this joining type in the Unicode Character Database
JT_D - Dual-Joining
For example Arabic letters: BEH, TEH, THEH, JEEM
JT_C - Join-Causing
U+200D ZERO WIDTH JOINER and U+0640 ARABIC TATWEEL. These characters are distinguished from the dual-joining characters in that they do not change shape themselves.
JT_U - Non-Joining
U+200C zero width non-joiner and all spacing characters, except those explicitly mentioned as being one of the other joining types, are non-joining.
JT_T - Transparent
All nonspacing marks (General Category Mn or Me) and most format control characters (General Category Cf) are transparent to cursive joining.

For more information about joining types please read section 8.2 "Arabic" in The Unicode Standard 5.2.

Line breaking

Use Unicode::IsLinebreakOpportunity() to look up line breaking opportunities for two characters. The method Unicode::GetLineBreakClass() can be used to access the line-breaking class for a specific code point.
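
A sketch of scanning a string for break opportunities follows; the two-character parameter list is taken from the description above, while treating the return value as a boolean is an assumption of this sketch.

  // Sketch: report every break opportunity between adjacent characters.
  // For text containing surrogate pairs, decode to UnicodePoint first
  // (see Handling surrogate pairs).
  void FindBreaks(const uni_char *text, size_t len)
  {
      for (size_t i = 1; i < len; ++i)
          if (Unicode::IsLinebreakOpportunity(text[i - 1], text[i]))
              ReportBreakBefore(i);  // hypothetical callback
  }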

Complex Context Dependent line breaking (South East Asian languages)

The second signature for Unicode::IsLinebreakOpportunity() performs line breaking that is aware of the LB_SA line-breaking class. The LB_SA class is used in some languages where SPACE is not used as a word separator, therefore line breaking (and word segmentation) depends on more in-depth knowledge of the language itself. Normally, for LB_SA text, this second signature for Unicode::IsLinebreakOpportunity() provides line breaking opportunities between every grapheme cluster (symbol). However, the code can be patched to do more sophisticated (typically language-dependent) line breaking algorithms.

One such language-dependent algorithm is included: FallbackThaiUnicodeLinebreaker. This class is derived from a simple and naive Thai line breaking algorithm first implemented in UserJS. The algorithm is far from 100% accurate; it uses a collection of data tables and very simple heuristics to insert line break opportunities. The data tables are:

Wordlist
A short list of (51) unambiguous Thai words
LeadingChars
A list of Thai characters that might appear at the start of a word
FollowingChars
A list of Thai characters that typically appear towards the end of a word
EOWChars
A list of Thai characters that only appear at the end of a word

The algorithm then inserts line break opportunities where these word lists and heuristics indicate a likely word boundary.

Directionality lookup

Use Unicode::GetBidiCategory() to retrieve directionality information for a specific UTF-16 code unit.

Use Unicode::IsMirrored() and Unicode::GetMirrorChar() to look up mirroring information.

Script-type lookup

Use Unicode::GetScriptType() to retrieve the script type for a specific code point.

Text segmentation

To find character (“grapheme cluster”), word, and sentence boundaries in text, use the UnicodeSegmenter class. Instantiate it, selecting which type of boundary you wish to find, and iteratively call UnicodeSegmenter::FindBoundary() on the text. See the API documentation for details on how to use it.

There is a shortcut method, UnicodeSegmenter::IsGraphemeClusterBoundary(), that can be used if you need to check whether two UTF-16 code units should be kept together, either because they form a surrogate pair or because the second is an extension of the first (combining marks and similar).
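
For example, a sketch that counts grapheme clusters using the shortcut (the loop structure and boolean use of the return value are assumptions; the full UnicodeSegmenter/FindBoundary() interface is in the API documentation):

  // Count grapheme clusters in a UTF-16 string (sketch). A full segmentation
  // pass would instead instantiate UnicodeSegmenter and call FindBoundary().
  size_t CountClusters(const uni_char *text, size_t len)
  {
      if (len == 0)
          return 0;
      size_t clusters = 1;
      for (size_t i = 1; i < len; ++i)
          if (UnicodeSegmenter::IsGraphemeClusterBoundary(text[i - 1], text[i]))
              ++clusters;  // a new cluster starts at text[i]
      return clusters;
  }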

The text segmenter implementation is documented separately.

Bidirectional algorithm

To support bi-directional text and right-to-left languages (such as Arabic and Hebrew), this module contains an implementation of the Unicode bidirectional algorithm. This implementation is discussed in a separate document.

Updating data tables

The tables in the data subdirectory are retrieved from the Unicode database. When a new version of the database is released, these tables should be updated with the new files and the scripts in the scripts directory re-run to update the generated files.

To run the scripts, enter the scripts directory and type make to create the data. The generated files are committed to the version control system.

Supported standards

The Unicode module is based on the data supplied by the Unicode Consortium. The data and algorithms are currently up-to-date with version 6.1 of the standard.

The following parts of the Unicode standard are supported by this module:

Implementation and design

See also the API documentation, implementation of text segmenter, implementation of the bidirectional algorithm and details on Opera-specific tailoring.

Memory management

Heap usage

The only code that allocates heap memory is the normalization code and the out-of-place case conversion. The size of the allocation is a function of the amount of data submitted, and the allocated memory must be deallocated by the caller.

Stack usage

The normalization code contains recursive calls for decomposing characters. Decomposing any single character should be fairly shallow, though. Composing characters is done iteratively.

Static memory usage

There are no global pointers. The data obtained from the Unicode database is encoded in the binary image.

OOM policies

Any out-of-memory conditions will be reported to, and must be handled by, the caller.

Performance

To improve speed and reduce binary footprint, the data tables are compressed in various ways.

Directionality tables

The directionality (bidi) table is a run-length compressed table with mappings from UTF-16 code units to directionality classes. This table is binary-searched on lookup.

The last block found is cached in the module, since there often are repeated calls requesting information for characters in the same block.
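
The scheme can be illustrated with a self-contained sketch; the struct and member names here are invented for illustration, but the layout (run starts plus per-run values), the binary search, and the one-entry cache mirror the description above.

  #include <stdint.h>

  // runs[] holds the first code unit of each run (sorted ascending, starting
  // at 0), classes[] the class of that run. A code unit belongs to the last
  // run starting at or before it.
  struct RunTable
  {
      const uint16_t *runs;
      const uint8_t  *classes;
      int count;
      int cached;  // index of the last run found

      uint8_t Lookup(uint16_t cu)
      {
          // Fast path: consecutive lookups often land in the same run.
          if (cu >= runs[cached] &&
              (cached + 1 == count || cu < runs[cached + 1]))
              return classes[cached];

          int lo = 0, hi = count - 1;
          while (lo < hi)  // binary search: largest index with runs[i] <= cu
          {
              int mid = (lo + hi + 1) / 2;
              if (runs[mid] <= cu)
                  lo = mid;
              else
                  hi = mid - 1;
          }
          cached = lo;
          return classes[lo];
      }
  };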

The mirroring table is split in two for performance. The first table is sorted by UTF-16 code unit so that it can be binary searched. However, since the mapping involves both code units, there are some mappings that cannot be included in this table because the mirroring ranges overlap. These special cases are placed in a second table that is searched linearly when lookup in the first table fails.

The last value looked up is cached internally, since the same method is used both for checking whether a character is mirrored and for finding its mirror image.

The code using these tables is discussed in a separate document.

Case conversion tables

The case conversion is implemented as a three-way table, with case conversion to both lower and upper case stored in the same table.

The first table contains indices to the second. This table is binary searched for performance.

The second table contains case conversion information. The lower CASE_INFO_BITS bits contain an index into the third table, whose data are interpreted according to the remaining bits. The data from the third table is either a bit mask or a shift index, used as follows (a sketch of applying these values follows the list):

Lower delta
The value is a delta for converting from lower to upper case. When converting to upper case, this value is added to the input code point.
Upper delta
The value is a delta for converting from upper to lower case. When converting to lower case, this value is subtracted from the input code point.
Double delta
The value is a delta used for converting from title-case to either lower or upper case. When converting to upper case, this value is added to the input code point, and when converting to lower case, this value is subtracted from the input code point.
Large lower delta
Large upper delta
These are the same as the lower/upper delta, except that the value is to be multiplied by two. This allows for storing deltas up to 65536.
Bit mask
The value is a bit mask for converting the case: using OR yields the uppercase character, and using AND with the inverted value yields the lowercase character.
Bit mask offset
The value is an offset bit mask for converting the case, with the same rules as for the bit mask type, but shifted by one codepoint.
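
A sketch of how such a tagged value could be applied when uppercasing follows; the tag names and dispatch are invented for illustration, and the per-tag arithmetic paraphrases the list above.

  // Illustrative sketch only -- not the module's actual code.
  enum CaseDataType
  {
      LOWER_DELTA, UPPER_DELTA, DOUBLE_DELTA,
      LARGE_LOWER_DELTA, LARGE_UPPER_DELTA,
      BIT_MASK, BIT_MASK_OFFSET
  };

  UnicodePoint ApplyToUpper(UnicodePoint cp, CaseDataType type, UnicodePoint value)
  {
      switch (type)
      {
      case LOWER_DELTA:       return cp + value;      // lower -> upper: add delta
      case UPPER_DELTA:       return cp;              // already upper case
      case DOUBLE_DELTA:      return cp + value;      // title -> upper: add delta
      case LARGE_LOWER_DELTA: return cp + value * 2;  // stored delta is halved
      case LARGE_UPPER_DELTA: return cp;              // already upper case
      case BIT_MASK:          return cp | value;      // OR yields upper case
      case BIT_MASK_OFFSET:   return cp | value;      // as BIT_MASK; the one-
                                                      // codepoint shift is
                                                      // omitted in this sketch
      }
      return cp;
  }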

Special cases:

Normalization tables

The decomposition tables contain data directly from the Unicode database. The BMP table maps a single UTF-16 code unit to two UTF-16 code units defining its decomposition. The non-BMP table maps one non-BMP codepoint, expressed as a 32-bit value where the high word is the high surrogate and the low word is the low surrogate, to two 32-bit values which in turn are either surrogate pairs or single BMP codepoints. The two resulting characters may in turn be decomposed further.

The compatibility decomposition table is a table with flag bits stating whether the decomposition for the character is a compatibility decomposition or not.

The composition table contains the same information as the decomposition table, but sorted on the character to be composed.

The canonical combining class tables contain the classification for the combining marks in Unicode, used to sort them for re-composition. The BMP table uses a UTF-16 code unit as the index, whereas the non-BMP table uses a pair of UTF-16 surrogates mapped into a 32-bit word as the index. To improve look-up speed, portions of the table data are hardcoded into the look-up function. This hardcoding is validated by the generator script, to ensure that it still holds after an upgrade of the Unicode character database.

Character properties tables

Both the character class and line-breaking property tables are stored as run-length compressed tables, with a flat structure for the first 256 codepoints. The run-length compressed tables are each split into one table of indices and one table of data.

Block type and joining type table

Both the block type and joining type property tables are stored in run-length compressed form, each split into one table of indices and one table of property values. The first table of each pair (containing indices into the second) is binary searched for performance.

Extended properties tables

All extended property tables (used by the CheckPropertyValue() method) are constructed so that each element with an even index starts a range of codepoints for which the property value is TRUE, and each element with an odd index starts a range for which the value is FALSE.
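
A self-contained sketch of a lookup over such an alternating-range table (names invented for illustration):

  // ranges[] is sorted ascending; even indices start TRUE ranges, odd
  // indices start FALSE ranges.
  bool HasProperty(const UnicodePoint *ranges, int count, UnicodePoint cp)
  {
      if (count == 0 || cp < ranges[0])
          return false;                    // before the first TRUE range
      int lo = 0, hi = count - 1;
      while (lo < hi)                      // largest index with ranges[i] <= cp
      {
          int mid = (lo + hi + 1) / 2;
          if (ranges[mid] <= cp)
              lo = mid;
          else
              hi = mid - 1;
      }
      return (lo & 1) == 0;                // even index => property is TRUE
  }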

Case folding tables

The full case folding table is simply a sorted table of codepoints with their mappings, and is not compressed in any way.

The tables for simple case folding (both for BMP and non-BMP codepoints) are stored as run-length compressed tables in which each element contains three fields: the codepoint starting the run, a modifier, and a compression type. The compression type can be one of the following (see also the sketch after this list):

1. Constant delta
When mapping to the case-folded form, the modifier value is added to the input code point.
2. Set bit 0
When mapping to the case-folded form, the least significant bit of the input code point is set.
3. Clear bit 0
When mapping to the case-folded form, the least significant bit of the input code point is cleared.
4. Constant delta for even codepoints
For even input code points, the modifier value is added to the input code point to obtain the case-folded form. Odd code points are mapped to themselves.
5. Constant delta for odd codepoints
For odd input code points, the modifier value is added to the input code point to obtain the case-folded form. Even code points are mapped to themselves.
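
Applying a run's compression type could look like the following sketch; the tag names are invented, but the semantics of each case are taken directly from the list above.

  // Illustrative sketch only -- not the module's actual code.
  enum FoldType
  {
      CONSTANT_DELTA, SET_BIT_0, CLEAR_BIT_0,
      DELTA_IF_EVEN, DELTA_IF_ODD
  };

  UnicodePoint FoldSimple(UnicodePoint cp, FoldType type, int modifier)
  {
      switch (type)
      {
      case CONSTANT_DELTA: return cp + modifier;
      case SET_BIT_0:      return cp | 1;
      case CLEAR_BIT_0:    return cp & ~(UnicodePoint)1;
      case DELTA_IF_EVEN:  return (cp & 1) == 0 ? cp + modifier : cp;
      case DELTA_IF_ODD:   return (cp & 1) != 0 ? cp + modifier : cp;
      }
      return cp;
  }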

Text segmenter

The text segmenter implementation is documented separately.

Bidirectional algorithm

The bidirectional algorithm implementation is discussed separately.

Possible performance improvements

Normalization

The "quick check" optimization mentioned in the normalization algorithm is not implemented, the only similar optimization done is that no normalization is done if the input string only contains characters in the Basic Latin (ASCII) range.

Case conversion

Some optimizations have been attempted in the casing conversions. Reordering the ASCII tests is one possibility, although different input profiles yield different results on what is faster.

It might be worth replacing the ASCII lookup with a table (which would take 128 bytes in each direction, since ASCII characters only map to other ASCII characters).

Research whether Unicode::Upper() (string) really is called directly more often than Unicode::ToUpper() (character), and check whether splitting the common parts into a third function makes sense. Is the function call overhead too big?

Text segmenter

Investigate whether the implementation of the text segmenter can be improved by using something other than the simple state machines currently implemented.