Copyright © 1995-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.
The Unicode module gives access to the information available from the Unicode database.
Current information about the Unicode module.
For a definition of terms used in this document, please refer to the Unicode glossary.
The API documentation is extracted automatically by Doxygen.
Most of the public API of this module is exported through the
Unicode class.
This class is never instantiated, but instead used as a namespace.
The Unicode class has six types of functionality:
Please read the API documentation and the sections below for details.
The UTF8Decoder and UTF8Encoder classes implement stand-alone support
for converting from and to the UTF-8 encoding form. The encodings
module uses classes derived from these when decoding UTF-8 data
through the CharConverter framework.
Unicode defines the type UnicodePoint, used to represent a single
code point. Functions operating on single code points typically take
this as the argument. To use these correctly, the calling code needs
to decode surrogate pairs into UnicodePoint values. See
Handling surrogate pairs below for more information.
The UnicodeSegmenter class supports segmentation of text; see below
for more information.
The UnicodeStringIterator
is a utility class for iterating through strings.
The BidiStatus, ParagraphBidiSegment and BidiCalculation classes
implement the Unicode bidirectional algorithm; see below for more
information.
Some algorithms have been tailored to suit Opera and the web.
When dealing with single Unicode code points you need to be aware of
surrogate pairs. Functions operating on UnicodePoint values need the
full code point; therefore, any surrogate pairs need to be decoded
before being passed.
When using functions that operate on strings (uni_char*), this is
done internally and no special handling is required by the caller.
To help callers, Unicode provides a couple of helper functions.
Unicode::DecodeSurrogate() and Unicode::MakeSurrogate() can be used
to decode and create surrogate pairs, respectively. To iterate
through a string, Unicode::GetUnicodePoint() can be used.
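As a hedged illustration (the exact parameter lists here are
assumptions, in particular whether GetUnicodePoint() reports how many
code units it consumed; consult the API documentation), iterating
over a string by code point might look like this:

    // Sketch only: assumes GetUnicodePoint(str, remaining, consumed)
    // returns the code point at 'str' and sets 'consumed' to 1 or 2
    // UTF-16 code units.
    void VisitCodePoints(const uni_char *str, int length)
    {
        int pos = 0;
        while (pos < length)
        {
            int consumed = 1;
            UnicodePoint up = Unicode::GetUnicodePoint(str + pos, length - pos, consumed);
            // ... operate on the full code point 'up' here ...
            pos += consumed; // 2 when a surrogate pair was decoded
        }
    }

    // Assumed shapes for the pairwise helpers:
    // UnicodePoint up = Unicode::DecodeSurrogate(high, low);
    // Unicode::MakeSurrogate(up, high, low);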
Opera internally stores all text in the UTF-16 format, but this is just one of the several transformation formats that can be used for Unicode text. For byte-oriented storage and network protocols, the UTF-8 encoding is well suited.
Use UTF8Encoder::Measure() to determine the space requirements for
storing a piece of UTF-16 text in UTF-8 format, and
UTF8Encoder::Convert() to convert.
Conversely, use
UTF8Decoder::Measure()
to determine the space requirements for storing a piece of UTF-8 text in
UTF-16 format (note that the return value is in bytes, not in UTF-16 code
units), and
UTF8Decoder::Convert()
to convert.
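A minimal sketch of the encoding direction, assuming a
CharConverter-style (source, source_bytes, destination,
destination_capacity, bytes_read) parameter convention; the exact
signatures and the use of the OP_NEWA allocation macro should be
checked against the actual headers:

    // Sketch only: lengths are in bytes in both directions.
    UTF8Encoder encoder;
    int src_bytes = uni_strlen(utf16_text) * sizeof(uni_char);
    int utf8_bytes = encoder.Measure(utf16_text, src_bytes); // assumed signature
    char *utf8 = OP_NEWA(char, utf8_bytes);
    if (utf8)
    {
        int read = 0;
        encoder.Convert(utf16_text, src_bytes, utf8, utf8_bytes, &read);
    }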
Some systems internally make use of the UTF-32 encoding (notably those using
the GNU C library, glibc) for its one-to-one mapping between Unicode
code-points and 32-bit integer values.
If API_UC_UTF32_CODECS is imported from the Unicode module,
the classes
UTF32Encoder and
UTF32Decoder are made
available for conversion from and to the internally used UTF-16 encoding.
Their APIs mirror those of the UTF-8 classes described above.
UTF-32 data is only supported when encoded in the native byte order of the running binary.
Use Unicode::Normalize()
to convert any string of UTF-16 code units into one of the four
Unicode normalization forms NFC, NFKC, NFD and NFKD.
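As a sketch (the (source, length, compose, compat) parameter list and
the flag-to-form mapping are assumptions), requesting NFC could look
like:

    // Sketch only: compose=TRUE, compat=FALSE is assumed to select NFC.
    uni_char *nfc = Unicode::Normalize(input, uni_strlen(input), TRUE, FALSE);
    if (nfc)
    {
        // ... use the normalized copy ...
        OP_DELETEA(nfc); // caller owns the result (see the memory notes below)
    }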
Use Unicode::DecomposeNarrowWide()
to replace all fullwidth and halfwidth characters (those defined with decomposition types <wide> and <narrow>) with their decompositions.
Use Unicode::ToUpper()
and Unicode::ToLower()
to convert a single character to a specific case.
Use Unicode::Upper()
and Unicode::Lower()
to convert a nul-terminated string, either in-place or out-of-place.
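For illustration (signatures, in particular the in-place flag, are
assumptions):

    // Sketch only.
    UnicodePoint u = Unicode::ToUpper(cp);         // single code point
    Unicode::Upper(buffer, TRUE);                  // whole string, in-place (assumed flag)
    uni_char *low = Unicode::Lower(buffer, FALSE); // out-of-place copy; caller frees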
Case folding can be an important part of normalizing text for comparison purposes. The main difference between simple and full case folding is that simple folding never changes the string length, while full folding can. Note that where they can be supported, the full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
For more information about case folding please read section 3.13 "Default Case Algorithms" in The Unicode Standard 5.2.
Use Unicode::ToCaseFoldFull(const uni_char *)
to perform full case folding of every character in the specified UTF-16 encoded string.
Use Unicode::ToCaseFoldFull(UnicodePoint, UnicodePoint *)
to perform full case folding for the specified single Unicode codepoint.
Use Unicode::ToCaseFoldSimple()
to perform simple case folding on the specified Unicode codepoint.
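A hedged sketch of the three entry points (signatures and the return
value of the single-codepoint variant are assumptions; full folding
of one code point can expand to at most three code points):

    // Sketch only.
    uni_char *folded = Unicode::ToCaseFoldFull(str);     // whole string; caller frees
    UnicodePoint out[3];                                 // worst case for one code point
    int n = Unicode::ToCaseFoldFull(cp, out);            // assumed to return the count
    UnicodePoint simple = Unicode::ToCaseFoldSimple(cp); // always one-to-one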
Use Unicode::GetCharacterClass()
to retrieve information
about the Unicode character class for a specific UTF-16 code unit.
There are also several convenience APIs, such as
Unicode::IsUpper()
to check for one or more specific classes.
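For example (the CC_Lu-style class name is an illustrative
assumption):

    // Sketch only.
    CharacterClass cc = Unicode::GetCharacterClass(c);
    if (cc == CC_Lu || Unicode::IsUpper(c)) // CC_Lu: assumed name for "uppercase letter"
    {
        // ...
    }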
Use Unicode::CheckPropertyValue()
to retrieve the value of one of the Unicode extended character properties for a specific Unicode codepoint.
The list of properties available for lookup is defined by enum UnicodeProperties.
Use Unicode::CheckDerivedPropertyValue()
to retrieve the value of one of the Unicode derived character properties for a specific Unicode codepoint.
The list of derived properties available for lookup is defined by enum UnicodeDerivedProperties.
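As a sketch (the property enum value names below are hypothetical
placeholders, not verified members of the enums named above):

    // Sketch only: property enum names are made up for illustration.
    BOOL ws    = Unicode::CheckPropertyValue(cp, UNICODE_PROP_WHITE_SPACE);
    BOOL alpha = Unicode::CheckDerivedPropertyValue(cp, UNICODE_DERIVED_ALPHABETIC);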
Use Unicode::GetUnicodeBlockType()
to retrieve the Unicode block type for a specific Unicode codepoint.
Possible blocks are defined by enum UnicodeBlockType (generated from Blocks.txt).
Use Unicode::GetUnicodeJoiningType()
to retrieve the Unicode joining type for a specific Unicode codepoint.
Possible joining types are defined by enum UnicodeJoiningType and are:
For more information about joining types please read section 8.2 "Arabic" in The Unicode Standard 5.2.
Use
Unicode::IsLinebreakOpportunity()
to look up line breaking opportunities for two characters.
The method
Unicode::GetLineBreakClass()
can be used to access the line-breaking class for a specific code point.
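A hedged sketch of the basic pairwise lookup (parameter lists
assumed):

    // Sketch only.
    if (Unicode::IsLinebreakOpportunity(prev_cp, next_cp))
    {
        // a line may be broken between these two characters
    }
    LinebreakClass cls = Unicode::GetLineBreakClass(next_cp);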
The second signature for
Unicode::IsLinebreakOpportunity()
performs line breaking that is aware of the LB_SA line-breaking
class. The LB_SA class is used in some languages where SPACE is
not used as a word separator; line breaking (and word segmentation)
therefore depends on more in-depth knowledge of the language itself. Normally, for
LB_SA text, this second signature for
Unicode::IsLinebreakOpportunity()
provides line breaking opportunities between every grapheme cluster (symbol).
However, the code can be patched to do more sophisticated (typically
language-dependent) line breaking algorithms.
One such language-dependent algorithm is included:
FallbackThaiUnicodeLinebreaker.
This class is derived from a simple and naive Thai line breaking algorithm
first implemented in UserJS.
This algorithm is far from 100% accurate, and instead uses a collection of
data tables and very simple heuristics to insert line break opportunities.
The data tables are:
The algorithm then inserts line break opportunities in the following cases:
See FallbackThaiUnicodeLinebreaker::IsLinebreakOpportunity() for the implementation.
Use
Unicode::GetBidiCategory()
to retrieve directionality information for a specific UTF-16 code unit.
Use Unicode::IsMirrored()
and
Unicode::GetMirrorChar()
to look up mirroring information.
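For example (signatures assumed):

    // Sketch only.
    BidiCategory dir = Unicode::GetBidiCategory(c);
    if (Unicode::IsMirrored(c))
        c = Unicode::GetMirrorChar(c); // e.g. '(' pairs with ')' in right-to-left runs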
Use
Unicode::GetScriptType()
to retrieve script type for a specific code point.
To find character (“grapheme cluster”), word, and sentence
boundaries in text, use the
UnicodeSegmenter
class.
To use it, you instantiate the object, determining which type of
boundaries you wish to find, and iteratively call
UnicodeSegmenter::FindBoundary()
on the text.
See the API documentation
for details on how to use it.
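A sketch of the iteration pattern (the constructor argument and the
FindBoundary() signature are assumptions; see the API documentation
for the real interface):

    // Sketch only: split a buffer into word segments.
    UnicodeSegmenter seg(UnicodeSegmenter::Word);
    const uni_char *p = text;
    int remaining = length;
    while (remaining > 0)
    {
        int boundary = seg.FindBoundary(p, remaining); // code units to next boundary
        if (boundary <= 0)
            break; // defensive: avoid looping on unexpected results
        // [p, p + boundary) is one segment
        p += boundary;
        remaining -= boundary;
    }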
There is a shortcut method
UnicodeSegmenter::IsGraphemeClusterBoundary()
that can be used if you need to check whether two UTF-16 code units
should be kept together, either because they belong to the same
surrogate pair, or because the second is an extension of the first
(combining marks and similar).
The text segmenter implementation is documented separately.
To support bi-directional text and right-to-left languages (such as Arabic and Hebrew), this module contains an implementation of the Unicode bidirectional algorithm. This implementation is discussed in a separate document.
The tables in the
data
subdirectory are retrieved from the
Unicode database.
When a new version of the database is released, these tables should be
updated with the new files and the scripts in the
scripts
directory re-ran to update the generated files.
To run the scripts, enter the
scripts
directory and type make to create the data.
The generated files are committed to the version control system.
The Unicode module is based on the data supplied by the Unicode Consortium. The data and algorithms are currently up-to-date with version 6.1 of the standard.
The following parts of the Unicode standard are supported by this module:
Up to U+2FFFF only (BMP, SMP and SIP).
U+FFFF only.
U+FFFF only.
See also the API documentation, implementation of text segmenter, implementation of the bidirectional algorithm and details on Opera-specific tailoring.
The only code that allocates heap memory is the normalization code and the out-of-place case conversions. The size of the allocated data is a function of the amount of data submitted. The memory allocated must be deallocated by the caller.
The normalization code contains recursive calls for decomposing characters. Decomposing any single character should be fairly shallow, though. Composing characters is done iteratively.
There are no global pointers. The data obtained from the Unicode database is encoded in the binary image.
Any out-of-memory conditions will be reported to, and must be handled by, the caller.
To improve speed and reduce binary footprint, the data tables are compressed in various ways.
The directionality (bidi) table is a run-length compressed table with mappings from UTF-16 code units to directionality classes. This table is binary-searched on lookup.
The last block found is cached in the module, since there often are repeated calls requesting information for characters in the same block.
The mirroring table is split into two for performance. The first table is sorted by UTF-16 code unit so that it can be binary searched. However, since this involves both code units, some mappings cannot be included in this table because the mirroring ranges overlap. These special cases are added in a second table that is linearly searched when lookup fails in the first table.
The last value looked up is cached internally, since we use the same method for checking if a character is mirrored as to find its mirror image.
The code using these tables is discussed in a separate document.
The case conversion is implemented as a three-way table, with case conversion to both lower and upper case stored in the same table.
The first table contains indices to the second. This table is binary searched for performance.
The second table contains case conversion information data.
The lower CASE_INFO_BITS bits contain an index into the third
table, whose data are interpreted according to the following bits.
The data from the third table is either a bit mask or a shift index.
Special cases: ASCII (U+0000 to U+007F)
is hardcoded into the lookup to improve performance.
The decomposition tables contain data directly from the Unicode database. The BMP table maps a single UTF-16 code unit into two UTF-16 code units defining its decomposition. The non-BMP table maps one non-BMP codepoint, expressed as a 32-bit value where the high word is the high surrogate and the low word is the low surrogate, into two 32-bit values, which in turn are either surrogate pairs or single BMP codepoints. These two characters are then decomposed further in turn.
The compatibility decomposition table is a table with flag bits stating whether the decomposition for the character is a compatibility decomposition or not.
The composition table contains the same information as the decomposition table, but sorted on the character to be composed.
The canonical combining class tables contain the classification for the combining marks in Unicode, used to sort them for re-composition. The BMP table uses a UTF-16 code unit as the index, whereas the non-BMP table uses a pair of UTF-16 surrogates mapped into a 32-bit word as the index. To improve look-up speed, portions of the table data are hardcoded into the look-up function. This hardcoding is validated by the generator script to ensure that it still holds after an upgrade of the Unicode character database.
Both the character class and the line-breaking property tables are stored as run-length compressed tables with a flat structure for the first 256 codepoints. The run-length compressed tables are each split into one table of indices and one table of data.
Both the block type and the joining type property tables are stored in run-length compressed form, each split into one table of indices and one table of property values. The first table of each (containing indices into the second) is binary searched for performance.
All extended property tables (used by the CheckPropertyValue method) are constructed so that each element with an even index starts a range of codepoints for which the value is TRUE, and each element with an odd index starts a range for which the value is FALSE.
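The lookup this layout implies can be sketched as follows (the table
name and exact types are illustrative, not the module's actual
internals):

    // Sketch only: binary-search for the last range start <= cp;
    // an even index means the property is TRUE, an odd index FALSE.
    static BOOL LookupRangeProperty(const UnicodePoint *table, int len, UnicodePoint cp)
    {
        int lo = 0, hi = len - 1, found = -1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (table[mid] <= cp) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found >= 0 && (found & 1) == 0;
    }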
The full case folding table is simply a sorted table of codepoints with their mappings, and is not compressed in any way.
The tables for simple case folding (both for BMP and for non-BMP) are stored as run-length compressed tables where each element contains three fields: the codepoint starting the run, a modifier, and a compression type. The compression type can be one of the following:
The text segmenter implementation is documented separately.
The bidirectional algorithm implementation is discussed separately.
The "quick check" optimization mentioned in the normalization algorithm is not implemented, the only similar optimization done is that no normalization is done if the input string only contains characters in the Basic Latin (ASCII) range.
Some optimizations have been attempted in the casing conversions. Reordering the ASCII tests is one possibility, although looking at different input profiles yields different results on what is faster.
It might be worth replacing the ASCII lookup with a table (which would take 128 bytes in each direction, since ASCII characters only map to other ASCII characters).
Research whether Unicode::Upper() (string) is really more often called directly than Unicode::ToUpper() (character), and check if splitting out common parts into a third function makes sense. Is the function call overhead too big?
Investigate whether the text segmenter implementation can be improved by using something more sophisticated than the simple state machines currently implemented.