Encodings module

Copyright © 1999-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.

Introduction

The Encodings module provides support for converting text between legacy character encodings and the Opera internal encoding (UTF-16). It also provides a framework for loading binary data tables from an external source (the encoding.bin file).


For information on generating the data files needed by the code in this module, and on which third-party features must be enabled when using them, please refer to the i18ndata module documentation.

Interface overview and API documentation

The API documentation is extracted automatically by Doxygen.

Use-cases

Converting network data to internal representation

An instance of the InputConverter class is obtained through the InputConverter::CreateCharConverter() method. The byte stream coming in from the network is then pumped through the Convert() method, which converts the input from the legacy encoding used in the network data to the internal representation (UTF-16).

There are helper functions available in other modules to facilitate this: the URL_DataDescriptor class will set up the proper encoding conversion automatically when instantiated with the correct parameters. It is also possible to use the class directly.
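
The calling pattern can be sketched as follows. This is an illustrative, self-contained sketch only: the class below is not the actual Opera InputConverter, and its Convert() signature (source buffer, byte length, destination buffer, destination capacity, bytes-read out-parameter) is an assumption modeled on the description above. It decodes ISO 8859-1, where every byte maps directly to a UTF-16 code unit.

```cpp
#include <cstdint>

// Sketch of an input converter with a Convert()-style interface
// (assumed shape, not the real Opera API). ISO 8859-1 -> UTF-16.
class Latin1ToUTF16Sketch
{
public:
    // Reads up to `len` bytes from `src`, writes UTF-16 code units into
    // `dest` (capacity `maxlen` units), reports bytes consumed via `read`,
    // and returns the number of code units written.
    int Convert(const char *src, int len, uint16_t *dest, int maxlen,
                int *read)
    {
        int written = 0;
        int consumed = 0;
        while (consumed < len && written < maxlen)
        {
            // In ISO 8859-1 every byte value equals its code point.
            dest[written++] = static_cast<unsigned char>(src[consumed++]);
        }
        *read = consumed;
        return written;
    }
};
```

A caller would pump each network buffer through Convert() in turn, using the bytes-read value to resume where the previous call stopped.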

Converting internal representation to network data

Similarly to the above example, the OutputConverter::CreateCharConverter() method is used to create an instance of the OutputConverter class. Since both it and InputConverter inherit from the CharConverter class, the Convert() interface is identical. The OutputConverter converts from the internal encoding (UTF-16) to the legacy encoding required for the network.

Converting text strings

Opera should normally only need to handle the internal representation (UTF-16), but there are some cases besides network communication where conversion into other formats is required. For this, the OpString class has APIs for converting strings to and from legacy data. If you do not wish to use OpString, the conversion classes can easily be used on regular nul-terminated C strings. Just remember that all length information passed to Convert() is counted in bytes, and that you must include the trailing nul character in the byte count if you want the output to be nul-terminated.

Including the trailing nul character in the count also ensures that the string is properly self-contained.
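
As a concrete illustration of the byte-count rule (assuming, as stated above, that Convert() takes lengths in bytes), the count to pass for a nul-terminated UTF-16 string is the number of code units plus one, multiplied by the code unit size:

```cpp
#include <cstddef>
#include <cstdint>

// Byte count to pass to a Convert()-style call so that the trailing nul
// is converted along with the text. Illustrative helper, not Opera API.
size_t ByteLengthWithNul(const uint16_t *str)
{
    size_t units = 0;
    while (str[units])
        ++units;
    return (units + 1) * sizeof(uint16_t); // +1 includes the trailing nul
}
```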

Detecting the encoding of network data

The CharsetDetector class is used to find out which encoding is used in a document. There are a couple of class methods that can be used to check for HTML meta tags, BOMs and similar. To perform actual encoding guessing, an instance of the class is created and the byte stream is pushed through the PeekAtBuffer() method. When all the data has been pushed, the GetDetectedCharset() method returns an encoding tag, or NULL if no encoding was detected.

The CharsetDetector class is light-weight and is usually allocated on the stack.
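
One of the checks mentioned above, BOM sniffing, can be sketched like this. The function name and return values are illustrative assumptions, not the actual CharsetDetector API; like GetDetectedCharset(), it returns an encoding tag, or a null pointer when nothing is recognised.

```cpp
#include <cstddef>
#include <cstring>

// Recognise the Unicode byte order marks at the start of a buffer.
// Returns an encoding tag, or nullptr when no BOM is present.
const char *DetectBOM(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "utf-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "utf-16be";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "utf-16le";
    return nullptr; // no BOM; fall back to heuristic guessing
}
```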

Using platform-dependent converters

For platforms with specific footprint requirements, you may want to re-implement the converter code. Please see the re-implementation documentation if you wish to do so.

Supported standards

Character encodings

Other specifications

Implementation and design

See also the API documentation.

Design principles

Streaming design

All converters need to handle streaming data, since conversion is done on buffers of network data of varying lengths, which are then sent for incremental parsing by the code that needs to process them. Because of this, every converter that works on any unit greater than a single byte is in essence a state machine.

The state needs to be preserved between runs, and data needs to be output as soon as possible. The only internal buffering that is done is for incomplete data, such as remembering the first half of a double-byte character, or the first few characters in an ISO 2022 escape sequence.
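
The "remembering the first half of a double-byte character" case can be sketched as follows, for a hypothetical double-byte encoding (any byte >= 0x80 is a lead byte). The class and encoding are illustrative assumptions; the point is that the only state kept between calls is the pending lead byte, so a character split across two network buffers still decodes correctly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal streaming state machine for a hypothetical double-byte
// encoding. Illustrative sketch, not an actual Opera converter.
class DoubleByteDecoderSketch
{
public:
    void Feed(const unsigned char *buf, size_t len,
              std::vector<uint16_t> &out)
    {
        for (size_t i = 0; i < len; ++i)
        {
            if (pending_lead)
            {
                // Second half arrived, possibly in a later buffer.
                out.push_back(
                    static_cast<uint16_t>((pending_lead << 8) | buf[i]));
                pending_lead = 0;
            }
            else if (buf[i] >= 0x80)
            {
                pending_lead = buf[i]; // remember first half until next call
            }
            else
            {
                out.push_back(buf[i]); // single-byte (ASCII) character
            }
        }
    }

private:
    unsigned char pending_lead = 0; // the entire inter-call state
};
```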

Optimizations

Since input conversion (network data → UTF-16) is performed most often, on every page read, these converters are optimized for speed. This means that some converters that probably could be unified into single converter classes are not, although the impact of this optimization has not generally been investigated.

The UTF-8 decoder is run especially often since Opera stores many of its external data files in this encoding, so it has been even further speed optimized, using loop unrolling and replacing some if tests with table lookups. The UTF-8 decoder is implemented in the unicode module.

Output conversion (UTF-16 → network data) is not performed as often, and is therefore generally optimized for size. To achieve this, several similar converters have been merged to share as much code as possible.

Data tables

The exact format of the data tables is documented in the i18ndata module documentation. In general, the data tables are optimized for size, but since forward conversion is done from a small character set (the legacy encoding) to a larger one (Unicode), these are generally direct look-up tables, which goes well with the speed optimization for input converters.

The output converters convert from a large character set (Unicode) to a smaller one (the legacy encoding), and having a direct mapping for all of Unicode would be extremely wasteful. These encodings thus use a mixture of direct look-up tables (where that makes sense) and paired mapping tables.
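
The two table shapes can be sketched with a toy four-entry mapping. The actual table layout is defined by the i18ndata module; this is only an assumption-level illustration of why forward lookup is a direct index while reverse lookup searches sorted (code point, byte) pairs.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Toy forward table (legacy byte -> Unicode), 4 entries; 0xFFFD = unmapped.
static const uint16_t forward_table[4] = {0x00C5, 0x00E5, 0x20AC, 0xFFFD};

// Toy reverse table (Unicode -> legacy byte), sorted by code point.
static const std::pair<uint16_t, unsigned char> reverse_table[3] = {
    {0x00C5, 0}, {0x00E5, 1}, {0x20AC, 2}};

// Forward: O(1) direct indexing (index must be < 4 in this toy table).
uint16_t ForwardLookup(unsigned char byte)
{
    return forward_table[byte];
}

// Reverse: binary search over the sorted pairs; -1 if unmappable.
int ReverseLookup(uint16_t codepoint)
{
    const auto *end = reverse_table + 3;
    const auto *it = std::lower_bound(
        reverse_table, end,
        std::make_pair(codepoint, static_cast<unsigned char>(0)));
    if (it != end && it->first == codepoint)
        return it->second;
    return -1;
}
```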

Generalisation and re-implementation

The CharConverter interface, and specifically the InputConverter and OutputConverter interfaces are defined in a generic way, making it possible for platforms that wish to perform conversion in another manner to re-implement them. This can be used on platforms where the platform libraries include adequate interfaces for conversion to decrease the footprint of Opera.

Initialisation

The initialisation of the module is handled through the Opera::InitL() API.

If you are using a file-based implementation of the table manager interface, the name of the file to use can be controlled using TWEAK_ENC_TABLE_FILE_NAME. The default file name is encoding.bin.

Memory management

Decoders and encoders

The decoders and encoders do not, with a few exceptions, allocate any data for themselves, but instead use the buffers that are passed in to the Convert() method, in addition to a few state variables in each of the objects. The actual conversion tables are allocated by the Table Manager, see below.

The first exception to this rule is the Big5HKSCStoUTF16Converter class, which does allocate a rather large data table the first time it needs to convert characters outside the regular Big5 range. Since the table is large and costly to generate, it is then kept in memory until Opera exits, except in builds where TWEAK_ENC_GENERATE_BIG_HKSCS_TABLE is disabled, in which case the converter falls back to a slower linear scan. The reason this table is not in the encoding.bin file is that it is large, sparsely populated and rarely used.

The second exception is the incoming converters for UTF-16 data, which examine the data they are handed and then delegate the conversion to a converter tailored for the byte order they are presented with (big or little endian).

The Table Manager

The default implementation of the OpTableManager interface, the FileTableManager, reads data from the encoding.bin file. To avoid throwing out tables only to re-read them shortly afterwards, an LRU queue of tables is implemented in the TableCacheManager class, from which it inherits. If the FileTableManager is unable to allocate a table, it simply returns a NULL pointer, and the converter object that requested the table runs with reduced functionality. See also the next section.
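
The LRU queueing behaviour described above can be sketched as follows. The class name TableCacheManager is from the text; the data structure below (a recency list plus an index) is an assumption about one reasonable implementation, not the actual code.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Sketch of an LRU queue of loaded tables: the least recently used
// table is evicted when the cache is full. Illustrative, not Opera code.
class TableLruSketch
{
public:
    explicit TableLruSketch(size_t capacity) : capacity_(capacity) {}

    // Returns true if the table was already cached (a hit), marking it
    // most recently used; on a miss, loads it, evicting if necessary.
    bool Touch(const std::string &name)
    {
        auto it = index_.find(name);
        if (it != index_.end())
        {
            lru_.splice(lru_.begin(), lru_, it->second); // move to front
            return true;
        }
        if (lru_.size() == capacity_)
        {
            index_.erase(lru_.back()); // evict least recently used
            lru_.pop_back();
        }
        lru_.push_front(name);
        index_[name] = lru_.begin();
        return false;
    }

private:
    size_t capacity_;
    std::list<std::string> lru_; // front = most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```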

The RomTableManager implementation does not itself allocate any memory, as it uses tables that are always available inside the program image. If reverse tables are omitted and/or compressed tables are used, the information in the next section applies, as the RomTableManager also inherits the TableCacheManager class.

Table Manager helpers

If enabled, the ReverseTableBuilder class can be used by TableCacheManager, which allows it to build tables at run-time. The tables it creates are subject to the same queueing rules as the on-disk tables, but the list of available tables, which is kept during the life time of the object, will grow for each new reverse table that is requested (usually one or two per outgoing character set conversion). While the reverse tables are being built, the ordinary (incoming) conversion tables need to be kept in memory at the same time as the generated reverse table.
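
What ReverseTableBuilder is described as doing, deriving a reverse (Unicode → legacy) table from the ordinary forward table at run-time, can be sketched like this. The function name and table shapes are illustrative assumptions; note that, as stated above, both tables are in memory while the build runs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Invert a forward (legacy byte -> Unicode) table into sorted
// (code point, byte) pairs suitable for binary search. Illustrative
// sketch, not the actual ReverseTableBuilder implementation.
std::vector<std::pair<uint16_t, unsigned char>>
BuildReverseTable(const uint16_t *forward, size_t size)
{
    std::vector<std::pair<uint16_t, unsigned char>> reverse;
    reverse.reserve(size);
    for (size_t i = 0; i < size; ++i)
        if (forward[i] != 0xFFFD) // skip unmapped slots
            reverse.emplace_back(forward[i],
                                 static_cast<unsigned char>(i));
    // Sort by code point so reverse lookups can binary-search.
    std::sort(reverse.begin(), reverse.end());
    return reverse;
}
```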

If enabled, the TableDecompressor class can be used by TableCacheManager to decompress a compressed table, stored either in a file or in ROM. During decompression, both the compressed and decompressed versions of the table are kept in memory.

The Charset Manager

CharsetManager starts up with the list of supported encodings gathered from the Table Manager (or from the hard-coded alias list if no Table Manager is used), and will then allocate new entries as new charset tags are added to it from the URLManager (and possibly others). There is a fixed upper limit to the number of tags it can remember; when this limit is reached, it starts trying to overwrite older entries that are no longer referenced.

The Charset Detector

CharsetDetector does not allocate memory, and can be allocated on the stack.

Stack usage

The Charset Manager contains a recursive call, but there is a check to ensure that it will only recurse one level.

Converters for endian-agnostic encodings will allocate an object for the specific endianness once it has been detected and delegate the conversion to it, adding to the stack depth of the conversion call.

Static memory usage

The global objects for Charset Manager, Table Manager (if enabled) and the HKSCS data table as described above are all handled as Opera Globals in the EncodingsModule object. These will live for the entire life-time of Opera, and will be destructed by Opera::Destroy().

OOM policies

The conversion code is written to degrade gracefully if the Table Manager runs out of memory while allocating the conversion tables (or if a conversion table is not available; the code mostly does not distinguish between the two cases). Most out-of-memory situations are handled similarly: if a function or method returns a pointer, it returns NULL when it runs out of memory or another error occurs.

There are also a few cases where the TRAP/LEAVE convention is used. This is mainly done where no reasonable action can be taken by the code itself if it runs out of memory, so handling the condition is left to the caller.

Performance

Encoders and decoders that are used often have been optimised for speed; this especially applies to conversion to and from UTF-8, which is used for external file storage. Otherwise, the code has mostly been written for a small footprint, with the most notable exceptions having alternate implementations selectable through tweaks.

See also

References