Copyright © 1999-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.
The Encodings module provides support for converting text between legacy character encodings and the Opera internal encoding (UTF-16). It also provides a framework for loading binary data tables from an external source (the encoding.bin file).
For information on generating the data files needed by the code in this module, and on which third-party features must be enabled when using them, please refer to the i18ndata module documentation.
The API documentation is extracted automatically by Doxygen.
An instance of the InputConverter class is retrieved using the InputConverter::CreateCharConverter() method. The byte stream coming in from the network is then pumped through the Convert() method, converting the input from the legacy encoding used in the network data to the internal representation (UTF-16). Helper functions in other modules facilitate this: the URL_DataDescriptor class will set up the proper encoding conversion automatically when instantiated with the correct parameters, but it is also possible to use the class directly.
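Used directly, a decoding step could look roughly like the sketch below. The exact signatures are not spelled out above, so the creation pattern (a status result with an out-parameter) and the Convert(src, len, dest, maxlen, &read) shape are assumptions for illustration; network_data and network_len are hypothetical placeholders for a chunk of bytes received from the network:

    InputConverter *converter = NULL;
    // Assumed creation pattern: status result plus out-parameter.
    RETURN_IF_ERROR(InputConverter::CreateCharConverter("iso-8859-1",
                                                        &converter));

    uni_char buffer[512];
    int bytes_read = 0;
    // Assumed Convert() shape: returns the number of bytes written to
    // 'buffer' and reports the bytes consumed through 'bytes_read'.
    int written = converter->Convert(network_data, network_len,
                                     buffer, sizeof buffer, &bytes_read);
    // Any input bytes beyond 'bytes_read' (e.g. half of a double-byte
    // character) must be resubmitted together with the next chunk.
    OP_DELETE(converter);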
Similarly to the above example, the OutputConverter::CreateCharConverter() method is used to create an instance of the OutputConverter class. Since both it and InputConverter inherit from the CharConverter class, the Convert() interface is identical. The OutputConverter converts from the internal encoding (UTF-16) to the legacy encoding required for the network.
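Under the same assumed signatures as in the previous sketch, the outgoing direction might look like this; utf16_data and utf16_len_in_bytes are hypothetical placeholders, and note that the length is a byte count even for UTF-16 input:

    OutputConverter *encoder = NULL;
    RETURN_IF_ERROR(OutputConverter::CreateCharConverter("iso-8859-1",
                                                         &encoder));

    char out[512];
    int bytes_read = 0;
    int written = encoder->Convert(utf16_data, utf16_len_in_bytes,
                                   out, sizeof out, &bytes_read);
    // 'out' now holds 'written' bytes in the legacy network encoding.
    OP_DELETE(encoder);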
Opera should normally only need to handle the internal representation (UTF-16), but there are some cases besides network communication where conversion into other formats is required. For this, the OpString class has APIs for converting strings to and from legacy data.
If you do not wish to use OpString, the conversion classes can easily be used on regular nul-terminated C strings. Just remember that all length information passed to Convert() is counted in bytes, and that you must include the trailing nul character in the byte count if you want the output to be nul-terminated.
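For example, assuming the Convert() shape used in the sketches above and the iso-8859-1 InputConverter from the first sketch, a whole C string can be converted in one call; op_strlen() stands in for plain strlen():

    const char *input = "caf\xE9";  // "café" in iso-8859-1, nul-terminated
    uni_char output[64];
    int bytes_read = 0;
    // Pass strlen + 1: the byte count includes the trailing nul, so the
    // UTF-16 output will be nul-terminated as well.
    int written = converter->Convert(input, (int) op_strlen(input) + 1,
                                     output, sizeof output, &bytes_read);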
The CharsetDetector class is used to find out which encoding is used in a document. There are a couple of class methods that can be used to check for HTML meta tags, BOMs and the like. To perform actual encoding guessing, an instance of the class is created, and the byte stream is pushed through the PeekAtBuffer() method. When all the data has been pushed, the GetDetectedCharset() method will return an encoding tag, or NULL if none was detected. The CharsetDetector class is light-weight and is usually allocated on the stack.
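In outline, and assuming a default-constructible detector (the chunked data source and the exact parameter types here are placeholders):

    CharsetDetector detector;            // light-weight, fine on the stack
    while (HaveMoreData())               // hypothetical data source
        detector.PeekAtBuffer(chunk, chunk_length);

    const char *charset = detector.GetDetectedCharset();
    if (!charset)
        charset = "windows-1252";        // nothing detected; pick a fallback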
For platforms with specific footprint requirements, you may want to re-implement the converter code. Please see the re-implementation documentation if you wish to do so.
See also the API documentation.
All converters need to handle streaming data, since conversion is done on buffers of network data of varying lengths, which are then sent for incremental parsing by the code that needs to process them. Because of this, every converter that works on any unit larger than a single byte is in essence a state machine.
The state needs to be preserved between runs, and data needs to be output as soon as possible. The only internal buffering that is done is for incomplete data, such as remembering the first half of a double-byte character, or the first few characters in an ISO 2022 escape sequence.
Since input conversion (network data → UTF-16) is performed most often, on every page read, these converters are optimized for speed. This means that some converters that probably could be unified into single converter classes are not, although the impact of this optimization has not generally been investigated.
The UTF-8 decoder is run especially often, since Opera stores many of its external data files in this encoding, so it has been optimized for speed even further, using loop unrolling and replacing some if tests with table lookups. The UTF-8 decoder is implemented in the unicode module.
Output conversion (UTF-16 → network data) is not performed as often, and because of this it is generally optimized for size. To achieve this, several similar converters have been merged so that they share as much code as possible.
The exact format of the data tables is documented in the i18ndata module documentation. In general, the data tables are optimized for size, but as forward conversion is done from a small character set (the legacy encoding) to a larger one (Unicode), these are generally direct look-up tables, which goes well with the speed optimization for input converters.
The output converters convert from a large character set (Unicode) to a smaller one (the legacy encoding), and having direct mappings for all of Unicode would be extremely wasteful. These encodings thus use a mixture of direct look-up tables (where that makes sense) and paired mapping tables.
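The paired-table idea can be pictured with a small, self-contained sketch; the entry layout and the lookup helper are invented here, and the real table format is the one described in the i18ndata module documentation:

    // A table of (unicode, legacy) pairs sorted on the Unicode code
    // point, searched with a binary search.
    struct MapEntry { unsigned short unicode; unsigned char legacy; };

    static unsigned char LookUp(const MapEntry *table, int entries,
                                unsigned short uc)
    {
        int lo = 0, hi = entries - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (table[mid].unicode == uc) return table[mid].legacy;
            if (table[mid].unicode <  uc) lo = mid + 1;
            else                          hi = mid - 1;
        }
        return '?'; // not representable; a real converter reports this
    }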
The CharConverter interface, and specifically the InputConverter and OutputConverter interfaces, are defined in a generic way, making it possible for platforms that wish to perform conversion in another manner to re-implement them. This can be used on platforms where the platform libraries include adequate interfaces for conversion, to decrease the footprint of Opera.
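A re-implementation could be shaped roughly as below; the Convert() signature is the same assumption as in the earlier sketches, any additional pure virtual methods the real interface may declare are ignored here, and a trivial Latin-1 decode stands in for the call into the platform conversion library:

    class PlatformInputConverter : public InputConverter
    {
    public:
        virtual int Convert(const void *src, int len, void *dest,
                            int maxlen, int *read)
        {
            const unsigned char *in =
                reinterpret_cast<const unsigned char *>(src);
            uni_char *out = reinterpret_cast<uni_char *>(dest);
            int count = len < maxlen / 2 ? len : maxlen / 2;
            for (int i = 0; i < count; ++ i)
                out[i] = in[i];  // platform library call goes here
            *read = count;       // bytes consumed from 'src'
            return count * 2;    // bytes written to 'dest'
        }
    };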
The initialisation of the module is handled through the
Opera::InitL() API.
If you are using a file-based implementation of the table manager
interface, the name of the file to use can be controlled using
TWEAK_ENC_TABLE_FILE_NAME.
The default file name is encoding.bin.
The decoders and encoders do not, with a few exceptions, allocate any
data for themselves, but instead use the buffers that are passed in to
the Convert() method, in addition to a few state variables
in each of the objects.
The actual conversion tables are allocated by the Table Manager; see below.
The first exception to this rule is the Big5HKSCStoUTF16Converter class, which does allocate a rather large data table the first time it needs to convert characters outside the regular Big5 range. Since the data is large and costly to generate, it is then kept in memory until Opera exits, except in builds where TWEAK_ENC_GENERATE_BIG_HKSCS_TABLE is disabled, in which case the converter falls back to a slower linear scan. The reason this table is not in the encoding.bin file is that it is large, sparsely populated and rarely used.
The second exception is the incoming converters for UTF-16 data, which will examine the data they are handed and then delegate the conversion to a converter tailored for the byte order they are presented with (big or little endian).
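The detection itself can be done with a plain BOM check plus a zero-byte heuristic; this helper is invented for illustration and is not the module's actual code:

    static bool LooksLikeBigEndianUTF16(const unsigned char *buf, int len)
    {
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return true;  // BE BOM
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return false; // LE BOM
        // No BOM: ASCII-range text has the zero byte first in big-endian.
        return len >= 2 && buf[0] == 0x00 && buf[1] != 0x00;
    }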
The default implementation of the OpTableManager interface, the FileTableManager, reads data from the encoding.bin file. To avoid throwing out tables just to have to re-read them shortly afterwards, an LRU queue of tables is implemented in the TableCacheManager class, from which it inherits. If the FileTableManager is unable to allocate a table, it will simply return a NULL pointer, and the converter object that requested the table will run with reduced functionality. See also the next section.
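The LRU behaviour can be pictured with a deliberately tiny, generic cache; none of these names exist in the module, and the real queue is the one implemented in TableCacheManager:

    #include <list>

    struct CachedTable { int id; const void *data; };

    class TinyTableCache
    {
        std::list<CachedTable> m_queue; // front = most recently used
    public:
        const void *Get(int id)
        {
            for (std::list<CachedTable>::iterator it = m_queue.begin();
                 it != m_queue.end(); ++ it)
                if (it->id == id)
                {
                    // Touch: move the entry to the front of the queue.
                    m_queue.splice(m_queue.begin(), m_queue, it);
                    return m_queue.front().data;
                }
            return 0; // caller loads the table and calls Put()
        }
        void Put(int id, const void *data)
        {
            CachedTable entry = { id, data };
            m_queue.push_front(entry);
        }
        void EvictOne()
        {
            if (!m_queue.empty())
                m_queue.pop_back(); // least recently used goes first
        }
    };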
The RomTableManager implementation does not itself allocate any memory, as it uses tables that are always available inside the program image. If reverse tables are omitted and/or compressed tables are used, the information in the next section applies, as the RomTableManager also inherits the TableCacheManager class.
If enabled, the ReverseTableBuilder class can be used by the TableCacheManager, which allows it to build tables at run-time. The tables it creates are subject to the same queueing rules as the on-disk tables, but the list of available tables, which is kept during the lifetime of the object, will grow for each new reverse table that is requested (usually one or two per outgoing character set conversion). While a reverse table is being built, the ordinary (incoming) conversion table needs to be kept in memory at the same time as the generated reverse table.
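In miniature, and with invented types, building a reverse table amounts to inverting the forward table and sorting on the Unicode value, so that an output converter can binary-search it as described earlier; note how the forward table must stay in memory for the duration:

    #include <algorithm>

    struct RevEntry { unsigned short unicode; unsigned char legacy; };

    static bool ByUnicode(const RevEntry &a, const RevEntry &b)
    {
        return a.unicode < b.unicode;
    }

    // forward[i] holds the Unicode code point for legacy byte i
    // (0 = unmapped); 'reverse' must have room for 'size' entries.
    static int BuildReverseTable(const unsigned short *forward, int size,
                                 RevEntry *reverse)
    {
        int used = 0;
        for (int i = 0; i < size; ++ i)
            if (forward[i])
            {
                reverse[used].unicode = forward[i];
                reverse[used].legacy  = (unsigned char) i;
                ++ used;
            }
        std::sort(reverse, reverse + used, ByUnicode);
        return used; // number of entries actually filled in
    }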
If enabled, the TableDecompressor class can be used by the TableCacheManager to decompress a compressed table, whether it is stored in a file or in ROM. During decompression, both the compressed and the decompressed version of the table are kept in memory.
CharsetManager starts up with the list of supported encodings gathered from the Table Manager (or from the hard-coded alias list if no Table Manager is used), and will then allocate new entries when new charset tags are added to it from the URLManager (and possibly others). There is a fixed upper limit to the number of tags it can remember; when this limit is reached, it will start trying to overwrite older entries that are no longer referenced.
CharsetDetector does not allocate memory, and can be allocated on the stack.
The Charset Manager contains a recursive call, but there is a check to ensure that it will only recurse one level.
Converters for endian-agnostic encodings will allocate an object for the specific endianness once it has been detected, and delegate the conversion to it, adding to the stack depth of the conversion call.
The global objects for the Charset Manager, the Table Manager (if enabled) and the HKSCS data table described above are all handled as Opera globals in the EncodingsModule object. These will live for the entire lifetime of Opera, and will be destructed by Opera::Destroy().
The conversion code is written so that it will degrade gracefully if the Table Manager runs out of memory while allocating the conversion tables (or if the conversion table is not available; the code mostly does not distinguish between the two cases).
Most out-of-memory situations are handled similarly: if a function or method is set to return a pointer, it will return NULL if it runs out of memory or if there is another error.
There are also a few cases where the TRAP/LEAVE convention is used. This is mainly done where no reasonable action can be taken by the code if it runs out of memory, so handling the condition is left to the caller.
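Callers therefore see two patterns, sketched here; the creation signature is the same assumption as in the earlier sketches, and SomeSetupFunctionL() is a hypothetical leaving function:

    InputConverter *converter = NULL;
    OP_STATUS rc = InputConverter::CreateCharConverter("utf-8", &converter);
    if (OpStatus::IsError(rc))
    {
        // Pointer convention: no converter was created; handle the
        // out-of-memory or unknown-charset condition here.
    }

    TRAPD(status, SomeSetupFunctionL()); // hypothetical leaving function
    if (OpStatus::IsMemoryError(status))
    {
        // TRAP/LEAVE convention: the callee left, and handling the
        // condition is up to the caller.
    }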
Encoders and decoders that are used often have been optimised for speed; this especially applies to conversion to and from UTF-8, which is used for external file storage. Otherwise, the code has mostly been written for small footprint; the most notable exceptions are code paths that have alternate implementations depending on tweaks.