Re-implementing encoding conversion support

Copyright © 1999-2012 Opera Software ASA. All rights reserved. This file is part of the Opera web browser. It may not be distributed under any circumstances.

Introduction

For platforms with specific footprint requirements, you may want to re-implement the converter code. Setting FEATURE_TABLEMANAGER to NO removes the code for the internal table-driven converters. A few algorithmic converters are still included; see "Supplied converters" below.

Factories

Disabling support for the table-driven converters also disables the default factories, so you will need to implement them for your platform. There are two factories: InputConverter::CreateCharConverter_real() creates instances of InputConverter, i.e. decoders, and OutputConverter::CreateCharConverter() creates instances of OutputConverter, i.e. encoders.

Special assumptions

The Opera core code assumes that when it requests a decoder for iso-8859-1, it receives one for windows-1252 instead, because several web sites are mislabeled. However, when it requests an encoder for iso-8859-1, it must receive a true iso-8859-1 encoder, not the windows-1252 variant.
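
The asymmetry between decoders and encoders can be handled in one place before the platform factory picks a converter. The following is a minimal sketch, assuming a helper of your own; ResolveCharset and ConverterKind are hypothetical names and are not part of the Opera API.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not an Opera API): maps the charset label that the
// core requested to the charset the platform factory should instantiate.
enum class ConverterKind { Decoder, Encoder };

std::string ResolveCharset(std::string label, ConverterKind kind)
{
    // Charset labels are matched case-insensitively.
    std::transform(label.begin(), label.end(), label.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Decoders only: treat iso-8859-1 as windows-1252 to cope with
    // mislabeled pages. Encoders must get the real iso-8859-1.
    if (kind == ConverterKind::Decoder && label == "iso-8859-1")
        return "windows-1252";
    return label;
}
```

A decoder factory would call ResolveCharset(label, ConverterKind::Decoder) and an encoder factory would pass ConverterKind::Encoder, so the substitution can never leak into the encoding path.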

The decoder for UTF-8 must support passing NULL as the destination buffer parameter, in which case it returns the number of bytes needed to perform the conversion. The Opera core implementation of UTF-8 does this, and using that implementation is recommended.
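
If you do write your own UTF-8 decoder, the NULL-destination behaviour amounts to running the same decoding loop without storing anything. The sketch below illustrates the idea with a free function; DecodeUtf8ToUtf16 and its signature are assumptions made for this example and do not match the real converter interface. It assumes the caller sizes the destination buffer from a first, counting-only call.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical function (not the Opera converter interface).
// Decodes UTF-8 into UTF-16 code units. When dest is null nothing is
// written; the return value is always the output size in bytes.
std::size_t DecodeUtf8ToUtf16(const unsigned char* src, std::size_t len, char16_t* dest)
{
    std::size_t units = 0; // UTF-16 code units produced so far
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = src[i++];
        std::uint32_t cp = 0xFFFD; // used as-is for invalid lead bytes
        std::uint32_t min = 0;     // smallest value allowed for this length
        std::size_t extra = 0;     // continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }

        for (std::size_t k = 0; k < extra; ++k) {
            if (i < len && (src[i] & 0xC0) == 0x80) {
                cp = (cp << 6) | (src[i++] & 0x3Fu);
            } else {
                cp = 0xFFFD; // truncated or broken sequence
                break;
            }
        }
        // Overlong forms, surrogates and out-of-range values become U+FFFD.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        if (cp >= 0x10000) {
            // Characters outside the BMP need a surrogate pair.
            if (dest) {
                dest[units]     = static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
                dest[units + 1] = static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
            }
            units += 2;
        } else {
            if (dest)
                dest[units] = static_cast<char16_t>(cp);
            units += 1;
        }
    }
    return units * sizeof(char16_t);
}
```

Calling the function once with a null destination and once with a buffer of the reported size yields the same return value both times, which is the property the core code relies on.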

Implemented quirks

See the Wiki page for various other quirks that are implemented in the Opera encodings support, together with the reasoning behind them. It should not be necessary to mimic most of those in re-implementations.

Implementing converters

For best results, the platform converters should be forgiving of input errors, as Opera will almost certainly encounter pages with misidentified encodings or garbage data. Opera converters never throw exceptions (leave) or give up when they encounter faulty data; they flag the conversion errors using the internal APIs and continue as if nothing happened. Since the converters themselves know what proper data looks like, they are much better placed to perform error recovery than the client code.
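
As an illustration of "flag and continue", the sketch below decodes strict US-ASCII, substitutes U+FFFD for any byte it cannot interpret, and records how many such bytes it saw. AsciiDecoder and its invalid_count member are hypothetical names chosen for this example; a real converter would report errors through the converter interface instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder (not an Opera class) showing error-tolerant decoding.
struct AsciiDecoder {
    int invalid_count = 0; // bytes that were not valid US-ASCII

    std::u16string Decode(const unsigned char* src, std::size_t len)
    {
        constexpr char16_t kReplacement = 0xFFFD; // U+FFFD REPLACEMENT CHARACTER
        std::u16string out;
        out.reserve(len);
        for (std::size_t i = 0; i < len; ++i) {
            if (src[i] < 0x80) {
                out.push_back(static_cast<char16_t>(src[i]));
            } else {
                // Flag the error, emit a replacement, and keep going.
                // No exception is thrown and the rest of the input is kept.
                ++invalid_count;
                out.push_back(kReplacement);
            }
        }
        return out;
    }
};
```

The caller always gets output of a predictable length and can inspect invalid_count afterwards to decide whether the encoding was likely misidentified.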

API requirements

Converters inherit from the generic CharConverter interface via the InputConverter and OutputConverter classes. Because their fields of use are slightly different, the set of additional APIs differs between them.

What Opera assumes from converters

To properly support FEATURE_USE_ENTITIES_IN_FORMS or API_ENC_UNCONVERTIBLE, your encoders need to identify missing code points and signal them using the proper APIs. For stateful encodings, you need to switch back to ASCII before outputting an entity. FEATURE_USE_ENTITIES_IN_FORMS can safely be turned off if you do not support it; using entities in forms is a non-standard extension. Take care to support the reporting of unconvertible characters, however, as API_ENC_UNCONVERTIBLE may be switched on indirectly by other features or tweaks. It is known to be imported by API_XMLUTILS_XMLTOSTRINGSERIALIZER and Opera Mail.
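
A rough sketch of what identifying and signalling missing code points can look like is given below for a stateless ISO 8859-1 encoder. Latin1Encoder, use_entities and unconvertible are hypothetical names for this example only; the actual reporting must go through the APIs that the OutputConverter interface defines.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical encoder (not the Opera OutputConverter interface).
struct Latin1Encoder {
    bool use_entities = false; // stand-in for the entities-in-forms behaviour
    int unconvertible = 0;     // characters with no ISO 8859-1 mapping

    std::string Encode(const std::u16string& in)
    {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            std::uint32_t cp = in[i];
            // Combine a surrogate pair so the entity names the real code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            if (cp <= 0xFF) {
                // ISO 8859-1 maps U+0000..U+00FF directly to single bytes.
                out.push_back(static_cast<char>(cp));
            } else {
                ++unconvertible; // always record the failure
                if (use_entities)
                    out += "&#" + std::to_string(cp) + ";"; // numeric entity
                else
                    out.push_back('?');
            }
        }
        return out;
    }
};
```

ISO 8859-1 is stateless, so the entity can be appended directly. In a stateful encoding such as ISO-2022-JP, the encoder would first have to emit the escape sequence that returns to ASCII, since the characters of the entity itself are ASCII.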

Supplied converters

Regardless of configuration, the encodings support always supplies converters to and from ISO 8859-1, UTF-8 and UTF-16. An implementation of UTF-7 is also included if API_ENC_UTF7 is enabled (note that UTF-7 is not used on web pages, but may still be used in email and other contexts). All of these conversions are purely algorithmic, and they are the recommended converters even when platform converters are used.