Data for character converters

The tables directory contains scripts used to produce binary conversion tables for string conversions, and some supplementary information. The scripts generate a binary file called encoding.bin, which is used by Opera through the OpTableManager code.

Generating the tables

Generating encoding.bin

There is currently no single script that generates an encoding.bin file. This is work in progress. For the time being, you will have to perform the following steps:

  1. Generate a chartables.bin file.
  2. Optional step: Remove reverse conversion tables and optionally compress the generated file by running mangle_tables.pike [--compress] chartables.bin.
  3. Convert the generated file to the new format:

If you use the mangle_tables.pike script as mentioned in step 2 above, you must set up the appropriate tweaks for enabling compression (TWEAK_ENC_COMPRESSED_TABLES) and for generating reverse tables dynamically (TWEAK_ENC_DYNAMIC_REV_TABLES).

You must also make sure to enable the relevant third-party features for data tables that you include. The relevant features are listed in the template tables file provided by this module.

If you are operating on big-endian data, replace chartables.bin with chartables-be.bin above. The encoding.bin file does contain a magic number/version tag/byte order marker.

Generating chartables.bin

To generate the chartables.bin file, you will need to have a working installation of Python, and run this command:

 construct path/to/tables-insertos.txt

which creates any other tables it needs in order to assemble the chartables.bin file. There is also a script called make.py which functions, for backward compatibility with older versions of the module, as an alias for construct -v (i.e. as above but producing verbose output). For a full list of the options supported by these commands, supply them with their -h flag; for expanded explanation of the meanings of options, use the --help flag.

The tables-*.txt files contain a list of the tables that should be included on your platform. Please refer to tables-all.txt for information on the format and on which tables are available. This is the file used by Core; you need to tell your build system which table file to use. The list of valid table names can in principle be inferred from the table function of the tables.py junction-box or, rather, from the table_class function (and, earlier, dictionary) that it employs.

Note that construct and make.py only regenerate tables when the files on which they depend have changed, unless you pass the --makeall (short-form: -a) flag on the command-line.
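To illustrate the idea of regenerating a table only when its inputs have changed, here is a minimal sketch that compares file modification times. It is not taken from construct or make.py; the function name needs_rebuild and the timestamp-based strategy are assumptions made for this example, and the real scripts may decide differently.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources.

    Hypothetical helper: a timestamp comparison is one common way to
    implement "only regenerate when dependencies have changed".
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)
```

Under this scheme, a flag such as --makeall would simply skip the check and rebuild everything.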

The table file needs to be in the byte order (endianness) that the target device expects. By default the script generates tables in the host machine's byte order; if the target differs, use the --big-endian or --little-endian flag to specify it.
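The following sketch shows what the byte-order choice means for the 16-bit values stored in the tables, and how one might decide whether an explicit flag is needed. The helper name endian_flag is hypothetical; only the --big-endian and --little-endian flags come from the scripts themselves.

```python
import struct
import sys


def endian_flag(target_byteorder):
    """Return the flag needed for the target, or None if it matches the host.

    target_byteorder is "big" or "little" (the values used by sys.byteorder).
    """
    if target_byteorder not in ("big", "little"):
        raise ValueError("byte order must be 'big' or 'little'")
    if target_byteorder == sys.byteorder:
        return None  # host order is the default, so no flag is required
    return "--" + target_byteorder + "-endian"


# The same 16-bit value (U+20AC) laid out in each byte order:
LITTLE = struct.pack("<H", 0x20AC)  # bytes AC 20
BIG = struct.pack(">H", 0x20AC)     # bytes 20 AC
```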

You also need to make sure you have the corresponding code in Opera for the encodings you wish to support; please see the list of supported encodings.

Table formats

Container file

Current file format

The encoding.bin file is the container file that is used with the current (as of core 2.9, September 2011) encodings module. The file is stored in the machine endian, with an endian marker. If the wrong endian is detected at run-time, the encodings module will refuse to load the file. The encodings module supports reading an opposite-endian file, but that has to be enabled at compile-time, and disables reading the machine endian format.

When using ROM tables, the format of the data is the same, and is made available through the symbol g_encodingtables, which must be a global function with the signature const unsigned int * g_encodingtables(), returning a pointer to a memory area holding a binary image of an encoding.bin file. The function signature is defined by the romtablemanager.h header file from the encodings module.

Format of header in encoding.bin
Repetitions   Size      Description
1             16 bits   Magic value, should be 0xFE01
1             16 bits   Size of entire header, in bytes
1             16 bits   Number of tables in the file
1             16 bits   Reserved for future expansions, should be 0
#tables + 1   32 bits   Offset to table, counted from the beginning of the file; the last offset should equal the file length
#tables       32 bits   Actual table lengths (after optional decompression)
#tables       16 bits   Offset of table name in the header
#tables       8 bits    Flags; all reserved bits must be set to zero. b7 = data length includes one byte of padding; b0 = table is compressed
*             8 bits    Names of tables (NUL-terminated ASCII)
0-1           8 bits    Optional padding to make the header an even number of bytes

Following the file header, the tables are stored as binary data according to the information contained in the header.
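The header layout of encoding.bin can be read with Python's struct module, as in the sketch below. It assumes that the fields are packed in the order listed with no additional alignment padding, that name offsets are counted from the start of the header, and that the magic value also serves to detect the byte order. The function and constant names are hypothetical and do not come from the encodings module.

```python
import struct

MAGIC = 0xFE01          # magic value from the header description
FLAG_COMPRESSED = 0x01  # b0: table is compressed
FLAG_PADDED = 0x80      # b7: data length includes one byte of padding


def parse_encoding_bin_header(data):
    """Return (byte_order, tables) for an encoding.bin image given as bytes."""
    # Assumption: trying both byte orders on the magic value tells us
    # which order the file was written in.
    for endian in ("<", ">"):
        magic, _size, count, _reserved = struct.unpack_from(endian + "4H", data, 0)
        if magic == MAGIC:
            break
    else:
        raise ValueError("magic value 0xFE01 not found")

    pos = 8  # four 16-bit fields precede the per-table arrays
    offsets = struct.unpack_from("%s%dI" % (endian, count + 1), data, pos)
    pos += 4 * (count + 1)
    lengths = struct.unpack_from("%s%dI" % (endian, count), data, pos)
    pos += 4 * count
    name_offsets = struct.unpack_from("%s%dH" % (endian, count), data, pos)
    pos += 2 * count
    flags = struct.unpack_from("%dB" % count, data, pos)

    tables = {}
    for i in range(count):
        name_end = data.index(b"\0", name_offsets[i])  # names are NUL-terminated
        name = data[name_offsets[i]:name_end].decode("ascii")
        tables[name] = {
            "offset": offsets[i],                         # from start of file
            "stored_size": offsets[i + 1] - offsets[i],   # bytes as stored
            "length": lengths[i],                         # after decompression
            "compressed": bool(flags[i] & FLAG_COMPRESSED),
            "padded": bool(flags[i] & FLAG_PADDED),
        }
    return ("little" if endian == "<" else "big"), tables
```

The stored size of each table is derived from consecutive offsets, which is why the header carries one more offset than there are tables.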

Legacy format

The chartables.bin (character tables) file is the container file that was used in all versions of Opera since version 6, until the introduction of the encoding.bin file as documented above. The file is stored in the machine endian, and since it does not contain any endian marker, it is called chartables-be.bin on big-endian systems. From Rosetta 4.1 of the encodings module and onwards, there is support for using an opposite-endian file, but this has to be enabled at compile-time, and disables reading the machine endian format.

Format of header in chartables.bin
Remark                    Size              Description
                          16 bits           Number of tables
Repeated for each table   32 bits           Size of table
Repeated for each table   8 bits            Length of name label
Repeated for each table   8 bits × length   Name of table (ASCII)
Repeated for each table   32 bits           Offset of table in file, counted from the end of the header

Following the file header, the tables are stored as binary data according to the information contained in the header.
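A reader for the legacy header might look like the sketch below. Because the file carries no endian marker, the caller has to state the byte order. The sketch assumes the per-table fields are packed back to back without alignment padding; the function name is hypothetical.

```python
import struct


def parse_chartables_header(data, big_endian=False):
    """Return {name: (absolute_offset, size)} for a chartables.bin image."""
    endian = ">" if big_endian else "<"
    (count,) = struct.unpack_from(endian + "H", data, 0)
    pos = 2
    entries = []
    for _ in range(count):
        # 32-bit table size followed by an 8-bit name length
        size, name_len = struct.unpack_from(endian + "IB", data, pos)
        pos += 5
        name = data[pos:pos + name_len].decode("ascii")
        pos += name_len
        (offset,) = struct.unpack_from(endian + "I", data, pos)
        pos += 4
        entries.append((name, size, offset))
    header_end = pos  # stored offsets are counted from here
    return {name: (header_end + offset, size) for name, size, offset in entries}
```

Note that absolute positions can only be computed after the whole header has been read, since the stored offsets are relative to its end.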

Current tables

There are several different tables in the container file. Below are descriptions of the format of each of them. There is also a list of tables required for each encoding.

Forward conversion table for ASCII-based single-byte encoding
128 16-bit values in machine endian describing the UTF-16 codepoints for values 0x80 through 0xFF of the input encoding. Undefined codepoints contain the value U+FFFD ("replacement character").
Tables: iso-8859-*, windows-*, ibm866, koi8-*, macintosh, x-mac-*
Forward conversion table for non-ASCII-based single-byte encoding
256 16-bit values in machine endian describing the UTF-16 codepoints for values 0x00 through 0xFF of the input encoding. Undefined codepoints contain the value U+FFFD ("replacement character").
Tables: viscii
Forward conversion table for double-byte encoding
A list of UTF-16 codepoints in machine endian for the entire double-byte code space of the encoding. The space is shrunk so that only the assigned double-byte codes are included in the list. What those codes are depends on the encoding. Undefined codepoints contain the value U+FFFD ("replacement character").
Tables: jis-0208, jis-0212, big5-table, ksc5601-table, gbk-table, cns11643-table
Two-way conversion table for GB18030
See comments in gbk.py's gbkOffsetTable class for documentation.
Tables: gb18030-table
Reverse conversion table for single-byte encoding
A pair table of UTF-16 codepoints (16-bit values) and the single-byte representation in the target encoding (8-bit values).

These tables will be built at run-time if they are missing and FEATURE_TABLEMANAGER_DYN_REV is enabled.

Tables: iso-8859-*-rev, windows-*-rev, ibm866-rev, koi8-*-rev, macintosh-rev, x-mac-*-rev, viscii-rev
Reverse conversion tables for double-byte encodings
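Reverse conversion tables for double-byte encodings are split into two tables: one contiguous block and a pair table for the rest (see the sparseTable helper in tableutils.py).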

These tables will be built at run-time if they are missing and FEATURE_TABLEMANAGER_DYN_REV is enabled.

Tables: *-rev-1, *-rev-2
Reverse conversion tables for EUC-TW
The table is split in two as above. Please see documentation in the UTF16toEUCTWConverter::Convert method for more information on the data format.

These tables will be built at run-time if they are missing and FEATURE_TABLEMANAGER_DYN_REV is enabled.

Tables: cns11643-rev-table-*
Reverse conversion tables for Big5-HKSCS
There are two tables, both in the pair format described above: one for plane 0 of Unicode (U+0000–U+FFFF) and one for plane 2 (U+20000–U+2FFFF). These tables are also used for forward conversion; they are "turned around" when needed (see the Big5HKSCStoUTF16Converter::GenerateHKSCSTable method for more information).
Tables: hkscs-plane-0, hkscs-plane-2
Big5-HKSCS compatibility mapping table
This is a pair table listing HKSCS compatibility mappings. The first 16-bit value is a machine endian representation of the DBCS code to remap (with the first byte in the high 8 bits and the second byte in the low 8 bits), and the second 16-bit value is a swapped Big5-HKSCS code to replace it with. The table looks like the other pair tables, except that the first value is not a UTF-16 codepoint.
Tables: hkscs-compat
Unicode block table
This is a list of Unicode blocks (subranges) used in the (TrueType) font switching code. The list contains sets of three values, the first is an 8-bit block number, the following two are 16-bit Unicode codepoints describing the low and high boundary of the block.
Tables: uniblocks
Character encodings table
This is a nul-separated list of encodings ("charsets") supported by Opera. The table is generated from the list of included conversion tables plus the encodings always supported by Opera.
Tables: charsets
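
As a rough illustration of the single-byte formats described above, the sketch below decodes bytes with a 128-entry forward table and derives a list of reverse pairs from it. The function names, the use of Python tuples for the pairs, and the sort by codepoint are assumptions made for this example; the exact binary layout of the real reverse tables is not specified here.

```python
import struct

REPLACEMENT = 0xFFFD  # marks undefined entries in forward tables


def load_forward_table(table_bytes, big_endian=False):
    """Unpack 128 16-bit values covering input bytes 0x80 through 0xFF."""
    return struct.unpack((">" if big_endian else "<") + "128H", table_bytes)


def decode_single_byte(raw, table_bytes, big_endian=False):
    """Decode an ASCII-based single-byte encoding to a Python string."""
    high = load_forward_table(table_bytes, big_endian)
    # Bytes below 0x80 are plain ASCII; the rest are looked up in the table.
    return "".join(chr(b) if b < 0x80 else chr(high[b - 0x80]) for b in raw)


def build_reverse_pairs(table_bytes, big_endian=False):
    """Derive (codepoint, byte) pairs from a forward table.

    Undefined entries are skipped. Sorting by codepoint is an assumption
    that would allow a binary search at lookup time.
    """
    high = load_forward_table(table_bytes, big_endian)
    pairs = [(cp, 0x80 + i) for i, cp in enumerate(high) if cp != REPLACEMENT]
    pairs.sort()
    return pairs
```

Deriving the reverse pairs from the forward table in this way mirrors the idea of building the reverse tables at run-time when they are missing from the container file.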

Relation between Microsoft codepages and character encodings

Some operating systems employ legacy encodings and produce output that is labelled as one standard but in reality follows another, so Opera applies some special handling of Microsoft (and IBM) code pages. The table below describes the code pages recognized by Opera.

Code page   File         Name              Note
866         cp866.txt    MS-DOS Cyrillic   Supported as ibm866
874         cp874.txt    Thai              Superset of ISO 8859-11; we use this table for both iso-8859-11 and windows-874
932         cp932.txt    JIS 0208 (SJIS)   We use this instead for shift_jis (together with the table from Unicode); table also used for other Japanese encodings
936         cp936.txt    GBK               We use the GB 18030 table, also for windows-936
949         cp949.txt    Korean            We use this for euc-kr (instead of the tables from Unicode, ksc5601.txt and ksx1001.txt, which are incomplete)
950         cp950.txt    Big5              Was the base for the big5-2003 table, which is used together with the table from Unicode
1250        cp1250.txt   Central Europe    Supported as windows-1250
1251        cp1251.txt   Cyrillic          Supported as windows-1251
1252        cp1252.txt   Latin I           Overrides iso-8859-1 for input (hardcoded in the encodings module)
1253        cp1253.txt   Greek             Supported as windows-1253
1254        cp1254.txt   Turkish           Supported as windows-1254
1255        cp1255.txt   Hebrew            Supported as windows-1255
1256        cp1256.txt   Arabic            Supported as windows-1256
1257        cp1257.txt   Baltic            Supported as windows-1257
1258        cp1258.txt   Vietnam           Supported as windows-1258

For reference information about Microsoft code pages, see the Go Global Development Central.

Sub-directories

The scripts generate their table files (aside from the final chartables.bin) in subdirectories called plain-* or, when imode is used, imode-*, with * being either be or le according to whether the tables are in big-endian or little-endian order, respectively. Aside from these, you will find:

sources/
This directory contains the sources for the mapping tables.
utilities/
This directory contains some utility scripts to inspect the mappings, and more. They are not used by the scripts in this directory and can be ignored.

Scripts

The scripts which transform source data into the (binary) tables are contained in the tables subdirectory. Most scripts contain python doc-strings in strategic places, notably the file header. Where derived classes and their methods lack documentation, check the base class, whose documentation they typically presume. For convenience, the scripts are here described in two groups. There is also a more technical description available.

Primary scripts and Utilities

There are two primary script files and two high-level driver modules for the table-building process:

chrtblgen.py

Provides the function main which parses the command-line and invokes suitable Platform methods to read the tables file and emit the chartables file. Can also be used as a stand-alone script by using it as first argument to python, followed by desired further arguments.

make.py

Thin application built on chrtblgen.py for backwards compatibility with old versions of this module.

platform.py

Provides the Platform class, which supports parsing of the table-list file and drives the process of table-generation.

tables.py

Provides the junction-box which maps a table-name to a python class which implements the necessary source-file reader and table-file builder. Contains from … import … statements which indicate which of the other modules provide which implementation class.

There are several utility modules employed by the per-table scripts:

basetable.py

Provides two base-classes, Table and textTable. The former defines the generic API for tables, as driven by the Platform class of platform.py; the latter augments this with a structure for parsing the line-oriented sources/*.txt files which are used by most encodings.

checker.py

Provides experimental checker infrastructure which can be deployed by using the --check-sources flag on the command-line. This produces warnings if source files' comments describing characters are not consistent with one another.

tablefile.py

Provides the utility class tableFile, whose instances represent the binary files for individual tables. In particular, this class is where endianness issues are handled; other modules should not need to use python's built-in struct.pack function on data.

Also provides the function describe which transcribes all the individual tables into a single chartables.bin file.

tableutils.py

Provides a miscellany of utilities employed by the classes which implement the tables for particular encodings.

unhex(text) -> number
Converts a number from a hexadecimal string to an integer
NON_UNICODE
symbolic integer constant which is not a Unicode codepoint; used as a dummy to mark unused entries in tables.
twoDict
Mimic for a python dictionary which stores both forward and backward mappings; useful when source data may map more than one source form to a single codepoint. Presently only used by big5.
hexTable
Provides a parse method for the common case of a textTable whose source file has its encoding and the matching codepoint, in hex, as the first two words on each non-comment line.
byteTable
Refines textTable for the common case of single-byte tables.
sparseTable
Helper (mixin) class for tables whose entries all lie in some modest number of blocks within the entire space of possible codepoints. These are the tables whose reverse-table is split into one contiguous block and a pair-table for the rest.
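
To make the descriptions of unhex and hexTable more concrete, here is a minimal sketch of how a line-oriented hex source file could be parsed. It is not the code from tableutils.py: parse_hex_line is a hypothetical helper, and treating '#' as the comment marker is an assumption based on the usual style of mapping files.

```python
def unhex(text):
    """Convert a hexadecimal string such as '0x20AC' or '20AC' to an integer."""
    return int(text, 16)


def parse_hex_line(line):
    """Return (encoded_value, codepoint) for a data line, else None.

    The first two words of a non-comment line are taken to be the
    encoded form and the matching Unicode codepoint, both in hex.
    """
    words = line.split("#", 1)[0].split()  # drop any trailing comment
    if len(words) < 2:
        return None  # blank line, comment-only line, or unmapped entry
    return unhex(words[0]), unhex(words[1])
```

For example, a line of the form "0x80 0x20AC # EURO SIGN" would yield the pair (0x80, 0x20AC), while a line consisting only of a comment yields None.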

Scripts to process particular tables

These likewise break into two groups; first the meta-tables:

case.py

Implements the case-transformation tables, uni_lower and uni_upper, via the class caseTable.

charsetlist.py

Implements the pseudo-table which lists all supported encodings.

unibits.py

Provides classes for the Unicode block table and the bidi mirroring table. (There is no particular reason for these to share a file; accident of history.)

Finally, the true character tables:

big5.py

Provides for the Big5 table.

cns.py

Provides for the CNS 11643 table.

gbk.py

Provides for the GBK table and the GBK offset table, by parsing gb-18030-2000.xml directly.

hkscs.py

Provides for the Hong Kong Supplementary Character Set (HKSCS) table and the associated Big5 compatibility table. (The latter might profitably be moved to big5.py.)

jis0208.py

Provides for the JIS 0208 table; includes its imode contribution by parsing imode-emoji.html directly.

jis0212.py

Provides for the JIS 0212 table directly.

ksc.py

Provides for the KS X 1001:1992 (a.k.a. KSC 5601) table.

sbcs.py

Provides for the single-byte character sets. Most of these are catered to by class sbcsTable; the VISCII table, however, requires its own special-case class derived from it.

Responsible

For more information, contact