The tables directory
contains scripts used to produce binary conversion
tables for string conversions, and some supplementary information.
The scripts generate a binary file called encoding.bin, which is
used by Opera through the OpTableManager code.
encoding.bin
There is currently no single script that generates an encoding.bin
file. This is work in progress.
For the time being, you will have to perform the following steps:
1. Generate a chartables.bin
file.
2. Run mangle_tables.pike [--compress] chartables.bin.
If you use the mangle_tables.pike script as mentioned in step 2
above, you must set up the appropriate
tweaks
for enabling compression (TWEAK_ENC_COMPRESSED_TABLES) and
for generating reverse tables dynamically
(TWEAK_ENC_DYNAMIC_REV_TABLES).
You must also make sure to enable the relevant third-party features for data tables that you include. The relevant features are listed in the template tables file provided by this module.
If you are operating on big-endian data, replace
chartables.bin with chartables-be.bin above.
The encoding.bin file does
contain a magic number/version tag/byte
order marker.
chartables.bin
To generate the chartables.bin file, you will need to have a working
installation of Python, and run this command:
construct path/to/tables-insertos.txt
which creates any other tables it needs in order to assemble the
chartables.bin file. There is also a script called
make.py which functions, for backward compatibility with older
versions of the module, as an alias for construct -v (i.e. as
above but producing verbose output). For a full list of the options supported
by these commands, supply them with their -h flag; for expanded
explanation of the meanings of options, use the --help flag.
The tables-*.txt files each contain a list of the tables that should be
included on your platform. Please refer to
tables-all.txt
for information on the file format and on which tables are available. That
is the file used by Core; you need to tell your build
system which table file to use. The
list of valid table names can in principle be inferred from the
table function of the tables.py junction-box
or, rather, from the table_class function (and, earlier,
dictionary) that it employs.
Note that construct and make.py only regenerate tables when the files
on which they depend have changed, unless you pass the --makeall
(short-form: -a) flag on the command-line.
The table file must use the byte order that the target device expects.
By default the script generates tables in the host machine's byte order; if the
target differs, use the --big-endian or
--little-endian flag to specify it.
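As a rough illustration of the rule above, the sketch below assembles an argument list for construct and only adds an endian flag when the target differs from the host. The helper name construct_command and the placement of options before the file name are assumptions made for this example; only the flags themselves (-v, --makeall, --big-endian, --little-endian) come from the text.

```python
import sys


def construct_command(tables_file, target_endian=None, verbose=False, make_all=False):
    """Assemble an argument list for the construct script (hypothetical helper).

    target_endian is "big", "little" or None (meaning: same as the host).
    """
    if target_endian not in (None, "big", "little"):
        raise ValueError("target_endian must be 'big', 'little' or None")
    args = ["construct"]
    if verbose:
        args.append("-v")
    if make_all:
        args.append("--makeall")
    # The script defaults to the host byte order, so a flag is only
    # needed when the target device uses the other one.
    if target_endian is not None and target_endian != sys.byteorder:
        args.append("--%s-endian" % target_endian)
    args.append(tables_file)  # assumed: options precede the file name
    return args
```

The returned list could be handed to subprocess.run, prefixed with the Python interpreter if the script is not directly executable on your system.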
You also need to make sure you have the corresponding code in Opera for the encodings you wish to support; please see the list of supported encodings.
The encoding.bin file is the container file that is used with
the current (as of core 2.9, September 2011) encodings module.
The file is stored in the machine endian, with an endian marker.
If the wrong endian is detected at run-time, the encodings module will
refuse to load the file.
The encodings module supports reading an opposite-endian file, but that has to
be enabled at compile-time, and disables reading the machine endian format.
When using ROM tables, the format of the data is the same, and is made
available through the symbol g_encodingtables, which
must be a global function with the signature const unsigned int *
g_encodingtables(), returning a pointer pointing
to a memory area with a binary image of an encoding.bin file.
The function signature is defined by the romtablemanager.h
header file from the
encodings
module.
| Repetitions | Size | Description |
|---|---|---|
| 1 | 16 bits | Magic value, should be 0xFE01 |
| 1 | 16 bits | Size of entire header, in bytes |
| 1 | 16 bits | Number of tables in the file |
| 1 | 16 bits | Reserved for future expansions, should be 0 |
| #tables + 1 | 32 bits | Offset to table, counted from the beginning of the file, last offset should equal file length |
| #tables | 32 bits | Actual table lengths (after optional decompression) |
| #tables | 16 bits | Offset of table name in the header |
| #tables | 8 bits | Flags, all reserved bits must be set to zero |
| | | b7 = Data length includes a one byte padding |
| | | b0 = Table is compressed |
| * | 8 bits | Name of tables (NUL-terminated ASCII) |
| 0-1 | 8 bits | Optional padding to make the header an even number of bytes |
Following the file header, the tables are stored as binary data according to the information contained in the header.
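The header layout in the table above can be read with a few struct calls. The following is a minimal sketch for an inspection tool, not the code used by the encodings module. It assumes that name offsets are counted from the start of the header, and it accepts either byte order by checking how the magic value 0xFE01 reads (the module itself, as described above, is stricter). The function name parse_encoding_bin_header and the dictionary keys are made up for this example.

```python
import struct

MAGIC = 0xFE01


def parse_encoding_bin_header(data):
    """Parse the header of an encoding.bin image held in a bytes object."""
    # The magic value doubles as the endian marker: read the wrong way
    # round, 0xFE01 comes out as 0x01FE.
    for order in ("<", ">"):
        magic, header_size, count, _reserved = struct.unpack_from(order + "4H", data, 0)
        if magic == MAGIC:
            break
    else:
        raise ValueError("not an encoding.bin image (bad magic value)")
    pos = 8
    offsets = struct.unpack_from("%s%dI" % (order, count + 1), data, pos)
    pos += 4 * (count + 1)
    lengths = struct.unpack_from("%s%dI" % (order, count), data, pos)
    pos += 4 * count
    name_offsets = struct.unpack_from("%s%dH" % (order, count), data, pos)
    pos += 2 * count
    flags = struct.unpack_from("%dB" % count, data, pos)
    tables = []
    for i in range(count):
        start = name_offsets[i]  # assumed: relative to the start of the header
        end = data.index(b"\0", start)
        tables.append({
            "name": data[start:end].decode("ascii"),
            "offset": offsets[i],
            "stored_size": offsets[i + 1] - offsets[i],  # size as stored in the file
            "length": lengths[i],                        # size after decompression
            "compressed": bool(flags[i] & 0x01),         # b0
            "padded": bool(flags[i] & 0x80),             # b7
        })
    return order, header_size, tables
```

Because the offset list has one more entry than there are tables, the stored size of each table falls out as the difference between consecutive offsets.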
The chartables.bin (character tables) file is the
container file that was used in all versions of Opera since version 6,
until the introduction of the encoding.bin file as
documented above.
The file is stored in the machine endian, and since it does not contain any
endian marker, it is called chartables-be.bin on big-endian
systems.
From Rosetta 4.1 of the encodings module and onwards, there is support for
using an opposite-endian file, but this has to be enabled at compile-time,
and disables reading the machine endian format.
| Remark | Size | Description |
|---|---|---|
| | 16 bits | Number of tables |
| Repeated for each table | 32 bits | Size of table |
| Repeated for each table | 8 bits | Length of name label |
| Repeated for each table | 8 bits × length | Name of table (ASCII) |
| Repeated for each table | 32 bits | Offset of table in file, counted from the end of the header |
Following the file header, the tables are stored as binary data according to the information contained in the header.
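For comparison, here is a sketch of reading the older chartables.bin header. It assumes the per-table fields are packed back to back with no alignment padding, which the table above does not state explicitly. Since the file carries no endian marker, the caller has to supply the byte order. The function name parse_chartables_header is hypothetical.

```python
import struct


def parse_chartables_header(data, byteorder="<"):
    """Parse a chartables.bin header; byteorder is '<' (little) or '>' (big)."""
    (count,) = struct.unpack_from(byteorder + "H", data, 0)
    pos = 2
    entries = []
    for _ in range(count):
        # 32-bit table size followed by the 8-bit length of the name label.
        size, name_len = struct.unpack_from(byteorder + "IB", data, pos)
        pos += 5
        name = data[pos:pos + name_len].decode("ascii")
        pos += name_len
        (offset,) = struct.unpack_from(byteorder + "I", data, pos)
        pos += 4
        entries.append((name, size, offset))
    # pos now marks the end of the header, which is the base for all offsets.
    return [{"name": n, "size": s, "start": pos + o} for n, s, o in entries]
```

The "start" value is an absolute position in the file, so a table's bytes are data[start:start + size].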
There are several different tables in the container file. Below are descriptions on the format of all of them. There is also a list of tables required for each encoding.
iso-8859-*,
windows-*,
ibm866,
koi8-*,
macintosh,
x-mac-*
viscii,
jis-0208,
jis-0212,
big5-table,
ksc5601-table,
gbk-table,
cns11643-table
gbk.py's
gbkOffsetTable class for documentation.
gb18030-table
These tables will be built at run-time if they are missing and
FEATURE_TABLEMANAGER_DYN_REV is enabled.
iso-8859-*-rev,
windows-*-rev,
ibm866-rev,
koi8-*-rev,
macintosh-rev,
x-mac-*-rev
viscii-rev,
-rev-1) contains a list of
double-byte codes from a UTF-16 base. For
historical reasons, the row-cell data is swapped so that the cell value
(second byte in output) comes first. Undefined codepoints contain two
nul bytes.
-rev-2) contains a pair table
much like the single-byte reverse table,
but which stores two 16-bit values, first the UTF-16 codepoint and then
the row-cell data (swapped just like above).
These tables will be built at run-time if they are missing and
FEATURE_TABLEMANAGER_DYN_REV is enabled.
*-rev-1,
*-rev-2
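The first reverse table format described above can be illustrated with a small lookup. This is only a sketch under assumptions: it treats the table as a flat array of two-byte entries indexed by the distance from a base codepoint, and the base parameter and the function name lookup_rev1 are invented for the example. The swapped cell-first storage and the two-nul-byte marker for undefined codepoints follow the description.

```python
def lookup_rev1(table, base, codepoint):
    """Return the two output bytes (row, cell) for codepoint, or None if undefined.

    table is a bytes object holding two bytes per codepoint, starting at
    the (assumed) base codepoint.
    """
    index = codepoint - base
    if index < 0 or 2 * index + 2 > len(table):
        return None  # outside the range covered by this table
    # Stored swapped: the cell value comes first, then the row value.
    cell, row = table[2 * index], table[2 * index + 1]
    if row == 0 and cell == 0:
        return None  # two nul bytes mark an undefined codepoint
    return bytes((row, cell))
```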
UTF16toEUCTWConverter::Convert method for more information on
the data format.
These tables will be built at run-time if they are missing and
FEATURE_TABLEMANAGER_DYN_REV is enabled.
cns11643-rev-table-*
Big5HKSCStoUTF16Converter::GenerateHKSCSTable method
for more information).
hkscs-plane-0,
hkscs-plane-2,
hkscs-compat
uniblocks
charsets
Some operating systems employ legacy encodings and produce output that is labelled as one standard but is in reality another, so Microsoft (and IBM) code pages receive some special handling. The table below describes the code pages recognized by Opera.
| Code page | File | Name | Note |
|---|---|---|---|
| 866 | cp866.txt | MS-DOS Cyrillic | Supported as |
| 874 | cp874.txt | Thai | Superset of ISO 8859-11; we use this table for both and |
| 932 | cp932.txt | JIS 0208 (SJIS) | We use this instead for (together with the table from Unicode); table also used for other Japanese encodings |
| 936 | cp936.txt | GBK | We use the GB 18030 table, also for |
| 949 | cp949.txt | Korean | We use this for (instead of the tables from Unicode, ksc5601.txt and ksx1001.txt, which are incomplete) |
| 950 | cp950.txt | Big5 | Was the base for the big5-2003 table, which is used together with the table from Unicode |
| 1250 | cp1250.txt | Central Europe | Supported as |
| 1251 | cp1251.txt | Cyrillic | Supported as |
| 1252 | cp1252.txt | Latin I | Overrides for input (hardcoded in the encodings module) |
| 1253 | cp1253.txt | Greek | Supported as |
| 1254 | cp1254.txt | Turkish | Supported as |
| 1255 | cp1255.txt | Hebrew | Supported as |
| 1256 | cp1256.txt | Arabic | Supported as |
| 1257 | cp1257.txt | Baltic | Supported as |
| 1258 | cp1258.txt | Vietnam | Supported as |
For reference information about Microsoft code pages, see the Go Global Development Central.
The scripts generate their table files (aside from the final
chartables.bin) in subdirectories called plain-* or,
when imode is used, imode-*, with * being either
be or le according to whether the tables are in big-endian or
little-endian order, respectively.
Aside from these, you will find:
The scripts which transform source data into the (binary) tables are contained
in the tables subdirectory.
Most scripts contain Python docstrings in strategic places, notably the file
header.
Where derived classes and their methods lack documentation, check the base
class, whose documentation they typically presume.
For convenience, the scripts are here described in two groups.
There is also a more
technical description available.
There are two primary script files and two high-level driver modules for the table-building process:
chrtblgen.py
Provides the function main which parses the command-line and invokes
suitable Platform methods to read the tables file and emit the
chartables file.
Can also be used as a stand-alone script by using it as first argument to
python, followed by desired further arguments.
make.py
Thin application built on chrtblgen.py for backwards compatibility
with old versions of this module.
platform.py
Provides the Platform class, which supports parsing of the
table-list file and drives the process of table-generation.
tables.py
Provides the junction-box which maps a table-name to a python class which
implements the necessary source-file reader and table-file builder.
Contains from … import … statements which
indicate which of the other modules provide which implementation class.
There are several utility modules employed by the per-table scripts:
basetable.py
Provides two base-classes, Table and textTable.
The former defines the generic API for tables, as driven by the
Platform class of platform.py; the latter augments
this with a structure for parsing the line-oriented sources/*.txt
files which are used by most encodings.
checker.py
Provides experimental checker infrastructure which can be deployed by using the
--check-sources flag on the command-line.
This produces warnings if source files' comments describing characters are not
consistent with one another.
tablefile.py
Provides the utility class tableFile, whose instances represent the
binary files for individual tables.
In particular, this class is where endianness issues are handled; other modules
should not need to use python's built-in struct.pack
function on data.
Also provides the function describe which transcribes all the
individual tables into a single chartables.bin file.
tableutils.py
Provides a miscellany of utilities employed by the classes which implement the tables for particular encodings.
unhex(text) -> number
NON_UNICODE
twoDict
big5.
hexTable
parse method for the common case of a
textTable whose source file has its encoding and the matching
codepoint, in hex, as the first two words on each non-comment line.
byteTable
textTable for the common case of single-byte tables.
sparseTable
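To make the source-file convention mentioned for hexTable concrete, the sketch below parses lines whose first two words are hex numbers. It is not the module's own code: the unhex shown here is a stand-in with the same signature as the utility listed above, parse_hex_lines is a made-up name, and treating '#' as the comment character is an assumption about the source files.

```python
def unhex(text):
    """Stand-in for the listed utility: hex text ('0x8140' or '8140') -> number."""
    return int(text, 16)


def parse_hex_lines(lines):
    """Map encoded value -> codepoint from an iterable of source lines."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # assumed: '#' starts a comment
        if not line:
            continue  # blank or comment-only line
        words = line.split()
        if len(words) < 2:
            continue  # no codepoint given for this encoded value
        mapping[unhex(words[0])] = unhex(words[1])
    return mapping
```

For instance, a line such as "0x8140 0x3000" would yield the entry 0x8140 mapped to 0x3000, while comment lines and lines with only one word are skipped.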
These likewise break into two groups; first the meta-tables:
case.py
Implements the case-transformation tables, uni_lower and
uni_upper, via the class caseTable.
charsetlist.py
Implements the pseudo-table which lists all supported encodings.
unibits.py
Provides classes for the Unicode block table and the bidi
mirroring table.
(There is no particular reason for these to share a file; accident of history.)
Finally, the true character tables:
big5.py
Provides for the Big5 table.
cns.py
Provides for the CNS 11643 table.
gbk.py
Provides for the GBK table and the GBK offset table, by parsing gb-18030-2000.xml directly.
hkscs.py
Provides for the Hong Kong Supplementary Character Set (HKSCS) table and the
associated Big5 compatibility table.
(The latter might profitably be moved to big5.py.)
jis0208.py
Provides for the JIS 0208 table; includes its imode contribution by parsing imode-emoji.html directly.
jis0212.py
Provides for the JIS 0212 table directly.
ksc.py
Provides for the KS X 1001:1992 (a.k.a. KSC 5601) table.
sbcs.py
Provides for the single-byte character sets.
Most of these are catered to by class sbcsTable; the
VISCII table, however, requires its own special case subclassed from it.