Instead of storing the strings for the names of elements and attributes, Opera uses numeric codes that map to string tables with the canonical names. The tables used for names and type codes for elements and attributes are generated at build time by the script modules/logdoc/scripts/mkmarkup.py. The generated tables are based on the information found in files called module.markup, which can be put in any module in the tree. The generated files combine all the element and attribute names found in all the module.markup files.
The generated files reside in the modules/logdoc/src/html5 folder. The following files are generated:
In addition, two files in the same folder, elementhashbase.h and attrhashbase.h, are generated by the same script, but usually not during building.
Holds all flattened and mixed case strings of all the elements. Each entry that has a mixed case representation also has a namespace (Markup::Ns) field indicating when to use the mixed case version.
The table is a one-dimensional array since the length of the entries can differ quite a lot. The g_html5_tag_indices table gives the offset into the array for each type.
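As a rough illustration of the layout described above, the following Python sketch packs names of different lengths into one flat string and keeps a separate offset table, in the spirit of g_html5_tag_indices. The helper names pack_names and name_at, and the use of a NUL terminator, are assumptions made for this example; the generated C++ tables may be laid out differently.

```python
def pack_names(names):
    """Pack names into one flat string plus an offset table.

    Each name is terminated by a NUL character (an assumption made for
    this sketch) so an entry can be read back from its offset alone.
    """
    flat = []
    offsets = []
    position = 0
    for name in names:
        offsets.append(position)
        flat.append(name + "\0")
        position += len(name) + 1
    return "".join(flat), offsets


def name_at(flat, offsets, type_code):
    """Return the name stored for a numeric type code."""
    start = offsets[type_code]
    end = flat.index("\0", start)
    return flat[start:end]


# Illustrative input only; the real tables hold every known name.
flat, offsets = pack_names(["a", "blockquote", "div"])
```

A flat layout like this avoids padding every entry to the length of the longest name, at the cost of one extra lookup in the offset table.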
Holds the uppercased versions of all the element names. Used by some DOM functions.
Holds all flattened and mixed case strings of all the attributes. Each entry that has a mixed case representation also has a namespace (Markup::Ns) field indicating when to use the mixed case version.
The table is a one-dimensional array since the length of the entries can differ quite a lot. The g_html5_attr_indices table gives the offset into the array for each type.
Holds the uppercased versions of all the attribute names. Used by some DOM functions.
The Markup class acts as a namespace for the element and attribute type constants. It contains enums for element codes (Markup::Type), attribute codes (Markup::AttrType) and namespace constants (Markup::Ns) used by the HTML 5 parser.
A long enum representing all the normal elements and the special ones. Entries are categorized as real elements (with a string representation), special elements (without a string representation), placeholders (for keeping the values increasing monotonically) and delimiters.
Real elements will have the format prefix + 'E_' + name, where prefix is to
indicate which namespace it usually belongs to, like HT for HTML or SVG for
SVG, and name is the uppercased name of the element name string it
represents. Example: HTE_SPAN for the span element.
Special elements do not have a normal string representation of the name (like HTE_TEXTGROUP), even though some of them can appear in markup (like HTE_DOCTYPE). They have the same format as the real elements, except that the name part does not match any string representation. They are all placed after the HTE_LAST entry in the enum. Some of the special elements are even more special in that they are sometimes considered to be real elements for layout purposes (like HTE_TEXT), and they are placed between HTE_LAST and HTE_FIRST_SPECIAL. Sounds confusing? It is ;-)
Placeholders are inserted after the first of several consecutive entries that share the same string. If
more than one namespace has the same string, an entry for each prefix is added to
the enum, but they all have the same numerical value as the first entry. So that the next
unique entry gets the value immediately following that shared value, it is assigned the value of
the placeholder. The format for placeholder entries is next + '__PLACEHOLDER',
where next is the name of the next unique entry (so that it is easy to see which element
it really is in a debugger).
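The numbering scheme can be sketched in Python as follows. This is one possible reading of the description above, not the logic of mkmarkup.py itself: the function name assign_codes, the input format and the sample entries are all assumptions made for illustration.

```python
def assign_codes(entries, first_value=0):
    """Assign numeric codes to (enum_name, string) pairs.

    Consecutive entries with the same string share the value of the
    first one. A placeholder named after the next unique entry is
    emitted right after that first entry and reserves the next value.
    """
    result = []  # (enum_name, value) pairs in emission order
    value = first_value
    i = 0
    while i < len(entries):
        name, string = entries[i]
        # Find the run of consecutive entries that share this string.
        j = i + 1
        while j < len(entries) and entries[j][1] == string:
            j += 1
        result.append((name, value))
        aliases = entries[i + 1:j]
        if aliases and j < len(entries):
            # Reserve the next value under the next unique entry's name.
            result.append((entries[j][0] + "__PLACEHOLDER", value + 1))
        for alias_name, _ in aliases:
            result.append((alias_name, value))
        value += 1
        i = j
    return result


# Hypothetical entries: "a" appears under two prefixes.
codes = assign_codes([
    ("HTE_A", "a"),
    ("SVGE_A", "a"),
    ("HTE_B", "b"),
])
```

In this sketch HTE_A and SVGE_A share one value, and HTE_B receives the value reserved by HTE_B__PLACEHOLDER, so the unique entries keep increasing by one.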
Some of the entries are really just there to hold the value of some important properties of the table, like where it starts and where it ends. The value of the first entry will always be HTE_FIRST. The last of the normal elements will have the value of HTE_LAST and the absolutely last, after all the special elements, will be HTE_ABSOLUTELY_LAST. The group of really special elements will be between HTE_LAST and HTE_FIRST_SPECIAL.
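Delimiters like these make range checks cheap. The Python sketch below shows the idea with made-up numeric values and two hypothetical helper functions; the real enum is far longer, and whether the layout code uses exactly these comparisons is an assumption.

```python
from enum import IntEnum


class Type(IntEnum):
    # All numeric values here are invented for the sketch.
    HTE_FIRST = 0
    HTE_A = 0              # first real element, same value as HTE_FIRST
    HTE_SPAN = 1
    HTE_LAST = 1           # same value as the last real element
    HTE_TEXT = 2           # special, but sometimes treated as real
    HTE_FIRST_SPECIAL = 3
    HTE_DOCTYPE = 4
    HTE_ABSOLUTELY_LAST = 5


def has_string_representation(t):
    """True for real elements, which lie between HTE_FIRST and HTE_LAST."""
    return Type.HTE_FIRST <= t <= Type.HTE_LAST


def is_real_for_layout(t):
    """True for real elements and the group just before HTE_FIRST_SPECIAL."""
    return Type.HTE_FIRST <= t < Type.HTE_FIRST_SPECIAL
```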
A long enum representing all the normal attributes and the special ones. Entries are categorized as real attributes (with a string representation), special attributes (without a string representation), placeholders (for keeping the values increasing monotonically) and delimiters.
Real attributes will have the format prefix + 'A_' + name, where prefix is
to indicate which namespace it usually belongs to, like H for HTML or W for
WML, and name is the uppercased name of the attribute name string it
represents. Example: HA_HREF for the href attribute.
Special attributes do not have a normal string representation of the name (like HA_XML). They have the same format as the real attributes, except that the name part does not match any string representation. They are all placed after the HA_LAST entry in the enum.
Placeholders are inserted after the first of several consecutive entries that share the same string. If
more than one namespace has the same string, an entry for each prefix is added to
the enum, but they all have the same numerical value as the first entry. So that the next
unique entry gets the value immediately following that shared value, it is assigned the value of
the placeholder. The format for placeholder entries is next + '__PLACEHOLDER',
where next is the name of the next unique entry (so that it is easy to see which
attribute it really is in a debugger).
Some of the entries are really just there to hold the value of some important properties of the table, like where it starts and where it ends. The value of the first entry will always be HA_FIRST. The last of the normal attributes will have the value of HA_LAST and the absolutely last, after all the special attributes, will be HA_ABSOLUTELY_LAST.
This enum holds the constants for the namespaces used by the HTML 5 parser.
All elements and attributes are specified in a module.markup file that can be placed in the root directory of any module. This means that any module can specify its own elements or attributes.
The format of the module.markup file is an XML application with the DTD specified below.
<!ELEMENT markup (elements | attributes)*>
<!ELEMENT elements (elm)*>
<!ATTLIST elements
prefix CDATA #REQUIRED
ns CDATA #REQUIRED
>
<!ELEMENT attributes (attr)*>
<!ATTLIST attributes
prefix CDATA #REQUIRED
ns CDATA #REQUIRED
>
<!ELEMENT elm EMPTY>
<!ATTLIST elm
name CDATA #IMPLIED
str CDATA #IMPLIED
>
<!ELEMENT attr EMPTY>
<!ATTLIST attr
name CDATA #IMPLIED
str CDATA #IMPLIED
>
This element is just a placeholder for the other elements.
It has no attributes.
It can contain any number of 'elements' or 'attributes' elements.
This element surrounds a group of elements with the same namespace and code prefix.
It has two attributes that must be specified: prefix and ns
It can contain any number of elm elements.
This element surrounds a group of attributes with the same namespace and code prefix.
It has two attributes that must be specified: prefix and ns
It can contain any number of attr elements.
This element describes an entry in the element name and code tables.
It has two attributes, of which at least one must be specified: str and name
This element cannot have any content.
This element describes an entry in the attribute name and code tables.
It has two attributes, of which at least one must be specified: str and name
This element cannot have any content.
<?xml version="1.0"?>
<markup>
<elements prefix="HT" ns="HTML">
<elm name="MY_SPECIAL_ELM"/>
<elm str="div"/>
</elements>
<elements prefix="SVG" ns="SVG">
<elm str="niceElm"/>
</elements>
<attributes prefix="W" ns="WML">
<attr name="MY_SPECIAL_ATTR"/>
<attr str="niceAttr"/>
</attributes>
</markup>
That file will generate the following data (in different tables, this is just a short-hand notation):
elements:
{Markup::HTE_DIV, "div", "DIV"}
{Markup::SVGE_NICEELM, ns == Markup::SVG ? "niceElm" : "niceelm", "NICEELM"}
{Markup::HTE_MY_SPECIAL_ELM, "", "", ""}
attributes:
{Markup::WA_NICEATTR, ns == Markup::WML ? "niceAttr" : "niceattr", "NICEATTR"}
{Markup::WA_MY_SPECIAL_ATTR, "", "", ""}
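To make the mapping from module.markup to table entries more concrete, here is a small Python sketch that reads the example file and derives the code name and the three string forms for each entry. It is not the logic of mkmarkup.py: the function read_entries, the tuple layout, and the rule that name takes precedence over str when both are present are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

MARKUP = """<markup>
  <elements prefix="HT" ns="HTML">
    <elm name="MY_SPECIAL_ELM"/>
    <elm str="div"/>
  </elements>
  <elements prefix="SVG" ns="SVG">
    <elm str="niceElm"/>
  </elements>
  <attributes prefix="W" ns="WML">
    <attr name="MY_SPECIAL_ATTR"/>
    <attr str="niceAttr"/>
  </attributes>
</markup>"""


def read_entries(xml_text):
    """Return a list of (code, ns, lowercase, mixed_case, uppercase)."""
    entries = []
    for group in ET.fromstring(xml_text):
        # 'E_' for element groups, 'A_' for attribute groups.
        infix = "E_" if group.tag == "elements" else "A_"
        for item in group:
            string = item.get("str", "")
            # Assumption: an explicit name wins over the uppercased str.
            name = item.get("name") or string.upper()
            if not name:
                raise ValueError("entry needs at least one of str and name")
            code = group.get("prefix") + infix + name
            # Only keep a mixed-case form when it differs from lowercase.
            mixed = string if string != string.lower() else ""
            entries.append(
                (code, group.get("ns"), string.lower(), mixed, string.upper())
            )
    return entries


entries = read_entries(MARKUP)
```

Entries that only carry a name attribute end up with empty strings in all three string positions, which matches the special entries in the shorthand listing above.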
The HTML 5 specification says that, during tokenization, all element and attribute names should be treated as lowercase. That means that if a document contains <A hRef="fisk">Fish</a> the resulting element would be named a, have an attribute called href and the end tag would match the start tag.
To make SVG and MathML elements with mixed-case names work, there is an adjustment process for elements in foreign content. If an element or attribute name matches one of a certain list of names, it is replaced with the correctly cased name before the element is created.
<svg><radialGradient gradientUnits="foo"></svg>
would be tokenized to:
| svg
| radialgradient
| gradientunits = "foo"
and when inserted into the tree, the names would be adjusted to (extra elements inserted
automatically by the tree building process):
| html
| head
| body
| svg
| radialGradient
| gradientUnits = "foo"
On the other hand, if the mixed-case names are used on an element in a namespace other than the one they were meant for (as when SVG elements are used in an HTML context), the names should not be adjusted.
<html><radialGradient gradientUnits="foo"></html>
would be tokenized to:
| html
| radialgradient
| gradientunits = "foo"
and when inserted into the tree, the element is in the HTML namespace, so no adjustment takes place and the resulting tree will be (extra elements inserted automatically by the tree building process):
| html
| head
| body
| radialgradient
| gradientunits = "foo"
and it will of course not work as an SVG element anymore.
For this reason, the name tables have both a lowercase entry and a mixed-case entry for the names that are not the same when lowercased. Special case: the textarea element exists in both HTML and SVG, but the element is called textArea in SVG. This is treated as a special case in the mkmarkup.py script, where the HTML entry in module.markup is set to textArea as well, but the real representation will be textarea in HTML.
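The case adjustment described above can be summarized with a short Python sketch. The MIXED_CASE table and the adjusted_name function are hypothetical and only model the behaviour: the tokenizer lowercases every name, and the mixed-case form is restored only when the element ends up in the namespace the name was meant for.

```python
# Hypothetical table: lowercased name -> (mixed-case name, namespace).
MIXED_CASE = {
    "radialgradient": ("radialGradient", "SVG"),
    "gradientunits": ("gradientUnits", "SVG"),
    "textarea": ("textArea", "SVG"),
}


def adjusted_name(token_name, ns):
    """Return the name to use in the tree for a tokenized name."""
    lower = token_name.lower()  # the tokenizer treats names as lowercase
    mixed = MIXED_CASE.get(lower)
    if mixed is not None and mixed[1] == ns:
        return mixed[0]  # foreign content: restore the mixed case
    return lower
```

With this table, adjusted_name("radialGradient", "SVG") keeps the mixed case, while the same name in the HTML namespace stays lowercase, as in the two tree dumps above.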
Some DOM EcmaScript functions, like Element.tagName, are specified to return the uppercased version of the element or attribute names, so we have a table with the uppercased names as well to avoid having to do the transformation on-the-fly.