
Markup table generation

Instead of storing strings for the names of elements and attributes, Opera uses numeric codes that map to string tables holding the canonical names. The tables of names and type codes for elements and attributes are generated at build time by the script modules/logdoc/scripts/mkmarkup.py. The generated tables are based on the information found in files called module.markup, which can be put in any module in the tree. The generated files combine all the element and attribute names found in all the module.markup files.

The generated files reside in the modules/logdoc/src/html5 folder. The following files are generated:

elementtypes.h
Type codes for element names.
elementnames.h
The strings for canonical names in flattened lowercase (for when parsed as HTML), mixed case (for when parsed as XML or foreign content in HTML) and uppercase (returned by DOM in some cases).
attrtypes.h
Type codes for all attribute names.
attrnames.h
The strings for canonical names in flattened lowercase (for when parsed as HTML), mixed case (for when parsed as XML or foreign content in HTML) and uppercase (returned by DOM in some cases).

In addition, the same folder holds two files, elementhashbase.h and attrhashbase.h, which are generated by the same script but usually not during a build.

The tables

g_html5_tag_names

Holds all flattened and mixed case strings of all the elements. Each entry that has a mixed case representation also has a namespace (Markup::Ns) field indicating when to use the mixed case version.

The table is a one-dimensional array since the length of the entries can differ quite a lot. The g_html5_tag_indices table gives the offset into the array for each type.
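
As an illustration of why a flat array plus an offset table suits names of very different lengths, here is a minimal Python sketch. It is not the layout that mkmarkup.py actually emits: the NUL terminator, the helper names build_flat_table and name_at, and the sample names are assumptions made for the example, and the namespace field and mixed case variants are left out.

```python
def build_flat_table(names):
    """Pack names into one flat character list plus an offset per type code."""
    flat = []      # plays the role of g_html5_tag_names
    indices = []   # plays the role of g_html5_tag_indices
    for name in names:
        indices.append(len(flat))   # offset where this entry starts
        flat.extend(name)           # the characters of the name
        flat.append("\0")           # assumed terminator between entries
    return flat, indices


def name_at(flat, indices, type_code):
    """Read the name for a type code back out of the flat table."""
    start = indices[type_code]
    end = flat.index("\0", start)   # scan forward to the terminator
    return "".join(flat[start:end])


flat, indices = build_flat_table(["a", "blockquote", "div"])
print(indices)                      # [0, 2, 13]
print(name_at(flat, indices, 1))    # blockquote
```

A short name costs only its own length plus the terminator, whereas a two-dimensional table would have to reserve the width of the longest name for every entry.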

g_html5_tag_names_upper

Holds the uppercased versions of all the element names. Used by some DOM functions.

g_html5_attr_names

Holds all flattened and mixed case strings of all the attributes. Each entry that has a mixed case representation also has a namespace (Markup::Ns) field indicating when to use the mixed case version.

The table is a one-dimensional array since the length of the entries can differ quite a lot. The g_html5_attr_indices table gives the offset into the array for each type.

g_html5_attr_names_upper

Holds the uppercased versions of all the attribute names. Used by some DOM functions.

The Markup class

The Markup class acts as a namespace for the element and attribute type constants. It contains enums for element codes (Markup::Type), attribute codes (Markup::AttrType) and namespace constants (Markup::Ns) used by the HTML 5 parser.

Markup::Type

A long enum representing all the normal elements and the special ones. Entries are categorized as real elements (with a string representation), special elements (without a string representation), placeholders (for keeping the values increasing monotonically) and delimiters.

Real elements

Real elements will have the format prefix + 'E_' + name, where prefix is to indicate which namespace it usually belongs to, like HT for HTML or SVG for SVG, and name is the uppercased name of the element name string it represents. Example: HTE_SPAN for the span element.

Special elements

Special elements do not have a normal string representation of the name (like HTE_TEXTGROUP), even though some of them can appear in markup (like HTE_DOCTYPE). They have the same format as the real elements, except that the name part does not match any string representation. They are all placed after the HTE_LAST entry in the enum. Some of the special elements are even more special in that they are sometimes considered to be real elements for layout purposes (like HTE_TEXT), and they are placed between HTE_LAST and HTE_FIRST_SPECIAL. Sounds confusing? It is ;-)

Placeholders

Placeholders are inserted after the first of several consecutive entries of the same string. If more than one namespace has the same string, an entry with the different prefixes will be added to the enum, but they will all have the same numerical value as the first entry. To make the next unique entry have the value immediately following the last one, it will be assigned the value the placeholder has. The format for placeholder entries is next + '__PLACEHOLDER' where next is the name of the next unique entry (so that it is easy to see which element it really is in a debugger).
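
The numbering rule can be sketched in a few lines of Python. This models only the resulting values, not the generated C++ enum text or the placeholder entries themselves. The function name assign_values, the starting value 0 and the sample entries (an "a" string shared by an HTML and an SVG code) are assumptions for illustration.

```python
def assign_values(entries, first_value=0):
    """Give every code a value; codes that share a string share a value.

    entries is a list of (code_name, string) pairs, ordered so that
    entries with the same string are next to each other.
    """
    values = {}
    value = first_value - 1
    previous = None
    for code_name, string in entries:
        if string != previous:
            value += 1          # a new unique string takes the next value
            previous = string
        values[code_name] = value
    return values


entries = [("HTE_A", "a"), ("SVGE_A", "a"), ("HTE_ABBR", "abbr")]
print(assign_values(entries))
# {'HTE_A': 0, 'SVGE_A': 0, 'HTE_ABBR': 1}
```

The two codes for "a" end up with the same value, and the next unique entry continues directly after it, which is the property the placeholder entries are there to preserve.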

Delimiters

Some of the entries are really just there to hold the value of some important properties of the table, like where it starts and where it ends. The value of the first entry will always be HTE_FIRST. The last of the normal elements will have the value of HTE_LAST and the absolutely last, after all the special elements, will be HTE_ABSOLUTELY_LAST. The group of really special elements will be between HTE_LAST and HTE_FIRST_SPECIAL.
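
One way to read the delimiters is as the boundaries of range checks. The sketch below uses made-up numeric values, and the choice of which bounds are inclusive is an assumption. The real values and the exact comparisons are defined by the generated enum and the code that uses it.

```python
# Hypothetical delimiter values; the real ones come from the generated enum.
HTE_FIRST = 0
HTE_LAST = 200
HTE_FIRST_SPECIAL = 210
HTE_ABSOLUTELY_LAST = 230


def classify(type_code):
    """Sort a type code into the groups that the delimiters mark off."""
    if HTE_FIRST <= type_code <= HTE_LAST:
        return "real"
    if HTE_LAST < type_code < HTE_FIRST_SPECIAL:
        # The "really special" group, such as HTE_TEXT.
        return "special, treated as real for layout"
    if HTE_FIRST_SPECIAL <= type_code <= HTE_ABSOLUTELY_LAST:
        return "special"
    return "out of range"


print(classify(42))    # real
print(classify(205))   # special, treated as real for layout
print(classify(220))   # special
```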

Markup::AttrType

A long enum representing all the normal attributes and the special ones. Entries are categorized as real attributes (with a string representation), special attributes (without a string representation), placeholders (for keeping the values increasing monotonically) and delimiters.

Real attributes

Real attributes will have the format prefix + 'A_' + name, where prefix is to indicate which namespace it usually belongs to, like H for HTML or W for WML, and name is the uppercased name of the attribute name string it represents. Example: HA_HREF for the href attribute.

Special attributes

Special attributes do not have a normal string representation of the name (like HA_XML). They have the same format as the real attributes, except that the name part does not match any string representation. They are all placed after the HA_LAST entry in the enum.

Placeholders

Placeholders are inserted after the first of several consecutive entries of the same string. If more than one namespace has the same string, an entry with the different prefixes will be added to the enum, but they will all have the same numerical value as the first entry. To make the next unique entry have the value immediately following the last one, it will be assigned the value the placeholder has. The format for placeholder entries is next + '__PLACEHOLDER' where next is the name of the next unique entry (so that it is easy to see which attribute it really is in a debugger).

Delimiters

Some of the entries are really just there to hold the value of some important properties of the table, like where it starts and where it ends. The value of the first entry will always be HA_FIRST. The last of the normal attributes will have the value of HA_LAST and the absolutely last, after all the special attributes, will be HA_ABSOLUTELY_LAST.

Markup::Ns

This enum holds the constants for the namespaces used by the HTML 5 parser.

module.markup

All elements and attributes are specified in a module.markup file that can be placed in the root directory of any module. This means that any module can specify its own elements or attributes.

Format

The format of the module.markup file is an XML application with the DTD specified below.

DTD

<!ELEMENT markup (elements | attributes)>
<!ELEMENT elements (elm)>
<!ATTLIST elements
    prefix CDATA #REQUIRED
    ns CDATA #REQUIRED
>
<!ELEMENT attributes (attr)>
<!ATTLIST attributes
    prefix CDATA #REQUIRED
    ns CDATA #REQUIRED
>
<!ELEMENT elm EMPTY>
<!ATTLIST elm
    name CDATA #REQUIRED
    str CDATA #REQUIRED
>
<!ELEMENT attr EMPTY>
<!ATTLIST attr
    name CDATA #REQUIRED
    str CDATA #REQUIRED
>

The markup element

This element is just a placeholder for the other elements.

It has no attributes.

It can contain any number of elements or attributes elements.

The elements element

This element surrounds a group of elements with the same namespace and code prefix.

It has two attributes that must be specified: prefix and ns

prefix
Contains the prefix that will precede the element type code. The value will be uppercased when used. Example: prefix="HT" will yield a code name like HTE_MYELEMENT.
ns
Contains a constant describing the XML namespace that the element will be used in. This is used for differentiating between the case flattened or original string. The value for this attribute is the same as in the Markup::Ns enum found in modules/logdoc/markup.h.

It can contain any number of elm elements.

The attributes element

This element surrounds a group of attributes with the same namespace and code prefix.

It has two attributes that must be specified: prefix and ns

prefix
Contains the prefix that will precede the attribute type code. The value will be uppercased when used. Example: prefix="H" will yield a code name like HA_MYATTRIBUTE.
ns
Contains a constant describing the XML namespace that the attribute will be used in. This is used for differentiating between the case flattened or original string. The value for this attribute is the same as in the Markup::Ns enum found in modules/logdoc/markup.h.

It can contain any number of attr elements.

The elm element

This element describes an entry in the element name and code tables.

It has two attributes, of which at least one must be specified: str and name

str
Case sensitive. This attribute is used to specify a normal element that has a string representation. Example: the "a" element in HTML.
name
This attribute is used to specify an element that has no normal string representation but can be inserted into the tree as a placeholder, or an element that behaves in a special way, like a processing instruction or doctype. Example: The SVGE_BASE_SHADOWROOT element.

This element cannot have any content.

The attr element

This element describes an entry in the attribute name and code tables.

It has two attributes, of which at least one must be specified: str and name

str
Case sensitive. This attribute is used to specify a normal attribute that has a string representation. Example: the "viewBox" attribute in SVG.
name
This attribute is used to specify an attribute that has no normal string representation but can be set on an element as a placeholder, or an attribute that behaves in a special way, like holding both the name and value of an unknown attribute. Example: The ANIMATED_MARKER_PROP attribute.

This element cannot have any content.

Example:

<?xml version="1.0"?>
<markup>
  <elements prefix="HT" ns="HTML">
    <elm name="MY_SPECIAL_ELM"/>
    <elm str="div"/>
  </elements>
  <elements prefix="SVG" ns="SVG">
    <elm str="niceElm"/>
  </elements>
  <attributes prefix="W" ns="WML">
    <attr name="MY_SPECIAL_ATTR"/>
    <attr str="niceAttr"/>
  </attributes>
</markup>

That file will generate the following data (in different tables, this is just a short-hand notation):

elements:
{Markup::HTE_DIV, "div", "DIV"}
{Markup::SVGE_NICEELM, ns == Markup::SVG ? "niceElm" : "niceelm", "NICEELM"}
{Markup::HTE_MY_SPECIAL_ELM, "", "", ""}

attributes:
{Markup::WA_NICEATTR, ns == Markup::WML ? "niceAttr" : "niceattr", "NICEATTR"}
{Markup::WA_MY_SPECIAL_ATTR, "", "", ""}
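
To make the mapping from a module.markup file to table rows concrete, here is a minimal Python sketch that reads the example file with the standard xml.etree.ElementTree module. It is not the mkmarkup.py implementation: the function name read_markup and the row layout are assumptions, the ns attribute is ignored, entries are kept in file order, and preferring str when both str and name are present is a guess.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<?xml version="1.0"?>
<markup>
  <elements prefix="HT" ns="HTML">
    <elm name="MY_SPECIAL_ELM"/>
    <elm str="div"/>
  </elements>
  <elements prefix="SVG" ns="SVG">
    <elm str="niceElm"/>
  </elements>
  <attributes prefix="W" ns="WML">
    <attr name="MY_SPECIAL_ATTR"/>
    <attr str="niceAttr"/>
  </attributes>
</markup>"""


def read_markup(xml_text):
    """Return (code, lowercase, original case, uppercase) rows per entry."""
    rows = []
    for group in ET.fromstring(xml_text):
        # 'E_' for <elements> groups, 'A_' for <attributes> groups.
        infix = "E_" if group.tag == "elements" else "A_"
        prefix = group.get("prefix").upper()
        for entry in group:
            string = entry.get("str")
            if string is not None:
                # Real entry: the code name is derived from the string.
                rows.append((prefix + infix + string.upper(),
                             string.lower(), string, string.upper()))
            else:
                # Special entry: it has a code but no string representation.
                rows.append((prefix + infix + entry.get("name"), "", "", ""))
    return rows


for row in read_markup(EXAMPLE):
    print(row)
```

Each row carries the flattened, original and uppercased forms side by side, which corresponds to the three strings in the short-hand notation above.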

Case flattening

The HTML 5 specification says that, during tokenization, all element and attribute names should be treated as lowercase. That means that if a document contains <A hRef="fisk">Fish</a> the resulting element would be named a, have an attribute called href and the end tag would match the start tag.

To make SVG and MathML elements with mixed case work, there is an adjustment process for elements in foreign content. If an element or attribute name matches one of a fixed list of names, it is replaced with the correctly cased name before the element is created.

Example: <svg><radialGradient gradientUnits="foo"></svg> would be tokenized to:

| svg
|   radialgradient
|     gradientunits = "foo"

and when inserted into the tree, the names would be adjusted to (extra elements inserted automatically by the tree building process):

| html
|   head
|   body
|     svg
|       radialGradient
|         gradientUnits = "foo"

If, on the other hand, the mixed case names are used on an element in a namespace other than the one they were meant for (like when SVG elements are used in an HTML context), the names should not be adjusted.

Example: <html><radialGradient gradientUnits="foo"></html> would be tokenized to:

| html
|   radialgradient
|     gradientunits

and when inserted into the tree, the element is in the HTML namespace and no adjustment will take place and the resulting element will be (extra elements inserted automatically by the tree building process):

| html
|   head
|   body
|     radialgradient
|       gradientunits = "foo"

and it will of course not work as an SVG element anymore.

For this reason the name tables have both a lowercase entry and a mixed case entry for the names that are not the same when lowercased.

Special case: The textarea element exists in both HTML and SVG, but the element is called textArea in SVG. This is treated as a special case in the mkmarkup.py script: the HTML entry in module.markup is set to textArea as well, but the real representation will be textarea in HTML.
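
The flattening and adjustment steps described in this section can be sketched as two small Python functions. The table excerpt, the function names and the use of plain strings for namespaces are assumptions made for the example. The real parser works on the generated name tables and Markup::Ns constants.

```python
# Hypothetical excerpt of a name table: lowercase name -> (mixed case, namespace).
MIXED_CASE = {
    "radialgradient": ("radialGradient", "SVG"),
    "gradientunits": ("gradientUnits", "SVG"),
}


def tokenize_name(name):
    """During tokenization every name is treated as lowercase."""
    return name.lower()


def adjust_name(flattened, ns):
    """Restore the mixed case only inside the namespace the name belongs to."""
    mixed, owner_ns = MIXED_CASE.get(flattened, (flattened, None))
    return mixed if ns == owner_ns else flattened


token = tokenize_name("radialGradient")
print(adjust_name(token, "SVG"))    # radialGradient
print(adjust_name(token, "HTML"))   # radialgradient
```

Keeping the namespace next to the mixed case string is what lets the same lowercase token resolve differently depending on where the element ends up in the tree.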

Some DOM EcmaScript functions, like Element.tagName, are specified to return the uppercased version of the element or attribute names, so we have a table with the uppercased names as well to avoid having to do the transformation on-the-fly.


2011-08-30, stighal