Module: Pubsuffix

About this module

Pubsuffix retrieves XML documents from an online repository that describe the domain hierarchy structure of individual Top Level Domains (TLDs). This provides a reasonably good way to tell registry-like domains such as co.uk and vgs.no from ordinary domains such as bbc.co.uk and vg.no. The information can then be used as a foundation for security policies for cookies, JavaScript (document.domain), and other features.

If no online specification can be found, a DNS based heuristic is used to guess what type of domain we are looking at.

This module is generally not accessed directly. The primary API is ServerName::GetDomainTypeASync(), which either returns the domain type of the server name or initiates the asynchronous operation(s) to retrieve and decode the data.

The XML file format is described in draft-pettersen-subtld-structure.txt, an IETF Internet-Draft, and the DNS heuristic is described in draft-pettersen-dns-cookie-validate.txt.

Interface overview and API documentation

API documentation generated by Doxygen contains all necessary information for the external APIs.

Public API

ServerName::GetDomainTypeASync()

(defined in url module) This function either returns the determined type of domain immediately, or ServerName::DOMAIN_WAIT_FOR_UPDATE if the type is not yet available. ServerName::GetCurrentDomainType() can be used to retrieve the current value, without triggering the lookup.

If DOMAIN_WAIT_FOR_UPDATE is returned, the caller must register a callback, and wait for a MSG_PUBSUF_FINISHED_AUTO_UPDATE_ACTION (Id is 0) message before retrying. Multiple attempts may be needed, since this message does not specify any unique ID.

ServerName::DOMAIN_UNKNOWN

The domain type is unknown.

ServerName::DOMAIN_NORMAL

The domain is (believed to be) a normal host or commercial domain like www.opera.com or opera.com.

ServerName::DOMAIN_REGISTRY

The domain is (believed to be) a registry-like domain like co.uk, vgs.no, or city.state.us.

ServerName::DOMAIN_TLD

The domain is a Top Level Domain.
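
The calling pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's code: the ServerName class below is a simplified stand-in that borrows only the enum values and method names from this document, and Name(), SimulateUpdateFinished() and DomainTypeWaiter are hypothetical helpers. A real caller would register a message callback for MSG_PUBSUF_FINISHED_AUTO_UPDATE_ACTION instead of calling OnUpdateFinishedMessage() by hand.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for ServerName (the real class lives in the url module).
class ServerName {
public:
    enum DomainType {
        DOMAIN_UNKNOWN,
        DOMAIN_NORMAL,
        DOMAIN_REGISTRY,
        DOMAIN_TLD,
        DOMAIN_WAIT_FOR_UPDATE
    };

    explicit ServerName(const std::string& name)
        : name_(name), type_(DOMAIN_UNKNOWN) {}

    const std::string& Name() const { return name_; }

    // Returns the current value without triggering a lookup.
    DomainType GetCurrentDomainType() const { return type_; }

    // Returns the type if it is known; otherwise the real function would
    // start a lookup here, and the caller is told to wait.
    DomainType GetDomainTypeASync() {
        return type_ != DOMAIN_UNKNOWN ? type_ : DOMAIN_WAIT_FOR_UPDATE;
    }

    // Hypothetical hook: stands in for the module storing a lookup result.
    void SimulateUpdateFinished(DomainType result) {
        assert(result != DOMAIN_WAIT_FOR_UPDATE);
        type_ = result;
    }

private:
    std::string name_;
    DomainType type_;
};

// Hypothetical caller-side helper that retries after each update message.
struct DomainTypeWaiter {
    explicit DomainTypeWaiter(ServerName* server)
        : host(server), result(ServerName::DOMAIN_UNKNOWN), done(false) {}

    // Asks for the domain type; returns true once an answer is available.
    bool Poll() {
        ServerName::DomainType type = host->GetDomainTypeASync();
        if (type == ServerName::DOMAIN_WAIT_FOR_UPDATE)
            return false;  // keep waiting for the next message
        result = type;
        done = true;
        return true;
    }

    // Call for every MSG_PUBSUF_FINISHED_AUTO_UPDATE_ACTION received. The
    // message carries no unique ID, so it may concern another TLD; retrying
    // is the only way to find out.
    bool OnUpdateFinishedMessage() { return done || Poll(); }

    ServerName* host;
    ServerName::DomainType result;
    bool done;
};
```

A caller would invoke Poll() once, and then OnUpdateFinishedMessage() for each message received until it returns true.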

Semi-Public API

PubsuffixModule::CheckDomainASync()

(Accessed through g_pubsuf_api->CheckDomainASync().) Main engine for ServerName::GetDomainTypeASync().

Initiates the loading and parsing of the specification files.

Returns OpRecStatus::FINISHED if the operation completed immediately, OpStatus::OK if a lookup is started (in which case MSG_PUBSUF_FINISHED_AUTO_UPDATE_ACTION will be posted when it is completed), or an OpStatus error code if there was an error.
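
A caller has to distinguish the three outcomes listed above. The sketch below shows one way to do that. OP_STATUS, OpStatus and OpRecStatus are reduced to stand-ins whose numeric values are assumptions, and CheckOutcome and ClassifyCheckResult() are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

// Stand-ins for the status types named above; the numeric values are assumed.
typedef int OP_STATUS;

struct OpStatus {
    enum { OK = 0, ERR = -1, ERR_NO_MEMORY = -2 };
    static bool IsError(OP_STATUS status) { return status < 0; }
};

struct OpRecStatus : public OpStatus {
    enum { FINISHED = 2 };
};

// Hypothetical summary of what the caller should do next.
enum CheckOutcome {
    CHECK_DONE,     // result is available immediately
    CHECK_PENDING,  // wait for MSG_PUBSUF_FINISHED_AUTO_UPDATE_ACTION
    CHECK_FAILED    // an error occurred, no message will follow
};

CheckOutcome ClassifyCheckResult(OP_STATUS status) {
    if (OpStatus::IsError(status))
        return CHECK_FAILED;
    if (status == OpRecStatus::FINISHED)
        return CHECK_DONE;
    // The only remaining documented value is OpStatus::OK: a lookup started.
    assert(status == OpStatus::OK);
    return CHECK_PENDING;
}
```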

Implementation description

API documentation generated by Doxygen contains information about the internal organization of the module.

Core APIs

PubsuffixModule

This is the module object. It contains a list of TLDs that have been handled in this session. It may also (optionally) contain a list of override URLs for some TLDs.

CheckDomainASync()

Starts the process of retrieving the domain type of the requested domain. The operation can be asynchronous; see above for a description. While a TLD is being looked up, all other requests for the same TLD are blocked (in the case of the XML file, the results become available to them at the same time as to the first requester).
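
The blocking of concurrent requests for the same TLD can be pictured as a small piece of bookkeeping keyed by TLD. The following is a sketch only: PendingTldLookups and its methods are hypothetical names, and the real module may organize this differently.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping for lookups in progress, keyed by TLD.
class PendingTldLookups {
public:
    // Registers a request for `domain` under `tld`. Returns true if this is
    // the first request and the caller must start the lookup, false if a
    // lookup for the TLD is already running and the request just waits.
    bool Join(const std::string& tld, const std::string& domain) {
        assert(!tld.empty());
        std::vector<std::string>& waiters = pending_[tld];
        const bool first = waiters.empty();
        waiters.push_back(domain);
        return first;
    }

    // Called when the lookup for `tld` completes. Returns every domain that
    // was waiting, so they can all be answered at the same time.
    std::vector<std::string> Finish(const std::string& tld) {
        std::vector<std::string> released;
        Map::iterator it = pending_.find(tld);
        if (it != pending_.end()) {
            released.swap(it->second);
            pending_.erase(it);
        }
        return released;
    }

    bool IsPending(const std::string& tld) const {
        return pending_.count(tld) != 0;
    }

private:
    typedef std::map<std::string, std::vector<std::string> > Map;
    Map pending_;
};
```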

HaveCheckedDomain()

Has the specification of the identified TLD been checked in this session?

SetHaveCheckedDomain()

Marks the specified TLD as checked in this session. This prevents multiple requests in the same session, particularly for failed requests.

AddUpdateOverride()

(Import using API_PUBSUFFIX_OVERRIDE_UPDATE) Adds an override URL for a given TLD. Such overrides are NOT expected to be digitally signed.
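
The session bookkeeping described for HaveCheckedDomain(), SetHaveCheckedDomain() and AddUpdateOverride() can be sketched as below. The method names come from this document, but their signatures are assumptions; PubsuffixSessionState and FindUpdateOverride() are hypothetical, and standard containers are used here in place of the module's own linked list. The URL in the usage note is a placeholder.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical sketch of the per-session state kept by the module object.
class PubsuffixSessionState {
public:
    // Has the specification for this TLD been checked in this session?
    bool HaveCheckedDomain(const std::string& tld) const {
        return checked_.count(tld) != 0;
    }

    // Marks the TLD as checked, whether or not the check succeeded, so that
    // a failed request is not repeated during the session.
    void SetHaveCheckedDomain(const std::string& tld) {
        assert(!tld.empty());
        checked_.insert(tld);
    }

    // Registers an override URL for a TLD (assumed signature).
    void AddUpdateOverride(const std::string& tld, const std::string& url) {
        assert(!tld.empty());
        overrides_[tld] = url;
    }

    // Hypothetical helper: returns the override URL, or "" if none is set.
    std::string FindUpdateOverride(const std::string& tld) const {
        std::map<std::string, std::string>::const_iterator it =
            overrides_.find(tld);
        return it != overrides_.end() ? it->second : std::string();
    }

private:
    std::set<std::string> checked_;
    std::map<std::string, std::string> overrides_;
};
```

For example, after AddUpdateOverride("uk", "https://example.invalid/uk.xml"), FindUpdateOverride("uk") returns that URL while FindUpdateOverride("no") returns an empty string.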

PublicSuffix_Updater

This class handles the actual download and parsing of the TLD's XML specification, and also manages the DNS based fallback handler.

The updater downloads the file unless it has already been downloaded and is still current. It then processes the parsed XML file and, if that fails, initiates an attempt to check the type of the requested domain name using the DNS fallback.

Construction

The object is initialized using the TLD and the domain name that triggered the action. The Construct() step creates either a default URL on the online repository or a URL based on a specified override for this specific TLD.

StartLoading()

This will either start loading the specified URL, or if it has already been loaded, process the document immediately, and return. If necessary, a fallback is initiated.

ProcessFile()

This function parses the file, and sets all ServerNames in the specified TLD to the appropriate type.

SetFinished()

This will, in case of failure by the XML step, initiate a DNS fallback. Otherwise it will complete the operation and indicate that the request is finished.
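
The order of the steps (XML first, DNS fallback only on failure) can be condensed into the sketch below. It is deliberately synchronous, whereas the real updater is message driven; UpdaterFlowSketch, UpdateState and the two callbacks are hypothetical names used only to show the control flow.

```cpp
#include <cassert>
#include <functional>

enum UpdateState {
    UPDATE_IDLE,
    UPDATE_FINISHED_XML,  // the XML specification was processed
    UPDATE_FINISHED_DNS,  // the XML step failed, the DNS fallback answered
    UPDATE_FAILED         // both steps failed
};

// Hypothetical condensation of the StartLoading / ProcessFile / SetFinished
// sequence. Each callback returns true on success.
class UpdaterFlowSketch {
public:
    UpdaterFlowSketch(std::function<bool()> process_xml,
                      std::function<bool()> dns_fallback)
        : process_xml_(process_xml),
          dns_fallback_(dns_fallback),
          state_(UPDATE_IDLE) {}

    UpdateState Run() {
        assert(state_ == UPDATE_IDLE);  // a sketch object is run only once
        if (process_xml_())
            state_ = UPDATE_FINISHED_XML;
        else if (dns_fallback_())       // only reached when XML failed
            state_ = UPDATE_FINISHED_DNS;
        else
            state_ = UPDATE_FAILED;
        return state_;
    }

    UpdateState state() const { return state_; }

private:
    std::function<bool()> process_xml_;
    std::function<bool()> dns_fallback_;
    UpdateState state_;
};
```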

DNS_RegistryCheck_Handler

This is a fallback mechanism using DNS to determine if a given domain name is an ordinary name or a registry-like domain. The rule is that any domain name with an IP address is considered an ordinary domain, and one without is considered a registry-like domain.
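
The rule itself is a single decision, sketched below. The address lookup is passed in as a callback so that the example does not depend on any particular resolver; ClassifyByDns(), HeuristicDomainType and the callback are hypothetical names, not part of the module.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum HeuristicDomainType { HEURISTIC_NORMAL, HEURISTIC_REGISTRY };

// Applies the DNS heuristic: a name with at least one IP address is treated
// as an ordinary domain, a name without one as a registry-like domain.
// `has_address` is a hypothetical callback that performs the name lookup.
HeuristicDomainType ClassifyByDns(
    const std::string& name,
    const std::function<bool(const std::string&)>& has_address) {
    assert(!name.empty());
    return has_address(name) ? HEURISTIC_NORMAL : HEURISTIC_REGISTRY;
}
```

In a test, the callback can be a lambda over a fixed table of names; a production callback would wrap whatever name lookup the platform provides.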

In cases where a proxy is configured for a host, an HTTP HEAD request is sent instead of just doing a name lookup. This is because a DNS request for a proxied host might not return a valid result, as is the case when the client is behind a very tight firewall.

Construction

Initialized using the hostname to check.

Start_Lookup

Starts the lookup process. Normal use is to create a Comm object that will just do a DNS name lookup. When a proxy is configured a URL http://domainname/ with method HEAD is created, and used to request the information; if a proxy is not configured for that particular host, then the Comm method is used instead.
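
The choice between the two lookup methods can be sketched as follows. LookupMethod, LookupPlan and PlanLookup() are hypothetical names; only the URL form http://domainname/ and the HEAD method come from the description above, and the proxy decision is reduced to a boolean supplied by the caller.

```cpp
#include <cassert>
#include <string>

enum LookupMethod {
    LOOKUP_DNS_VIA_COMM,  // plain name lookup through a Comm object
    LOOKUP_HTTP_HEAD      // HEAD request sent through the proxy
};

struct LookupPlan {
    LookupMethod method;
    std::string url;  // only filled in for LOOKUP_HTTP_HEAD
};

// Picks the lookup method for a hostname. `proxy_for_host` says whether a
// proxy is configured for this particular host.
LookupPlan PlanLookup(const std::string& hostname, bool proxy_for_host) {
    assert(!hostname.empty());
    LookupPlan plan;
    if (proxy_for_host) {
        plan.method = LOOKUP_HTTP_HEAD;
        plan.url = "http://" + hostname + "/";
    } else {
        plan.method = LOOKUP_DNS_VIA_COMM;
    }
    return plan;
}
```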

Server

The XML files downloaded from the online repository are generated by the script server/converter.py, with assistance from the digital signature generator Python extension in server/signer.cpp (extension created using the distutils script in server/build_signer.py).

The current input to the converter script is the Mozilla Public Suffix list.

Footprint

The module is fairly small, about 10KB. Most of the processing is imported and performed by other modules, url and xmlfragment.

Dynamic memory use and OOM handling

OOM policies

In OOM situations the current operation is aborted and a failure notice is given to the action's owner, either by message or OpStatus return value. Very few functions LEAVE.

An OOM condition is raised when detected in the module itself.

Who handles OOM?

Currently OOM is handled locally by aborting the operation. In some operations, when it is possible, the caller is informed of the status.

Flow

Much of the module is message callback based, and these functions are not able to report OOM situations directly to the documents or UI. In these cases the current operation will be terminated, and error messages sent.

Heap memory usage

The module contains a list of TLDs that have been checked in the current session. These data are currently restricted to a single string for each TLD in a linked list (each element is less than about 32 bytes, plus an allocated string less than 8 bytes long). A user will usually visit at most a dozen or two different TLDs in a session, most of them having two or three characters in the name, meaning that typical heap memory use will be less than 2 KB.
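
The estimate can be checked with a few lines of arithmetic. The sketch below uses the upper bounds quoted above (32 for the list element and 8 for the string, both taken as bytes) and two dozen TLDs; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Upper-bound heap estimate for the list of checked TLDs, using the figures
// from the text: one list element plus one short allocated string per TLD.
std::size_t EstimateCheckedListBytes(std::size_t tld_count) {
    const std::size_t kElementBytes = 32;  // list element, upper bound
    const std::size_t kStringBytes = 8;    // TLD string, upper bound
    return tld_count * (kElementBytes + kStringBytes);
}

// Sanity check of the claim for a session touching two dozen TLDs:
// 24 * (32 + 8) = 960 bytes, which is below 2 KB.
inline void CheckTypicalSessionClaim() {
    assert(EstimateCheckedListBytes(24) == 960);
    assert(EstimateCheckedListBytes(24) < 2 * 1024);
}
```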

Optionally (API must be enabled), the module may also have a list of TLDs that have override URLs specified. The memory usage will primarily be decided by the string length of the URL.

The module also defines a cache context used to store the URLs containing the pubsuffix data (these are stored for 30 days in the disk cache).

Stack memory usage

The pubsuffix parsing operation is recursive, but the number of levels is determined by the document being parsed. It is usually limited to fewer than 6 levels.

Static memory usage

Only what is contained in the module object, described above.

Caching and freeing memory

Caching is performed by the url and cache module in a separate context, subject to their caching policies.

Freeing memory on exit

The list of checked TLDs is released automatically on exit, as is the cache context.

Temp buffers

None

Memory tuning

None directly available; memory usage may be controlled by the cache and URL modules.

Tests

Selftest in module and imported from cookies

Coverage

Selftest in module and imported from cookies

Design choices

The specifications are stored as individual files on the remote repository. The files are digitally signed.

The parsing only updates the servernames currently in use. Checks on new names require a new parsing of the file.
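
The consequence of this choice is illustrated below: only names that are in use at the time of parsing receive a type, and names of other TLDs are left alone. This is a heavily simplified sketch. The real XML format has richer rules than a flat set of registry-like names, and InUseNames, BelongsToTld() and ApplyTldRules() are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum DomainType { DOMAIN_UNKNOWN, DOMAIN_NORMAL, DOMAIN_REGISTRY, DOMAIN_TLD };

// Server names currently in use, with their known type.
typedef std::map<std::string, DomainType> InUseNames;

// True if `name` is the TLD itself or ends with "." followed by the TLD.
bool BelongsToTld(const std::string& name, const std::string& tld) {
    if (name == tld)
        return true;
    const std::string suffix = "." + tld;
    return name.size() > suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(),
                        suffix) == 0;
}

// Applies a (simplified) specification for one TLD to the names in use and
// returns how many names were updated. Names that are not in the map are
// never touched, so a name first seen later needs a new parse.
int ApplyTldRules(const std::string& tld,
                  const std::set<std::string>& registry_like,
                  InUseNames* in_use) {
    assert(in_use != 0);
    int updated = 0;
    for (InUseNames::iterator it = in_use->begin(); it != in_use->end(); ++it) {
        if (!BelongsToTld(it->first, tld))
            continue;
        if (it->first == tld)
            it->second = DOMAIN_TLD;
        else if (registry_like.count(it->first) != 0)
            it->second = DOMAIN_REGISTRY;
        else
            it->second = DOMAIN_NORMAL;
        ++updated;
    }
    return updated;
}
```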

Improvements

The specifications are stored as individual files on the remote repository. Combining all the files into one may reduce accesses to the server, and reduce

Handling of error situations might be improved, as might cache flushing.