Module: URL

About this module

The URL module performs, among other things, the following:

Supported protocols:

Interface overview and API documentation

API documentation generated by Doxygen contains all necessary information for the external APIs.

How to use URLs

About URL objects

URLs are held in URL objects, each of which represents an entire document. All actions involving a URL are initiated through the URL class API.

A URL object consists of two pointers: one to a URL_Rep object that contains all the URL information, and one to a URL_RelRep object (owned by the URL_Rep object) that identifies the fragment identifier part of the URL (the "#name" part of the URL name). Reference counters are updated in these objects each time a URL object is created or destroyed.
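The handle-plus-reference-count arrangement described above can be illustrated with a minimal sketch. All names here (RepSketch, RelRepSketch, UrlHandleSketch) are hypothetical stand-ins and not the module's classes; the sketch only counts references and assumes the caller owns the counted objects, and it assumes a handle without a fragment holds a null second pointer.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for URL_Rep: the URL data plus a reference count.
struct RepSketch {
    std::string name;
    int ref_count = 0;
};

// Hypothetical stand-in for URL_RelRep: one fragment identifier ("#name").
struct RelRepSketch {
    std::string fragment;
    int ref_count = 0;
};

// Hypothetical stand-in for the URL handle: two pointers, no data of its own.
class UrlHandleSketch {
public:
    UrlHandleSketch(RepSketch* rep, RelRepSketch* rel) : rep_(rep), rel_(rel) {
        assert(rep_ != nullptr);
        Acquire();
    }
    // Copying creates another reference, not another document.
    UrlHandleSketch(const UrlHandleSketch& other) : rep_(other.rep_), rel_(other.rel_) {
        Acquire();
    }
    // Copy-and-swap: the by-value parameter acquires the new targets,
    // and its destructor releases the old ones after the swap.
    UrlHandleSketch& operator=(UrlHandleSketch other) {
        std::swap(rep_, other.rep_);
        std::swap(rel_, other.rel_);
        return *this;
    }
    ~UrlHandleSketch() { Release(); }

private:
    void Acquire() {
        ++rep_->ref_count;
        if (rel_) ++rel_->ref_count;
    }
    void Release() {
        assert(rep_->ref_count > 0);
        --rep_->ref_count;
        if (rel_) --rel_->ref_count;
    }
    RepSketch* rep_;
    RelRepSketch* rel_;  // assumption: null when the URL has no fragment
};
```

In this sketch, copying a handle only increments the counters, which mirrors the statement that URL "copies" are references rather than duplicates of the data.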

URL objects are primarily created by the g_url_api->GetURL() function call, which has several options, such as creating a completely new URL or a URL resolved relative to an already known URL.

Additionally, URL constructors can be used to create "copies" of URLs (they are not copies, just references), or to add fragment identifiers ("#name").

Documents intending to use or display the data contained by a URL must lock it, using a URL_InUse object, which will prevent destruction of the URL's data while it is being used.
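As a rough illustration of the locking idea, the sketch below models a scoped lock that keeps document data alive while it exists. DocDataSketch and InUseSketch are hypothetical names invented for this example; the real URL_InUse interface is described in the Doxygen documentation and may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a URL's loaded data with a "used" lock counter.
struct DocDataSketch {
    std::string body;
    int used = 0;  // number of active locks

    // Frees the data only when nobody holds a lock on it.
    void FreeIfUnused() {
        if (used == 0) body.clear();
    }
};

// Hypothetical scoped lock: locks on construction, unlocks on destruction.
class InUseSketch {
public:
    explicit InUseSketch(DocDataSketch& data) : data_(data) { ++data_.used; }
    ~InUseSketch() {
        assert(data_.used > 0);
        --data_.used;
    }
    InUseSketch(const InUseSketch&) = delete;
    InUseSketch& operator=(const InUseSketch&) = delete;

private:
    DocDataSketch& data_;
};
```

While an InUseSketch exists, FreeIfUnused() leaves the body untouched; once the lock goes out of scope, the same call clears it.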

Unique URLs

There are two types of URLs: normal URLs, which are always retrievable through the cache and the visited URL list, and unique URLs, which are not. Unique URLs are primarily used for POST form requests. They are created the same way as normal URLs, but with the unique flag set to TRUE. Unlike normal URLs, they cannot be reached through new calls to g_url_api->GetURL(); new references can only be created by copying an existing URL object. When no document locks a unique URL, the cached document is deleted, and when no URL object references it, the URL_Rep is destroyed.
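The difference between normal and unique URLs can be sketched with a small lookup table. UrlRegistrySketch and its GetUrl() signature are assumptions made for illustration and do not reproduce the g_url_api->GetURL() interface; std::shared_ptr stands in for the module's own reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for URL_Rep.
struct RepSketch {
    std::string name;
    bool unique;
};

// Hypothetical registry: normal URLs are shared by name, unique URLs are not.
class UrlRegistrySketch {
public:
    std::shared_ptr<RepSketch> GetUrl(const std::string& name, bool unique) {
        assert(!name.empty());
        if (unique) {
            // Never registered, so later lookups cannot find it.
            return std::make_shared<RepSketch>(RepSketch{name, true});
        }
        std::shared_ptr<RepSketch>& slot = known_[name];
        if (!slot) slot = std::make_shared<RepSketch>(RepSketch{name, false});
        return slot;
    }

    std::size_t KnownCount() const { return known_.size(); }

private:
    std::map<std::string, std::shared_ptr<RepSketch>> known_;
};
```

Two normal lookups of the same name return the same object, while each unique request yields a fresh object that disappears when its last reference is dropped.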

Start Loading a URL

There are several functions available to load URLs.

The preferred function is URL::LoadDocument, which will perform all necessary cache validation checks before deciding how to load, based on a caller specified policy. The other functions are now deprecated.

To resume a download, URL::ResumeLoad is used. This function will, if possible, restart the load at the position where loading was previously aborted.

Additionally, it is possible to create URLs whose content is written using the URL::WriteDocumentData() functions, but these should be used only in special cases, such as email and news decoding.

URL load progress

A loading URL can post several messages, which the caller must listen for.

The messages are of the form (msg, par1, par2). par1 is ALWAYS the Id() of the URL posting the message (retrieved by URL::Id()); par2 depends on the message (msg) posted.

MSG_HEADER_LOADED
The URL has received enough information to know what the document is; the document may now decide what to do with it. par2 is non-zero if the URL has been marked URL::KIsFollowed. This is the first message in a successful load.
MSG_URL_DATA_LOADED
This is sent each time more data has been received, as long as the status is URL_LOADING. The last message is sent as the URL changes status to URL_LOADED. This message is also sent by data descriptors while more data is available from the URL and it is no longer loading (the descriptor must be initialized with a message handler). par2 is non-zero if one of the loading clients specified inline loading.
MSG_MULTIPART_RELOAD
This message is sent to inform the document that a new header and body for the same URL will be received shortly, and should replace the current display of the document as soon as it arrives. MSG_INLINE_REPLACE is also sent at the same time. The next successful loading message will be MSG_HEADER_LOADED.
MSG_URL_LOADING_FAILED
An error occurred during load; par2 indicates the error code. The error code ERR_SSL_ERROR_HANDLED means that the error has already been handled, and that loading should be ended silently.
MSG_URL_INLINE_LOADING
Only sent to indicate the end of loading. par2 is a HIWORD/LOWORD combination (16 bits each): the HIWORD is the load status (URL::KLoadStatus) of the URL, and the LOWORD may be an error code. This message will be removed in the near future.
MSG_URL_LOADING_DELAYED
Indicates that the URL will take some time to load, and the document should continue as best as it can without the data being retrieved. It is sent if it takes more than 3 seconds to get a response from the server.
MSG_URL_MOVED
The URL is being redirected to a new location, which is being loaded. The URL::KMovedToURL attribute returns the target URL. par2 is the Id of the redirect target.
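The message list above can be summarized as a dispatch routine. The following is a sketch only: the message names are taken from the list, but their numeric values, the value of ERR_SSL_ERROR_HANDLED, the Action enum and the HandleUrlMessage() function are all assumptions made for this example, not part of the module.

```cpp
#include <cassert>
#include <cstdint>

// Message names from the list above; the numeric values are hypothetical.
enum UrlMessage {
    MSG_HEADER_LOADED,
    MSG_URL_DATA_LOADED,
    MSG_MULTIPART_RELOAD,
    MSG_URL_LOADING_FAILED,
    MSG_URL_INLINE_LOADING,
    MSG_URL_LOADING_DELAYED,
    MSG_URL_MOVED
};

// Hypothetical value; the real constant is defined elsewhere.
const std::uint32_t ERR_SSL_ERROR_HANDLED = 0x1001u;

// Hypothetical reactions a document might choose.
enum class Action {
    Ignore, StartParsing, ReadMoreData, ResetDisplay, ReportError,
    EndSilently, FinishInline, ContinueWithoutData, FollowRedirect
};

// Splits a 32-bit par2 into its two 16-bit halves.
inline std::uint16_t HiWord(std::uint32_t v) { return static_cast<std::uint16_t>(v >> 16); }
inline std::uint16_t LoWord(std::uint32_t v) { return static_cast<std::uint16_t>(v & 0xFFFFu); }

Action HandleUrlMessage(std::uint32_t my_url_id, UrlMessage msg,
                        std::uint32_t par1, std::uint32_t par2) {
    // par1 is always the Id of the posting URL; skip messages for other URLs.
    if (par1 != my_url_id) return Action::Ignore;
    switch (msg) {
    case MSG_HEADER_LOADED:       return Action::StartParsing;
    case MSG_URL_DATA_LOADED:     return Action::ReadMoreData;
    case MSG_MULTIPART_RELOAD:    return Action::ResetDisplay;
    case MSG_URL_LOADING_FAILED:
        return par2 == ERR_SSL_ERROR_HANDLED ? Action::EndSilently : Action::ReportError;
    case MSG_URL_INLINE_LOADING:  return Action::FinishInline;  // par2: HiWord = status, LoWord = error
    case MSG_URL_LOADING_DELAYED: return Action::ContinueWithoutData;
    case MSG_URL_MOVED:           return Action::FollowRedirect;  // par2: Id of the target
    }
    return Action::Ignore;
}
```

A handler for MSG_URL_INLINE_LOADING would then read the load status with HiWord(par2) and the possible error code with LoWord(par2).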

Retrieving data from the URL

The data from a URL are retrieved by requesting a URL_DataDescriptor object from the URL by calling URL::GetDescriptor. The data can be retrieved in raw binary form, with the content-encoding removed (in case they are compressed), or in UTF-16 form (converted from the document's original encoding).

Data descriptors can be message driven (where messages are posted if there are more data), or polling based (where it is the caller's responsibility to retrieve all pending data).
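The polling style can be sketched as a loop that keeps asking for data until the descriptor reports that nothing more is pending. DescriptorSketch and its methods are a hypothetical model written for this example; the actual URL_DataDescriptor interface should be taken from the Doxygen documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical model of a data descriptor exposing a bounded buffer window.
class DescriptorSketch {
public:
    DescriptorSketch(std::string source, std::size_t window)
        : source_(std::move(source)), window_(window) {
        assert(window_ > 0);
    }

    // Fills the buffer with the next unread bytes and returns how many there are.
    // `more` is set to true when further data remains beyond this buffer.
    std::size_t RetrieveData(bool& more) {
        const std::size_t available = source_.size() - pos_;
        const std::size_t n = std::min(window_, available);
        buffer_ = source_.substr(pos_, n);
        more = available > n;
        return n;
    }

    const std::string& Buffer() const { return buffer_; }

    // Marks the first n buffered bytes as processed.
    void ConsumeData(std::size_t n) {
        assert(n <= buffer_.size());
        pos_ += n;
        buffer_.erase(0, n);
    }

private:
    std::string source_;
    std::size_t window_;
    std::size_t pos_ = 0;
    std::string buffer_;
};

// Polling loop: the caller is responsible for draining all pending data.
std::string DrainAll(DescriptorSketch& desc) {
    std::string out;
    bool more = true;
    while (more) {
        const std::size_t n = desc.RetrieveData(more);
        out.append(desc.Buffer(), 0, n);
        desc.ConsumeData(n);
    }
    return out;
}
```

In the message-driven style the same retrieve-and-consume step would run once per received message instead of inside a loop.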

Retrieving information about the URL

A number of information attributes exist inside a URL object. While some of these can be retrieved by dedicated functions, the primary API for retrieving and updating the attributes is the GetAttribute/GetAttributeL and SetAttribute/SetAttributeL functions.

These functions take as an argument an enumerated value that selects the actual attribute, and either return the value or update it, depending on the function.

The enumerated values are grouped into lists depending on the type of data the corresponding attribute is represented as: unsigned integers (including enums and signed integers, which must be typecast), strings, URLs and general "void" pointers (which must be typecast). Strings can be retrieved as const strings (with a few exceptions) or copied into a separate OpString* object.

The enums
The attribute enums are defined as part of the URL class, and are of the form URL::KNameOfAttribute. Most of the names follow the same form as the previous functions, so GetNameOfAttribute(); can be replaced by [(typecast)] GetAttribute(URL::KNameOfAttribute);. The enums are arranged in groups covering unsigned integers, strings, URLs and arbitrarily typed void pointers.
GetAttribute
These functions return the value of the selected attribute. Strings are returned as const OpStringC* objects that may be accessed directly. The results from the version returning unsigned ints must be typecast to the appropriate type of the attribute before being used. (This approach was chosen to avoid having to create too many implementations of this function.) Unless specified otherwise, the default return values are 0 (for integers), an empty string (for string attributes), an empty URL (for URL attributes) and NULL (for void pointer attributes). Optionally, these functions can follow redirects.
GetAttributeL
This function retrieves the value of the selected string attribute, and copies it into the provided string object. In case of OOM and other problems it may LEAVE, but the default return value is an empty string. Some attributes that have a "_L" suffix are ONLY retrievable via this API function. Optionally, this function can follow redirects.
SetAttributeL
These functions are used to update the value of the selected attribute. As they may have to construct new objects inside the URL these functions may LEAVE, as they may do if any other allocation needed fails. Note that some attributes cannot be set through this API, and that no failure notice will be given in these cases. These functions ONLY act on the current object.
SetAttribute
Use of these functions is not recommended; they are provided so that functions that are not yet able to LEAVE can use the SetAttributeL API without adding an enormous amount of TRAPs to the code. These functions will TRAP any LEAVEs and return them as OP_STATUS values to the caller.
URL url;

URLStatus status1 = (URLStatus) url.GetAttribute(URL::KLoadStatus); // Get the load status of the URL
URLStatus status2 = (URLStatus) url.GetAttribute(URL::KLoadStatus, TRUE); // Follow the redirect chain and get the load status of the URL at the end of the chain

OpStringC8 name1 = url.GetAttribute(URL::KName_Escaped); // Access the %XX escaped name of the URL as a const string
OpStringC name2 = url.GetAttribute(URL::KUniNamed); // Access the UTF-8 deescaped name of the URL as a const string

OpString8 name3;
url.GetAttributeL(URL::KName_Escaped, name3); // Retrieve the %XX escaped name of the URL and store it in the "name3" string object

OpString name4;
url.GetAttributeL(URL::KUniNamed, name4); // Retrieve the UTF-8 deescaped name of the URL and store it in the "name4" string object

url.SetAttributeL(URL::KLoadStatus, URL_LOADED); // Set the load status of the URL

url.SetAttributeL(URL::KMIME_ForceContentType, "text/plain; charset=iso-8859-1"); // Force the MIME-type (and in this case, charset), of the URL
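To show why one GetAttribute name can serve several value types, here is a self-contained sketch in which the enum group of the argument selects the overload. AttrStoreSketch, its enum members and LoadStatusSketch are hypothetical and only mimic the pattern; they are not the URL class or its attribute lists.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical load status values, stored as unsigned integers.
enum LoadStatusSketch { SKETCH_UNLOADED = 0, SKETCH_LOADING = 1, SKETCH_LOADED = 2 };

class AttrStoreSketch {
public:
    // One enum per value type; the enum type picks the overload.
    enum UintAttr { KLoadStatusSketch, KContentSizeSketch };
    enum StringAttr { KNameSketch, KMimeTypeSketch };

    // Default for integer attributes is 0.
    unsigned GetAttribute(UintAttr which) const {
        std::map<UintAttr, unsigned>::const_iterator it = uints_.find(which);
        return it == uints_.end() ? 0u : it->second;
    }

    // Default for string attributes is an empty string.
    const std::string& GetAttribute(StringAttr which) const {
        static const std::string empty;
        std::map<StringAttr, std::string>::const_iterator it = strings_.find(which);
        return it == strings_.end() ? empty : it->second;
    }

    void SetAttribute(UintAttr which, unsigned value) { uints_[which] = value; }
    void SetAttribute(StringAttr which, const std::string& value) { strings_[which] = value; }

private:
    std::map<UintAttr, unsigned> uints_;
    std::map<StringAttr, std::string> strings_;
};

// The unsigned result must be cast back to the attribute's real type.
LoadStatusSketch StatusOf(const AttrStoreSketch& store) {
    const unsigned raw = store.GetAttribute(AttrStoreSketch::KLoadStatusSketch);
    assert(raw <= static_cast<unsigned>(SKETCH_LOADED));
    return static_cast<LoadStatusSketch>(raw);
}
```

The cast in StatusOf() corresponds to the (URLStatus) typecasts in the sample above: a single unsigned overload covers every integer-like attribute at the cost of a cast at the call site.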

Other APIs

g_url_api

g_url_api is primarily used to construct new URLs, but also contains some cookie-related and some UI action functions.

ServerName

ServerName objects contain information about:

For each unique server name there exists only a single ServerName object. The urlManager object maintains the servername database. URLs contain pointers to ServerName objects, so to find out whether two URLs are from the same server it suffices to compare the ServerName pointers of the two objects.
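The one-object-per-server-name rule is what makes pointer comparison sufficient. The sketch below illustrates this with a small interning table; ServerNameDbSketch, ServerNameSketch and UrlSketch are hypothetical names, and no host-name normalization is attempted here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a ServerName object.
struct ServerNameSketch {
    std::string host;
};

// Hypothetical database: hands out exactly one object per distinct host string.
class ServerNameDbSketch {
public:
    ServerNameSketch* GetServerName(const std::string& host) {
        assert(!host.empty());
        std::unique_ptr<ServerNameSketch>& slot = by_host_[host];
        if (!slot) slot.reset(new ServerNameSketch{host});
        return slot.get();
    }

private:
    std::map<std::string, std::unique_ptr<ServerNameSketch>> by_host_;
};

// Hypothetical URL record that points at its shared server entry.
struct UrlSketch {
    std::string path;
    ServerNameSketch* server;
};

// Same server if and only if both URLs point at the same entry.
bool SameServer(const UrlSketch& a, const UrlSketch& b) {
    return a.server == b.server;
}
```

Because the table never creates a second object for a host it already knows, SameServer() needs no string comparison.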

Implementation description

API documentation generated by Doxygen contains information about the internal organization of the module.

The URL_Rep class

General layout of the URL_Rep and related classes (The list is abbreviated):

URL_Name
The name of the URL (including some flags), split into components
Flags
Flags used by all URLs even if they are not loaded
last_visited
When was the URL last visited?
reference count
Number of URL objects referencing this object
used
Number of documents or other customers that have locked access to the data
storage
Contains all data relevant to a loaded document, such as what the data is, and the actual data.
mh_list
The list of documents waiting for the data this URL is loading
info
Flags either specifying information needed to load the document, or the result.
local_time_loaded
When did this load operation start?
Content-size
Content-Type
Charset
Secure protocol information
loading
The object handling the actual load operation
storage
The object maintaining the actual file or RAM storage of the loaded document. Some classes also process the document and break it up into component pieces.
Protocol specific data
These structures contain data used by the various protocols (e.g. HTTP, MIME, FTP).

URL_Manager

The URL manager maintains the following

Footprint

The module is fairly large, as it requires a lot of functionality.

Various features can be enabled or disabled, either through feature defines or specific defines; one example is the HTTP stack.

Due to the requirements from various modules (including the url module) and platforms, it is very difficult to reduce the footprint.

Dynamic memory use and OOM handling

OOM policies

Most of the module's internal functions handle OOM locally: they signal an OOM by raising the OOM signal in the memory manager and abort the current action. If appropriate, a message is posted to the document.

However, much of the public API is now LEAVE based, and in those cases the caller must TRAP errors and handle them. Some internal functions will also LEAVE, but these are TRAPed internally.

Who handles OOM?

In the case of LEAVE functions, the caller must TRAP the errors and handle the OOM situations. The internal functions usually abort their operation with an error message and a raised status flag, which must be handled either by the caller or by the document.
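The relationship between a LEAVE-based function and its status-returning wrapper can be sketched with C++ exceptions standing in for LEAVE and TRAP. This is an analogy only: the real LEAVE/TRAP mechanism is macro based, and LeaveSketch, StatusSketch and NameHolderSketch are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical status codes, loosely modelled on OP_STATUS.
enum StatusSketch { SKETCH_OK = 0, SKETCH_ERR_NO_MEMORY = -1 };

// Hypothetical object thrown to model a LEAVE.
struct LeaveSketch {
    StatusSketch code;
};

class NameHolderSketch {
public:
    // "L" variant: reports failure by leaving (modelled here as a throw).
    // simulate_oom exists only so the failure path can be exercised.
    void SetNameL(const std::string& name, bool simulate_oom) {
        assert(!name.empty());
        if (simulate_oom) throw LeaveSketch{SKETCH_ERR_NO_MEMORY};
        name_ = name;
    }

    // Non-leaving variant: traps the leave and returns it as a status value.
    StatusSketch SetName(const std::string& name, bool simulate_oom) {
        try {
            SetNameL(name, simulate_oom);
        } catch (const LeaveSketch& leave) {
            return leave.code;
        }
        return SKETCH_OK;
    }

    const std::string& Name() const { return name_; }

private:
    std::string name_;
};
```

Callers of the "L" variant must be prepared to handle the failure themselves, whereas callers of the wrapper only check the returned status.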

Flow

Much of the module is message callback based, and these functions are not able to report OOM situations directly to the documents or UI. In these cases the current operation will be terminated, and error messages sent.

Much of the external API is based on direct calls, but some classes do use virtual functions. In many cases these are LEAVE based, and callers must TRAP them and handle them appropriately.

Heap memory usage

NOTE: these numbers tend to be estimates, not actual measurements.

An unloaded URL will usually consume approximately 40 bytes, plus the URL's path segment.

Loaded URL_Reps will probably, on average, use 300-400 bytes, depending on the length of the URL's name. URL_Reps that use the RAM cache will additionally store the entire document in RAM.

ServerName objects will usually consume less than 200 bytes per unique servername, but actual consumption depends on servername size, and to what extent authentication and secure session information is used (session information can consume at least 1 KB per port, depending on the certificate and encryption key sizes).

Cookies can consume up to 4 KB per cookie, but should usually average less than 300 bytes.

Sequence splitter and upload elements are usually not kept for long, and their allocated size depends on the number of elements and the actual body size.

Stack memory usage

Large objects are usually allocated on the heap. In some cases sizeable objects are placed on the stack, but only for short periods.

In most cases stack consumption should be less than 300 bytes.

Static memory usage

The module uses several global pointers, and several static members. These are, for the most part, pointers:

Several of these are buffers that (alongside buffers in URL_Manager) will grow as longer URLs are encountered.

Most of the allocated objects are deleted by URL_Manager or URL_API on exit.

In addition, a number of compiled const arrays exist. These may be automatically converted to allocated arrays on some platforms.

Caching and freeing memory

There are calls to free unused resources on URL_Manager and URLs that can be called when needed by the memory manager.

Additionally, the URL_Manager, either directly or through the Cache_Manager (from the cache module), keeps the number of URLs, ServerNames, connections, cookies, etc. within the specified total number and size limits.

Freeing memory on exit

URL_Manager and URL_API destroy all allocated URLs, connections, etc.

Temp buffers
URL_Manager and URL_Name maintain several temporary buffers that are used internally.

Additionally, several places use the memory manager's temp buffers, primarily TempBuf2 and TempBuf2k.

There is no check for external use of these buffers, and the different buffers should prevent internal collisions, unless implementations also use them in calls to/from these functions.

Memory tuning

At present there are no opportunities to tune memory use.

Tests

Selftests, but they do not check memory usage.

Coverage

Selftests, ordinary surfing.

Design choices

URL_Rep, URL_DataStorage and several other classes are independent objects owned by other objects to reduce the use of unnecessarily large objects. Common information about scheme/servername/port is stored in a single database linked from the URLs.

Improvements

Possible improvements

See also