Regular Expression Engine

$Id$

Overview

The regexp module provides a regular expression compiler and a matcher engine. Both correspond to regular expressions as defined by the ECMAScript specification, ES-262 3rd edition (download here), plus some extensions (documented below).

Compatibility with the ECMAScript specification is a nonnegotiable requirement: the ECMAScript engine uses the regexp module for its regular expression work.

The current version, fsa_1, was extracted from the linear_b_2 and futhark ECMAScript engines. Its focus is on correctness, footprint, and low memory use, and not so much on performance.

Use cases

There are several use cases for this module:

ECMAScript regular expression matching
Regular expressions are used extensively in scripting. To support scripting, the regexp module must support the standard, but must also support common extensions, and must have good performance both for compiling and executing regular expressions.
Regular expression matching in M2
No information is available right now about how M2 uses the regexp engine
Regular expression matching in Webforms
No information is available right now about how Webforms uses the regexp engine.

APIs

The engine exports a convenient API for use by other modules that need to do basic regular expression matching, and a less convenient API for use by the ECMAScript engine.

Generated API documentation, if you've run Doxygen on this module.

Features

There are no features used or defined by this module.

Subset engine

For footprint-constrained systems the module can be configured to use a subset engine. This accepts a "useful" subset of the regular expression language, and saves a few KB of code. However, if you use this you will run into sites that don't work, and perhaps other regressions. The subset engine is not very well tested.

The subset engine is controlled by TWEAK_REGEXP_SUBSET; see the file module.tweaks for more information.

Extensions

A brief summary of extensions:

For more information search the module sources for the string REGEXP_STRICT (this setting can be controlled by TWEAK_REGEXP_STRICT). Each extension is documented in the source.

Memory use

Introduction

There are two APIs, the OpRegExp API and the RegExp API. The former API is a simple wrapper around the latter. It has no interfaces that Leave, and storage management is by simple deletion of the OpRegExp object. For this reason, only the RegExp API will be covered in the following. Most points apply to the OpRegExp API as well.

Flow

Regular expressions (RegExp objects) are simple abstract data types. One is created by invoking new RegExp and then calling the Init() method on the object. RegExp objects can be shared and contain a reference counter, initially 1. When you want to delete it, just call DecRef() on the RegExp object. If the counter is 0, the object will be deleted immediately.

If the call to Init() fails for any reason then the RegExp cannot be used. Delete it by calling DecRef().

If Init() succeeds, then the RegExp represents a compiled regular expression and it has some local heap-allocated data. These data are deleted when the RegExp object is deleted.

A RegExp object can then be used for multiple matchings. Some heap memory is allocated during matching for working storage, but is always deallocated before the matcher returns.

OOM policies

TRAP/LEAVE is used extensively for signalling OOM. This is convenient because most memory is allocated from a region data structure and not directly from the heap, so cleaning up the allocated memory after an OOM event is easy and cheap.

(Client code that does not want to handle exceptions can use the OpRegExp API instead.)

Who is handling OOM?

OOM events are always returned from calls to the APIs as exceptions and must be handled on the outside, but the module cleans up after itself on the way out and there is no need to call any cleanup actions inside the module.

Client code should be sure to delete regular expressions whose construction failed (by decrementing their reference counters on the RegExp object). Regular expression matchings that fail due to OOM will not affect the RegExp objects, and the object may be used for subsequent matchings.

Heap memory usage

Heap memory is used during compilation and during matching. The amount allocated during compilation is roughly proportional to the size, measured in characters, of the regular expression. The amount allocated during matching is in the worst case proportional to the length of the input. Some optimizations are possible to make the worst case less likely (the intensely curious are encouraged to read the to-do list).

FIXME: more here

Stack memory usage

Stack memory is used primarily during compilation. The amount used is proportional to the nesting level of the regular expression: parenthesized subexpression are parsed by a new activation of the parser. Most regular expressions are very shallow, so in practice this is not an issue. It is however possible to construct regular expressions that will blow any stack on any computer. The easiest way to fix this, should it be necessary, is to put a static bound on the number of nesting levels.

Static memory usage

The module has no global variables at this time and has no static data. No data are cached on behalf of the module, and there are no data to free on exit.

Temporary buffers

The module does not use the tempbuffers from the memory manager.

Memory tuning

Memory usage (and performance) can be tuned by adjusting some of the obscure parameters in module.tweaks, but this should not generally be necessary. See comments in module.tweaks.

Tests

Right now there are no selftests that test for leaks or OOM. However, some tests that can be used to test memory behavior are:

I plan to add some of these, and leak checks, to the selftests.

Coverage

Coverage can be had by running the ECMAScript test suite (even just the RegExp tests). Eventually many of these will be moved into the regexp module and coverage can be had by running selftest.

Design choices

By allocating most objects from regions, we take the pressure off malloc() and reduce the storage needs (no object headers are needed). Also, this simplifies memory management and virtually guarantees that leaks will not occur.

The matcher engine uses some auxiliary data structures. These too are allocated from regions, but in addition the objects in these regions are managed by a simple garbage collector (because liveness is very hard to track accurately). See the implementation documentation for more information.

Suggestions and improvements

A number of issues are mentioned in the to-do list.

Implementation

The implementation is described in general terms in a large comment block at the beginning of the file regexp_advanced_api.cpp.