The regexp module provides a regular expression compiler and a matcher engine. Both correspond to regular expressions as defined by the ECMAScript specification, ES-262 3rd edition (download here), plus some extensions (documented below).
Compatibility with the ECMAScript specification is a nonnegotiable requirement: the ECMAScript engine uses the regexp module for its regular expression work.
The current version, fsa_1, was extracted from the linear_b_2 and futhark ECMAScript engines. Its focus is on correctness, footprint, and low memory use, and not so much on performance.
There are several use cases for this module:
The engine exports a convenient API for use by other modules that need to do basic regular expression matching, and a less convenient API for use by the ECMAScript engine.
Generated API documentation, if you've run Doxygen on this module.
There are no features used or defined by this module.
For footprint-constrained systems the module can be configured to use a subset engine. This accepts a "useful" subset of the regular expression language, and saves a few KB of code. However, if you use this you will run into sites that don't work, and perhaps other regressions. The subset engine is not very well tested.
The subset engine is controlled by TWEAK_REGEXP_SUBSET; see the file module.tweaks for more information.
A brief summary of extensions:
For more information search the module sources for the string REGEXP_STRICT (this setting can be controlled by TWEAK_REGEXP_STRICT). Each extension is documented in the source.
There are two APIs, the OpRegExp API and the
RegExp API. The former API is a simple wrapper around
the latter. It has no interfaces that Leave, and storage management
is by simple deletion of the OpRegExp object. For this
reason, only the RegExp API will be covered in the
following. Most points apply to the OpRegExp API as
well.
Regular expressions (RegExp objects) are simple
abstract data types. One is created by invoking
new RegExp and then calling the Init()
method on the object. RegExp objects can be shared and
contain a reference counter, initially 1. When you want to delete it,
just call DecRef() on the RegExp object. If
the counter is 0, the object will be deleted immediately.
If the call to Init() fails for any reason then the
RegExp cannot be used. Delete it by calling
DecRef().
If Init() succeeds, then the RegExp
represents a compiled regular expression and it has some local
heap-allocated data. These data are deleted when the
RegExp object is deleted.
A RegExp object can then be used for multiple
matchings. Some heap memory is allocated during matching for working
storage, but is always deallocated before the matcher returns.
TRAP/LEAVE is used extensively for signalling OOM.
This is convenient because most memory is allocated from a region data
structure and not directly from the heap, so cleaning up the allocated
memory after an OOM event is easy and cheap.
(Client code that does not want to handle exceptions can use the
OpRegExp API instead.)
OOM events are always returned from calls to the APIs as exceptions and must be handled on the outside, but the module cleans up after itself on the way out and there is no need to call any cleanup actions inside the module.
Client code should be sure to delete regular expressions whose
construction failed (by decrementing their reference counters on the
RegExp object). Regular expression matchings that fail
due to OOM will not affect the RegExp objects, and the
object may be used for subsequent matchings.
Heap memory is used during compilation and during matching. The amount allocated during compilation is roughly proportional to the size, measured in characters, of the regular expression. The amount allocated during matching is in the worst case proportional to the length of the input. Some optimizations are possible to make the worst case less likely (the intensely curious are encouraged to read the to-do list).
FIXME: more here
Stack memory is used primarily during compilation. The amount used is proportional to the nesting level of the regular expression: parenthesized subexpression are parsed by a new activation of the parser. Most regular expressions are very shallow, so in practice this is not an issue. It is however possible to construct regular expressions that will blow any stack on any computer. The easiest way to fix this, should it be necessary, is to put a static bound on the number of nesting levels.
The module has no global variables at this time and has no static data. No data are cached on behalf of the module, and there are no data to free on exit.
The module does not use the tempbuffers from the memory manager.
Memory usage (and performance) can be tuned by adjusting some of the obscure parameters in module.tweaks, but this should not generally be necessary. See comments in module.tweaks.
Right now there are no selftests that test for leaks or OOM. However, some tests that can be used to test memory behavior are:
I plan to add some of these, and leak checks, to the selftests.
Coverage can be had by running the ECMAScript test suite (even just
the RegExp tests). Eventually many of these will be
moved into the regexp module and coverage can be had by running
selftest.
By allocating most objects from regions, we take the pressure off
malloc() and reduce the storage needs (no object headers
are needed). Also, this simplifies memory management and virtually
guarantees that leaks will not occur.
The matcher engine uses some auxiliary data structures. These too are allocated from regions, but in addition the objects in these regions are managed by a simple garbage collector (because liveness is very hard to track accurately). See the implementation documentation for more information.
A number of issues are mentioned in the to-do list.
The implementation is described in general terms in a large comment block at the beginning of the file regexp_advanced_api.cpp.