Grand unified regexp to-do list
$Id$


General to-do:

  - General coding standards compliance work

  - Many test cases exist as part of the ECMAScript test suite and should
    probably be moved into selftests in the regexp module, but this is
	not urgent

  - Leak checking selftests

  - Should we be compatible with Netscape's obnoxious octal escapes? See
    comment in src/regexp_parser.cpp, in RegExpParser::AtomEscapeNode

  - Make the matcher engine preemptable and make an API available to
    exploit that.

  - Add an API to RegExp::Exec to not search for a match, but to try matching
    at the specific location

  - Add an API to RegExp to set the ignore_case and multi_line flags, or make
    these required parameters to RegExp::Exec

  - The ECMAScript engine would like support for \<newline> in regular
    expressions, for the sake of syntactic consistency.  The API for this
    would need to return the number of \<newline> pairs consumed, since
    the ES lexer needs to keep track of line number.

  - The ECMAScript engine currently contains a workaround in
    ES_RegExp::CompileRegExp to handle regular expressions that
    contain literal NULs.  It would probably be better if the parser
    in the regexp module handled NULs.  (That requires a
    char-array-and-length representation of the pattern argument to
    Init().)

  - The ECMAScript engine currently contains a workaround in
    Lexer::LexRegExp for the regexp parser only accepting
    NUL-terminated pattern strings.  Things would generally be more
    efficient if the parser could instead take a char-array-and-length
    representation.


Unrecorded bugs:

  - the ECMAScript test suite hangs the tiny regex engine (or at least runs
    for a really long time).  This is a problem with the regression tests
	for 135402 and/or 141775.


BTS: 

  - Bug 167956 is related to regexp performance (though unknown if the 
    performance problem is in the matcher engine or in the code that
	uses the matcher engine.)

  - Bug 116878 wants an extension to legal syntax

  - Bug 85847 wants an extension to legal syntax


Subset (tiny) regexp engine:

  - It is unclear whether the execution characteristics of the tiny regexp engine
    are desirable.  In particular, I suspect execution stack depth can be significant.
    That would make it a worse choice than the normal engine!

  - One interesting twist is to use the tiny-regex parser with the standard engine,
    since the standard engine is correct (and small).

  - Need better documentation of the implications of the subset engine: what is lost?
    For example, how much error detection do we get?  (Probably not a lot.)


Performance:

  - Performance generally seems to be a sore point, but again it's unclear whether
    the performance bottleneck is in the matcher engine (or the compiler, for that
    matter) or in the ES_RegExp client code in the ECMAScript engine.

  - E-mail discussion regarding keeping the choice point stack smaller and 
    maybe improving performance at the same time:

	From: Jens Lindstrm <jl@opera.com>
	To: "Lars T Hansen" <lth@opera.com>
	Subject: Regexp optimization
	Date: Fri, 09 Jul 2004 13:03:52 +0200

	First of all: I may be mistaken, I don't claim to know your regexp engine (the
	one on linear_2, that is).

	But I think that the regexp "a*" matched against a long string of a's uses as
	many RE_Frame objects as there are a's.  In my regexp engine in my ecmascript
	engine I have a fairly simple optimization that avoids this in simple cases.
	I too have a structure that looks like RE_Frame which I have a stack of to use
	when backtracking.  It looks like this:

	   class Choice
	   {
	   public:
		 unsigned address;
		 unsigned index;
		 unsigned count;
		 unsigned short additional;
		 bool mark;

		 Choice *previous;
	   };

	The "address" field is a regexp bytecode address.  The "index" field is where
	in the input string the speculative match started.  The next two fields are
	the optimization.  "Count" is the number of times an equaly long substring
	was matched and "additional" is the length of that substring.  The choice
	object is "reused" by increasing the "count" field if a new choice point is
	pushed with the same address and the same length (actually, the second choice
	point pushed with the same address determines the length, the third pushed
	reuses if it too has that length).

	I don't remember what "mark" is, but it prevents the choice object from being
	reused.  I'm sure there's a good reason for that.  Probably has something to
	do with captures.  :-)

	I would guess that my "address" is roughly the same as your "op" and "matcher"
	and that my "index" might be the same as your "endindex".

	=======

	From: Lars T Hansen <lth@opera.com>
	To: Jens Lindstrm <jl@opera.com>
	Cc: "Lars T Hansen" <lth@opera.com>
	Subject: Re: Regexp optimization
	Date: Fri, 9 Jul 2004 13:17:59 +0200


	Jens Lindstrm writes:
	 > But I think that the regexp "a*" matched against a long string of a's uses as
	 > many RE_Frame objects as there are a's.

	It does, since each match is a choice point for backtracking.

	 > In my regexp engine in my ecmascript
	 > engine I have a fairly simple optimization that avoids this in simple cases.

	That might be nice.

	 > I too have a structure that looks like RE_Frame which I have a stack of to use
	 > when backtracking.  It looks like this: 
	 >
	 > ...

	I think that your idea is basically right.  It would be nice to
	reduce the number of frames used in cases like this, which are
	obviously typical.  (Captures make me queasy, though, they are just
	plain difficult.)  I'll think about this some.

	 > By the way: I'm on vacation monday and tuesday next week.

	OK.  I work next week, but then I'm on vacation for the three after
	that.  Will you be around during that time or shall we have to find a
	third DOM/JS person to pick up the pieces in our absence?

	=======

	From: Jens Lindstrm <jl@opera.com>
	To: "Lars T Hansen" <lth@opera.com>
	Subject: Re: Regexp optimization
	Date: Fri, 09 Jul 2004 13:53:38 +0200

	On Fri, 9 Jul 2004 13:17:59 +0200, Lars T Hansen <lth@opera.com> wrote:

	>  > I don't remember what "mark" is, but it prevents the choice object from being
	>  > reused.  I'm sure there's a good reason for that.  Probably has something to
	>  > do with captures.  :-)

	Actually, it had nothing to do with captures, it had something to do with
	lookahead.  I push a marked choice when I start a lookahead and pop it when
	when I end one, and popping a marked choice pops all other choices pushed
	after it.  I'm sure it is very clever, but I don't remember much about it.  :-)

  - optimize character sets (see below)

  - many optimizations mentioned in Friedl's book

  - AND flattening; OR flattening -- this limits stack depth and improves debuggability
    (essentially this is a limited form of inlining)

  - Matcher: worry about performance generally.


CHARACTER SET OPTIMIZATION:

It would be nice to preprocess class unions, so that
    [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ],
which is now executed as a linear search of 56 nodes, would
be as fast as
    [a-zA-Z].
To do this we need to keep some info in each class node about
what the node is.  A single postpass is probably best.

Note that optimizations will probably require us to worry about
NULs in strings more than now.

Character sets that are not ranges can be represented in various
ways: as strings that are searched linearly (this is done now);
as strings that are sorted and searched by binary search; as
bitmaps; and probably in other ways.  Then searching speed can
be improved if the sets are large.

It is hard to do case-insensitive matching on ranges quickly,
so we probably need to play a game of splitting ranges into
ranges that do not need case conversion.
