Thomas Castleman's website.
In May, I started working on a small regular expression engine.
This implementation is naive in that it constructs and simulates nondeterministic finite automata (which are equivalent in power to regular expressions themselves) in order to perform matching and related text operations. Most modern engines convert to deterministic finite automata or employ backtracking, neither of which this implementation does.
My approach was largely inspired by reading Michael Sipser's Introduction to the Theory of Computation, which I highly recommend (especially for its treatment of regular/context-free languages).
The source can be found here.
The full syntax supported by this engine is detailed here.
Input strings are parsed using recursive descent, so sub-expressions are parsed recursively and are used to construct larger outer expressions.
The following grammar was used as the basis for parsing:
<regex>          := <term> '|' <regex> | <term>
<term>           := { <factor> }
<factor>         := <base> <unary-op> | <base>
<unary-op>       := '*' | '+' | '?' | <quantifier>
<base>           := <char> | <char-class> | '.' | '\' <char> | '(' <regex> ')' | '[' <charset-term> ']'
<quantifier>     := '{' <number> '}' | '{' <number> ',' <number> '}' | '{' <number> ',}' | '{,' <number> '}'
<charset-term>   := <charset-factor> { <charset-factor> }
<charset-factor> := <char> '-' <char> | <char> | <char-class>
<char-class>     := '\d' | '\w' | '\s'
I used the grammar from Matt Might's blog post as a starting point, and extended it to include extra syntactic support for character sets, more unary operators, etc.
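To sketch how recursive descent maps onto this grammar, here is a minimal parser for just the union/sequence/star fragment. The function names (parse_regex, parse_term, etc.) and the tuple-shaped parse tree are my illustration, not the engine's actual code:

```python
# Minimal recursive-descent sketch for a fragment of the grammar above.
# Each grammar production gets its own parse function.

def parse(pattern):
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def eat(c):
        nonlocal pos
        assert peek() == c, f"expected {c!r} at position {pos}"
        pos += 1

    def parse_regex():              # <regex> := <term> '|' <regex> | <term>
        term = parse_term()
        if peek() == '|':
            eat('|')
            return ('Union', term, parse_regex())
        return term

    def parse_term():               # <term> := { <factor> }
        factors = []
        while peek() is not None and peek() not in '|)':
            factors.append(parse_factor())
        return ('Sequence', factors)

    def parse_factor():             # <factor> := <base> <unary-op> | <base>
        base = parse_base()
        if peek() == '*':
            eat('*')
            return ('Star', base)
        return base

    def parse_base():               # <base> := <char> | '(' <regex> ')'
        if peek() == '(':
            eat('(')
            inner = parse_regex()
            eat(')')
            return inner
        c = peek()
        eat(c)
        return ('Character', c)

    tree = parse_regex()
    assert pos == len(pattern), "unconsumed input"
    return tree
```

Because each nonterminal becomes a function that calls the functions for its sub-productions, extending the parser to quantifiers and character sets is a matter of adding cases to parse_factor and parse_base.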
Parsing yields a tree consisting of 17 possible types of tokens. We reduce this to a set of 6 core tokens: Character, Dot, Empty, Union, Sequence, and Star. We can express the other 11 "surface" tokens in terms of the core 6 as follows (illustrated informally through example):
a+      ⟶  aa*
a?      ⟶  (|a)
[abc]   ⟶  a|b|c
[a-e]   ⟶  a|b|c|d|e
\d      ⟶  [0-9]          ⟶  0|1|2|...|9
\w      ⟶  [A-Za-z0-9_]   ⟶  A|...|Z|a|...|z|0|...|9|_
\s      ⟶  [ \t\r\n\f]    ⟶  (space)|\t|\r|\n|\f
a{3}    ⟶  aaa
a{2,5}  ⟶  aa(|a)(|a)(|a)
a{2,}   ⟶  aaa*
a{,4}   ⟶  (|a)(|a)(|a)(|a)
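In code, this desugaring is a recursive rewrite over the tree. Here is a sketch covering a few of the rules above; the tuple-shaped tokens and a Range node carrying the quantifier bounds are assumptions for illustration, not the engine's actual types:

```python
# Sketch of desugaring surface tokens into the six core tokens.

def seq(parts):
    return ('Sequence', list(parts))

def desugar(node):
    kind = node[0]
    if kind == 'Plus':                      # a+  ⟶  aa*
        inner = desugar(node[1])
        return seq([inner, ('Star', inner)])
    if kind == 'Question':                  # a?  ⟶  (|a)
        return ('Union', ('Empty',), desugar(node[1]))
    if kind == 'Range':                     # a{2,5}  ⟶  aa(|a)(|a)(|a)
        inner, lo, hi = desugar(node[1]), node[2], node[3]
        required = [inner] * lo
        optional = [('Union', ('Empty',), inner)] * (hi - lo)
        return seq(required + optional)
    return node                             # core tokens pass through

print(desugar(('Plus', ('Character', 'a'))))
# ('Sequence', [('Character', 'a'), ('Star', ('Character', 'a'))])
```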
After desugaring, simplifications are applied to the parse tree to eliminate any unnecessary complexity (e.g. Union(Dot, Dot) becomes Dot, and Star(Empty) becomes just Empty).
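A simplification pass like this can also be written as a small recursive rewrite. This is a sketch under the same assumed tuple representation, covering only the two rules mentioned above:

```python
# Sketch of post-desugaring simplification rules.

def simplify(node):
    kind = node[0]
    if kind == 'Union':
        left, right = simplify(node[1]), simplify(node[2])
        if left == right:                   # Union(Dot, Dot) -> Dot
            return left
        return ('Union', left, right)
    if kind == 'Star':
        inner = simplify(node[1])
        if inner == ('Empty',):             # Star(Empty) -> Empty
            return ('Empty',)
        return ('Star', inner)
    return node
```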
At the base level, we can construct simple NFAs that recognize a single character literal, or the empty string.
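In code, these two base cases might look like the following sketch. The dict-based NFA representation (start state, accept set, transition map) is my own assumption for illustration, not the engine's:

```python
import itertools

# Generator of fresh state ids, so each constructed NFA gets unique states.
fresh = itertools.count()

def char_nfa(c):
    # Two states: start transitions to accept on character c.
    s, t = next(fresh), next(fresh)
    return {'start': s, 'accept': {t}, 'trans': {(s, c): {t}}}

def empty_nfa():
    # One state that is both start and accepting: recognizes only "".
    s = next(fresh)
    return {'start': s, 'accept': {s}, 'trans': {}}
```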
The Dot token is similar to the character literal case, but I chose to handle it at simulation time.
The three regular operations (Union, Sequence, and Star) are handled by constructing NFAs for the subcomponents, and joining/modifying them in particular ways.
Sipser has great diagrams illustrating these, which I recommend checking out.
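The constructions can be sketched with epsilon-transitions, in the style of Sipser's diagrams. As before, the dict-based NFA shape here is an assumed representation, not the engine's actual one:

```python
import itertools

# Fresh ids start at 100 so they can't clash with hand-numbered test states.
fresh = itertools.count(100)
EPS = None  # label used for epsilon-transitions

def add(trans, state, label, dest):
    trans.setdefault((state, label), set()).add(dest)

def union(a, b):
    # New start state with epsilon-arrows into both machines.
    s = next(fresh)
    trans = {**a['trans'], **b['trans']}
    add(trans, s, EPS, a['start'])
    add(trans, s, EPS, b['start'])
    return {'start': s, 'accept': a['accept'] | b['accept'], 'trans': trans}

def sequence(a, b):
    # Epsilon-arrows from a's accept states to b's start state.
    trans = {**a['trans'], **b['trans']}
    for q in a['accept']:
        add(trans, q, EPS, b['start'])
    return {'start': a['start'], 'accept': b['accept'], 'trans': trans}

def star(a):
    # New accepting start state; accept states loop back to a's old start.
    s = next(fresh)
    trans = dict(a['trans'])
    add(trans, s, EPS, a['start'])
    for q in a['accept']:
        add(trans, q, EPS, a['start'])
    return {'start': s, 'accept': a['accept'] | {s}, 'trans': trans}
```

The new start state in star is accepting because a* must match the empty string; the epsilon-arrows from accept states back to the old start allow any number of repetitions.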
Composing these construction tools and the base level NFAs, we get an NFA that recognizes the same language as the original regular expression. We can simulate it by maintaining a set of current states, reading characters from the input stream, and updating this set of states to follow transitions.
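That simulation loop can be sketched as follows, assuming an NFA encoded as a dict of transitions keyed by (state, label), with None as the epsilon label (again my representation, not the engine's):

```python
EPS = None

def closure(states, trans):
    # Expand a state set with everything reachable via epsilon-arrows.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in trans.get((q, EPS), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def simulate(nfa, text):
    current = closure({nfa['start']}, nfa['trans'])
    for ch in text:
        moved = set()
        for q in current:
            moved |= nfa['trans'].get((q, ch), set())
        current = closure(moved, nfa['trans'])
    # Accept iff any current state is accepting after the whole input.
    return bool(current & nfa['accept'])

# Hand-built NFA for (a|b): 0 -eps-> {1,3}, 1 -a-> 2, 3 -b-> 4
nfa = {'start': 0, 'accept': {2, 4},
       'trans': {(0, EPS): {1, 3}, (1, 'a'): {2}, (3, 'b'): {4}}}
print(simulate(nfa, 'a'))   # True
print(simulate(nfa, 'c'))   # False
```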
To see this process in action, consider the example input expression (a|b)c?.

Parsing yields:

Sequence(
  Union(
    Character('a'),
    Character('b')),
  Question(
    Character('c')))

which desugars to:

Sequence(
  Union(
    Character('a'),
    Character('b')),
  Union(
    Empty(),
    Character('c')))