
Thomas Castleman's website.

Regular Expressions

2020

In May, I started working on a small regular expressions engine.

This implementation is naive in that it constructs and simulates nondeterministic finite automata (which are equivalent in expressive power to regular expressions) in order to perform its matching operations. Most modern engines instead convert to deterministic finite automata or employ backtracking, neither of which this implementation does.

My approach was largely inspired by reading Michael Sipser's Introduction to the Theory of Computation, which I highly recommend (especially for its treatment of regular/context-free languages).

The source can be found here.

Supported Syntax

The full syntax supported by this engine is detailed here.

Parsing

Input strings are parsed by recursive descent: sub-expressions are parsed recursively and combined to construct larger outer expressions.

The following grammar was used as the basis for parsing:

<regex> := <term> '|' <regex>
         | <term>

<term> := { <factor> }

<factor> := <base> <unary-op>
          | <base>

<unary-op> := '*'
            | '+'
            | '?'
            | <quantifier>

<base> := <char>
        | <char-class>
        | '.'
        | '\' <char>
        | '(' <regex> ')'
        | '[' <charset-term> ']'

<quantifier> := '{' <number> '}'
              | '{' <number> ',' <number> '}'
              | '{' <number> ',}'
              | '{,' <number> '}'

<charset-term> := <charset-factor> { <charset-factor> }

<charset-factor> := <char> '-' <char>
                  | <char>
                  | <char-class>

<char-class> := '\d'
              | '\w'
              | '\s'

I used the grammar from Matt Might's blog post as a starting point, and extended it to include extra syntactic support for character sets, more unary operators, etc.
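To make the recursive-descent structure concrete, here is a minimal sketch in Python covering only a subset of the grammar above (union, concatenation, '*', parentheses, and literal characters). The Parser class, its method names, and the tuple-based tree are illustrative assumptions, not the engine's actual code.

```python
# Illustrative recursive-descent parser for a subset of the grammar.
# Each grammar production becomes one method; trees are plain tuples.

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def eat(self, ch):
        assert self.peek() == ch, f"expected {ch!r} at position {self.pos}"
        self.pos += 1

    # <regex> := <term> '|' <regex> | <term>
    def parse_regex(self):
        term = self.parse_term()
        if self.peek() == '|':
            self.eat('|')
            return ('Union', term, self.parse_regex())
        return term

    # <term> := { <factor> }  (zero or more factors, concatenated)
    def parse_term(self):
        factors = []
        while self.peek() not in (None, ')', '|'):
            factors.append(self.parse_factor())
        if not factors:
            return ('Empty',)
        seq = factors[0]
        for f in factors[1:]:
            seq = ('Sequence', seq, f)
        return seq

    # <factor> := <base> <unary-op> | <base>  ('*' only, in this subset)
    def parse_factor(self):
        base = self.parse_base()
        while self.peek() == '*':
            self.eat('*')
            base = ('Star', base)
        return base

    # <base> := <char> | '(' <regex> ')'
    def parse_base(self):
        if self.peek() == '(':
            self.eat('(')
            inner = self.parse_regex()
            self.eat(')')
            return inner
        ch = self.peek()
        self.eat(ch)
        return ('Char', ch)
```

Note how each production in the grammar maps directly onto one method, and how alternation recurses on `parse_regex` to build a right-nested chain of unions.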

Desugaring

Parsing yields a tree consisting of 17 possible types of tokens. We reduce this to a set of 6 core tokens: Character, Dot, Empty, Union, Sequence, and Star.

We can express the other 11 "surface" tokens in terms of the core 6 as follows (illustrated informally through example):

a+      =>  aa*
a?      =>  (|a)
[abc]   =>  a|b|c
[a-e]   =>  a|b|c|d|e
\d      =>  [0-9]          =>  0|1|2|...|9
\w      =>  [A-Za-z0-9_]   =>  A|...|Z|a|...|z|0|...|9|_
\s      =>  [ \t\r\n\f]    =>  (space)|\t|\r|\n|\f
a{3}    =>  aaa
a{2,5}  =>  aa(|a)(|a)(|a)
a{2,}   =>  aaa*
a{,4}   =>  (|a)(|a)(|a)(|a)
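A few of these rewrites can be sketched as a recursive pass over the tree. This is an assumed Python formulation, not the engine's code; node names like 'Plus', 'Question', 'Charset', and 'ExactQuant' are hypothetical surface-token names.

```python
# Illustrative desugaring pass: rewrite surface tokens into the core six
# (Char, Dot, Empty, Union, Sequence, Star). Trees are plain tuples.

def desugar(node):
    kind = node[0]
    if kind == 'Plus':                       # a+  =>  aa*
        a = desugar(node[1])
        return ('Sequence', a, ('Star', a))
    if kind == 'Question':                   # a?  =>  (|a)
        return ('Union', ('Empty',), desugar(node[1]))
    if kind == 'Charset':                    # [abc]  =>  a|b|c
        chars = node[1]
        tree = ('Char', chars[0])
        for c in chars[1:]:
            tree = ('Union', tree, ('Char', c))
        return tree
    if kind == 'ExactQuant':                 # a{3}  =>  aaa
        a = desugar(node[1])
        tree = a
        for _ in range(node[2] - 1):
            tree = ('Sequence', tree, a)
        return tree
    if kind in ('Char', 'Dot', 'Empty'):     # core leaves pass through
        return node
    if kind == 'Star':
        return ('Star', desugar(node[1]))
    if kind in ('Union', 'Sequence'):
        return (kind, desugar(node[1]), desugar(node[2]))
    raise ValueError(f"unknown node kind: {kind}")
```

The remaining surface tokens (bounded quantifiers, the built-in character classes, ranges) follow the same pattern of rewriting into unions and sequences.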

After desugaring, simplifications are applied to the tree to eliminate unnecessary complexity (e.g. Union(Dot, Dot) becomes just Dot, and Star(Empty) becomes just Empty).
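A simplification pass like this can also be sketched as a bottom-up rewrite. Again, this is an assumed Python sketch over tuple trees, and the particular rewrite rules beyond the two examples above are my own additions for illustration.

```python
# Illustrative simplification pass over the core tokens: simplify children
# first, then collapse redundant structure at each node.

def simplify(node):
    kind = node[0]
    if kind in ('Char', 'Dot', 'Empty'):
        return node
    if kind == 'Star':
        inner = simplify(node[1])
        if inner == ('Empty',):              # Star(Empty)  =>  Empty
            return ('Empty',)
        if inner[0] == 'Star':               # Star(Star(r))  =>  Star(r)
            return inner
        return ('Star', inner)
    if kind == 'Union':
        left, right = simplify(node[1]), simplify(node[2])
        if left == right:                    # covers Union(Dot, Dot)  =>  Dot
            return left
        return ('Union', left, right)
    if kind == 'Sequence':
        left, right = simplify(node[1]), simplify(node[2])
        if left == ('Empty',):               # Empty is the identity for
            return right                     # concatenation
        if right == ('Empty',):
            return left
        return ('Sequence', left, right)
    raise ValueError(f"unknown node kind: {kind}")
```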

NFA Construction/Simulation

At the base level, we can construct simple NFAs that recognize a single character literal, or only the empty string.

The Dot token is similar to the character literal case, but I chose to handle this at simulation time.

The three regular operations (Union, Sequence, and Star) are handled by constructing NFAs for the subcomponents, and joining/modifying them in particular ways:

Sipser has great diagrams illustrating these, which I recommend checking out.
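The three constructions can be sketched in Python along the lines of Thompson's construction. The dictionary-based NFA record, the epsilon marker `None`, and all helper names here are assumptions for illustration, not the engine's representation.

```python
# Illustrative Thompson-style NFA construction. An NFA is a dict with a
# start state, a set of accept states, and a transition map from
# (state, symbol) to a set of states; symbol None marks epsilon moves.

import itertools

_counter = itertools.count()

def new_state():
    return next(_counter)

def nfa_char(c):
    # Two states joined by a single edge labeled c.
    s, t = new_state(), new_state()
    return {'start': s, 'accept': {t}, 'delta': {(s, c): {t}}}

def nfa_empty():
    # One state that is both start and accept: recognizes only "".
    s = new_state()
    return {'start': s, 'accept': {s}, 'delta': {}}

def _merge(d1, d2):
    out = {k: set(v) for k, v in d1.items()}
    for k, v in d2.items():
        out.setdefault(k, set()).update(v)
    return out

def nfa_union(a, b):
    # New start state with epsilon edges into both branches.
    s = new_state()
    delta = _merge(a['delta'], b['delta'])
    delta[(s, None)] = {a['start'], b['start']}
    return {'start': s, 'accept': a['accept'] | b['accept'], 'delta': delta}

def nfa_sequence(a, b):
    # Epsilon edges from a's accept states into b's start state.
    delta = _merge(a['delta'], b['delta'])
    for q in a['accept']:
        delta.setdefault((q, None), set()).add(b['start'])
    return {'start': a['start'], 'accept': b['accept'], 'delta': delta}

def nfa_star(a):
    # New accepting start state, plus epsilon edges looping back for repetition.
    s = new_state()
    delta = {k: set(v) for k, v in a['delta'].items()}
    delta[(s, None)] = {a['start']}
    for q in a['accept']:
        delta.setdefault((q, None), set()).add(a['start'])
    return {'start': s, 'accept': a['accept'] | {s}, 'delta': delta}
```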

Composing these construction tools and the base level NFAs, we get an NFA that recognizes the same language as the original regular expression. We can simulate it by maintaining a set of current states, reading characters from the input stream, and updating this set of states to follow transitions.
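The simulation loop above can be sketched as follows. The NFA here is hand-built for the expression a* (not produced by the engine), and the representation — a transition map with None as the epsilon symbol — is an assumption for illustration. The key pieces are the epsilon-closure, and stepping the whole set of current states on each input character.

```python
# Illustrative set-of-states NFA simulation. The hand-built machine below
# recognizes a*: state 2 is the start, an epsilon edge enters the 'a' loop,
# and an epsilon edge after each 'a' allows repetition.

delta = {
    (2, None): {0},   # epsilon from the start into the 'a' sub-NFA
    (0, 'a'): {1},
    (1, None): {0},   # epsilon back to the sub-NFA's start, for repetition
}
start, accept = 2, {1, 2}

def epsilon_closure(states):
    # Follow epsilon (None) edges until no new states appear.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, None), set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def matches(s):
    current = epsilon_closure({start})
    for ch in s:
        # Step every current state on ch, then close over epsilons again.
        stepped = set()
        for q in current:
            stepped |= delta.get((q, ch), set())
        current = epsilon_closure(stepped)
    # Accept iff some current state is an accept state.
    return bool(current & accept)
```

The input is accepted exactly when, after the last character, the set of current states intersects the accept states.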

Example

To see this process in action, consider the example input expression (a|b)c?.