Thomas Castleman's website.
In May, I started working on a small regular expression engine.
This implementation is naive in that it constructs and simulates nondeterministic finite automata (which are equivalent in power to regular expressions themselves) in order to perform matching and related text operations. Most modern engines convert to deterministic finite automata or employ backtracking, neither of which this implementation does.
My approach was largely inspired by reading Michael Sipser's Introduction to the Theory of Computation, which I highly recommend (especially for its treatment of regular/context-free languages).
The source can be found here.
The full syntax supported by this engine is detailed here.
Input strings are parsed using recursive descent, so sub-expressions are parsed recursively and are used to construct larger outer expressions.
The following grammar was used as the basis for parsing:
<regex>          := <term> '|' <regex> | <term>
<term>           := { <factor> }
<factor>         := <base> <unary-op> | <base>
<unary-op>       := '*' | '+' | '?' | <quantifier>
<base>           := <char> | <char-class> | '.' | '\' <char> | '(' <regex> ')' | '[' <charset-term> ']'
<quantifier>     := '{' <number> '}' | '{' <number> ',' <number> '}' | '{' <number> ',}' | '{,' <number> '}'
<charset-term>   := <charset-factor> { <charset-factor> }
<charset-factor> := <char> '-' <char> | <char> | <char-class>
<char-class>     := '\d' | '\w' | '\s'
I used the grammar from Matt Might's blog post as a starting point, and extended it to include extra syntactic support for character sets, more unary operators, etc.
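To sketch how recursive descent maps onto this grammar, here is a minimal parser for just the union/sequence/star fragment. The function names (parse_regex, parse_term, etc.) and the tuple-shaped parse tree are my illustration, not the engine's actual code:

```python
# Minimal recursive-descent sketch for a fragment of the grammar above.
# Each grammar production gets its own parse function.

def parse(pattern):
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def eat(c):
        nonlocal pos
        assert peek() == c, f"expected {c!r} at position {pos}"
        pos += 1

    def parse_regex():              # <regex> := <term> '|' <regex> | <term>
        term = parse_term()
        if peek() == '|':
            eat('|')
            return ('Union', term, parse_regex())
        return term

    def parse_term():               # <term> := { <factor> }
        factors = []
        while peek() is not None and peek() not in '|)':
            factors.append(parse_factor())
        return ('Sequence', factors)

    def parse_factor():             # <factor> := <base> <unary-op> | <base>
        base = parse_base()
        if peek() == '*':
            eat('*')
            return ('Star', base)
        return base

    def parse_base():               # <base> := <char> | '(' <regex> ')'
        if peek() == '(':
            eat('(')
            inner = parse_regex()
            eat(')')
            return inner
        c = peek()
        eat(c)
        return ('Character', c)

    tree = parse_regex()
    assert pos == len(pattern), "unconsumed input"
    return tree
```

Because each nonterminal becomes a function that calls the functions for its sub-productions, extending the parser to quantifiers and character sets is a matter of adding cases to parse_factor and parse_base.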
Parsing yields a tree consisting of 17 possible types of tokens. We reduce this to a set of 6 core tokens: Character, Dot, Empty, Union, Sequence, and Star. We can express the other 11 "surface" tokens in terms of the core 6 as follows (illustrated informally through example):
a+      ⟶  aa*
a?      ⟶  (|a)
[abc]   ⟶  a|b|c
[a-e]   ⟶  a|b|c|d|e
\d      ⟶  [0-9]          ⟶  0|1|2|...|9
\w      ⟶  [A-Za-z0-9_]   ⟶  A|...|Z|a|...|z|0|...|9|_
\s      ⟶  [ \t\r\n\f]    ⟶  (space)|\t|\r|\n|\f
a{3}    ⟶  aaa
a{2,5}  ⟶  aa(|a)(|a)(|a)
a{2,}   ⟶  aaa*
a{,4}   ⟶  (|a)(|a)(|a)(|a)
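In code, this desugaring is a recursive rewrite over the tree. Here is a sketch covering a few of the rules above; the tuple-shaped tokens and a Range node carrying the quantifier bounds are assumptions for illustration, not the engine's actual types:

```python
# Sketch of desugaring surface tokens into the six core tokens.

def seq(parts):
    return ('Sequence', list(parts))

def desugar(node):
    kind = node[0]
    if kind == 'Plus':                      # a+  ⟶  aa*
        inner = desugar(node[1])
        return seq([inner, ('Star', inner)])
    if kind == 'Question':                  # a?  ⟶  (|a)
        return ('Union', ('Empty',), desugar(node[1]))
    if kind == 'Range':                     # a{2,5}  ⟶  aa(|a)(|a)(|a)
        inner, lo, hi = desugar(node[1]), node[2], node[3]
        required = [inner] * lo
        optional = [('Union', ('Empty',), inner)] * (hi - lo)
        return seq(required + optional)
    return node                             # core tokens pass through

print(desugar(('Plus', ('Character', 'a'))))
# ('Sequence', [('Character', 'a'), ('Star', ('Character', 'a'))])
```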
After desugaring, simplifications are applied to the parse tree to eliminate any unnecessary complexity (e.g. Union(Dot, Dot) becomes Dot, and Star(Empty) becomes just Empty).
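A simplification pass like this can also be written as a small recursive rewrite. This is a sketch under the same assumed tuple representation, covering only the two rules mentioned above:

```python
# Sketch of post-desugaring simplification rules.

def simplify(node):
    kind = node[0]
    if kind == 'Union':
        left, right = simplify(node[1]), simplify(node[2])
        if left == right:                   # Union(Dot, Dot) -> Dot
            return left
        return ('Union', left, right)
    if kind == 'Star':
        inner = simplify(node[1])
        if inner == ('Empty',):             # Star(Empty) -> Empty
            return ('Empty',)
        return ('Star', inner)
    return node
```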
At the base level, we can construct simple NFAs that recognize a single character literal, or the empty string.
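In code, these two base cases might look like the following sketch. The dict-based NFA representation (start state, accept set, transition map) is my own assumption for illustration, not the engine's:

```python
import itertools

# Generator of fresh state ids, so each constructed NFA gets unique states.
fresh = itertools.count()

def char_nfa(c):
    # Two states: start transitions to accept on character c.
    s, t = next(fresh), next(fresh)
    return {'start': s, 'accept': {t}, 'trans': {(s, c): {t}}}

def empty_nfa():
    # One state that is both start and accepting: recognizes only "".
    s = next(fresh)
    return {'start': s, 'accept': {s}, 'trans': {}}
```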
The Dot token is similar to the character literal case, but I chose to handle it at simulation time.
The three regular operations (Union, Sequence, and Star) are handled by constructing NFAs for the subcomponents, and joining/modifying them in particular ways.
Sipser has great diagrams illustrating these, which I recommend checking out.
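The constructions can be sketched with epsilon-transitions, in the style of Sipser's diagrams. As before, the dict-based NFA shape here is an assumed representation, not the engine's actual one:

```python
import itertools

# Fresh ids start at 100 so they can't clash with hand-numbered test states.
fresh = itertools.count(100)
EPS = None  # label used for epsilon-transitions

def add(trans, state, label, dest):
    trans.setdefault((state, label), set()).add(dest)

def union(a, b):
    # New start state with epsilon-arrows into both machines.
    s = next(fresh)
    trans = {**a['trans'], **b['trans']}
    add(trans, s, EPS, a['start'])
    add(trans, s, EPS, b['start'])
    return {'start': s, 'accept': a['accept'] | b['accept'], 'trans': trans}

def sequence(a, b):
    # Epsilon-arrows from a's accept states to b's start state.
    trans = {**a['trans'], **b['trans']}
    for q in a['accept']:
        add(trans, q, EPS, b['start'])
    return {'start': a['start'], 'accept': b['accept'], 'trans': trans}

def star(a):
    # New accepting start state; accept states loop back to a's old start.
    s = next(fresh)
    trans = dict(a['trans'])
    add(trans, s, EPS, a['start'])
    for q in a['accept']:
        add(trans, q, EPS, a['start'])
    return {'start': s, 'accept': a['accept'] | {s}, 'trans': trans}
```

The new start state in star is accepting because a* must match the empty string; the epsilon-arrows from accept states back to the old start allow any number of repetitions.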
Composing these construction tools and the base level NFAs, we get an NFA that recognizes the same language as the original regular expression. We can simulate it by maintaining a set of current states, reading characters from the input stream, and updating this set of states to follow transitions.
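That simulation loop can be sketched as follows, assuming an NFA encoded as a dict of transitions keyed by (state, label), with None as the epsilon label (again my representation, not the engine's):

```python
EPS = None

def closure(states, trans):
    # Expand a state set with everything reachable via epsilon-arrows.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in trans.get((q, EPS), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def simulate(nfa, text):
    current = closure({nfa['start']}, nfa['trans'])
    for ch in text:
        moved = set()
        for q in current:
            moved |= nfa['trans'].get((q, ch), set())
        current = closure(moved, nfa['trans'])
    # Accept iff any current state is accepting after the whole input.
    return bool(current & nfa['accept'])

# Hand-built NFA for (a|b): 0 -eps-> {1,3}, 1 -a-> 2, 3 -b-> 4
nfa = {'start': 0, 'accept': {2, 4},
       'trans': {(0, EPS): {1, 3}, (1, 'a'): {2}, (3, 'b'): {4}}}
print(simulate(nfa, 'a'))   # True
print(simulate(nfa, 'c'))   # False
```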
To see this process in action, consider the example input expression (a|b)c?.

Parsing yields:

Sequence(
  Union(
    Character('a'),
    Character('b')),
  Question(
    Character('c')))

which desugars to:

Sequence(
  Union(
    Character('a'),
    Character('b')),
  Union(
    Empty(),
    Character('c')))