Anatomy of a regular expression

I've been talking about how regular expressions worried me when it came to the 513 Unicode project, but it occurs to me I haven't explained why.

So the deal is, with this Unicode project BYOND will now be using modified UTF-8 a lot under the hood. (Modified because BYOND's special formatting character will be left alone. It isn't valid in UTF-8 anyway, which is handy.) That means that characters are no longer 1 byte each, but may in fact be as long as 4 bytes.

In the regular expression code, there are multiple places where the code is optimized for searching repeats of single-character sequences. For instance, [A-Za-z] stands for any letter from A-Z, regardless of case. That's actually really easy to handle. But what about [^A-Za-z] which means any character that is not a letter? That match can be 1-4 bytes and when you throw a * or + modifier after that character range to represent a repeat, you can no longer assume that 10 matches in a row is 10 characters. And what if this character range includes anything trans-ASCII like an accented letter or emoji? The . operator in a regular expression means any character (except a line break), and it too can be 1-4 characters wide.

Fortunately there's a bit of a shortcut. The older code can still be used in many places, if the string being searched is only ASCII. Internally, BYOND's string system will now store encoding data to say if the string contains UTF-8 or not. This means I can activate special logic for handling UTF-8 only when it's needed, which for many strings it won't be.

One major cause of trouble though was in character ranges. Not just the [A-Z] variety, but special sequences like \d, \W, etc. These can occur on their own OR inside an existing character range, and the code that was handling them before was somewhat limited. There's a routine in the C library called strspn(), and a similar strcspn(), that takes a whole list of characters as its second argument, and finds out how many matches or non-matches it can make in a search string against that set of characters. The ranges used by regular expressions relied on these functions, but those functions are 1) not suited for UTF-8, requiring special equivalents, and 2) absolutely unsuited for a very very large range of potential matching characters. So I needed to do more special processing.

Internally, a character range was represented by one of two opcodes in the compiled regular expression: one that said we want to match a list of characters, the other that said we want to find a non-match. Now there are six opcodes: two more for cases where the characters include UTF-8, and two for a new compact format when matching against a very large number of characters.

So far my changes are still pending testing, but I have a good feeling about this. I think I made the right call and as a result, this nightmare will soon be behind me. Just wish me luck.

BYOND released this post 7 days early for patrons.   Become a patron
Tier Benefits
Recent Posts