Python support for regular expressions

source link: https://lwn.net/SubscriberLink/885682/48b636e302eea2fa/


Regular expressions are a common feature of computer languages, especially higher-level languages like Ruby, Perl, Python, and others, for doing fairly sophisticated text-pattern matching. Some languages, including Perl, incorporate regular expressions into the language itself, while others have classes or libraries that come with the language installation. Python's standard library has the re module, which provides facilities for working with regular expressions; as a recent discussion on the python-ideas mailing list shows, though, that module has somewhat fallen by the wayside in recent times.

Timeouts?

J.B. Langston posted to the list on February 14; he had filed a bug about a problem he encountered using re, which was closed with a suggestion that he bring it to the mailing list. Langston had a regular expression (which is often abbreviated as "regex" or "regexp") that seemed to hang his program when applied to a rarely occurring log message; it turned out that there was a flaw in his regular expression, which caused an enormous amount of backtracking that, effectively, never completed. In the bug report, he was asking for a way to specify a timeout:

I will try to rewrite my regex to address this specific issue, but it's hard to anticipate every possible input and craft a bulletproof regex, so something like this kind of thing can be used for a denial of service attack (intentional or not). In this case the regex was used in an automated import process and caused the process to back up for many hours before someone noticed. Maybe a solution could be to add a timeout option to the regex engine so it will give up and throw an exception if the regex executes for longer than the configured timeout.

He elaborated his use case further in his post:

My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex.

Also because of the volume of data these regexes are parsing, I would never want to allow a regex to run longer than a few milliseconds because if it did, that would kill my log processing throughput. I'd rather that it just raise an exception and move on to the next log entry.
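The failure mode Langston describes is easy to reproduce. His actual regular expression was not shown, but a classic pathological pattern is a nested quantifier such as "(a+)+$": on an input that cannot match, the engine tries every way of splitting the run of "a" characters between the inner and outer "+", so the time taken grows exponentially with the input length:

```python
import re
import time

# A pattern with a nested quantifier: on a failing input, the engine
# backtracks through every split of the 'a's between the inner and
# outer '+', which is exponential in the input length.
PATHOLOGICAL = re.compile(r'(a+)+$')

def time_match(n):
    """Time how long the (failing) match takes for n leading 'a's."""
    s = 'a' * n + 'b'          # the trailing 'b' guarantees failure
    start = time.perf_counter()
    result = PATHOLOGICAL.match(s)
    return result, time.perf_counter() - start

for n in (10, 15, 20):
    result, elapsed = time_match(n)
    print(f'n={n:2d}: matched={result is not None}, {elapsed:.4f}s')
```

Each additional "a" roughly doubles the running time; well before n reaches 40, the match effectively never completes, which is the kind of hang that Langston encountered.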

Jonathan Slenders replied that the regex module on the Python Package Index (PyPI) has support for timeouts. regex is meant both to be a drop-in replacement for re (using its "version 0") and to provide additional regular-expression features when using "version 1". Those features include nested sets, full Unicode case-folding by default, dropping the global interpreter lock (GIL) during matching for concurrency, and more. The regex home page lists a whole raft of differences when using version 1, including a timeout parameter for the matching functions.

Musings on regular expressions

Tim Peters also mentioned regex, though not because it implements timeouts (which is something he only learned via the thread), but because it is "a terrific module" with many features from newer regular-expression implementations. It is "also harder to provoke into exponential-time bad cases". He discussed some of the tradeoffs that come with regular expressions and recommended the classic book, Mastering Regular Expressions, for those struggling to use them well. He noted that SNOBOL, which is a string-processing language from the 1960s, did not have regular expressions, though there were tasks where it (and its successor of sorts, Icon) were able to do matching in more natural ways:

Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.

Those who are not well-versed in regular-expression syntax may find some of that a bit puzzling. While this article cannot be an introduction to regular expressions—there are countless web sites, books, and other resources for that—we will try to give readers a bit of a leg up. In regular-expression syntax, "." represents any single character, and adding "*" matches zero or more occurrences of the previous term, so ".*" matches any string, while ".*X" matches any string up to (and including) a literal "X". But, as the following example shows, that may not be exactly what the programmer had in mind:

    >>> import re
    >>> re.search(r'.*X', 'abcXdefXg')
    <re.Match object; span=(0, 8), match='abcXdefX'>

The match value in the object returned from re.search() shows that it matched all the way up to the second occurrence of "X" in the string. That is because the "*" operator is "greedy"—it matches as much as it can. The 20% case in Peters's message uses the non-greedy quantifier "?" to get closer to the result that was asked for:

    >>> re.search(r'.*?X', 'abcXdefXg')
    <re.Match object; span=(0, 4), match='abcX'>

The 5% case, "[^X]*X", uses a negated character set, "[^X]", which means to match anything but "X". So that regular expression can be read as "match any characters that are not 'X', followed by 'X'", but it may well not be the first thing that comes to mind. Steven D'Aprano was surprised that ".*?X" might match beyond the next "X", but Chris Angelico pointed out that non-greedy does not mean it will only match to the next "X":
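For completeness, the 5% case does stop at the first "X", because the negated set can never consume one:

```python
import re

# The negated-class form stops at the first 'X': '[^X]*' can only
# consume non-'X' characters, so the match cannot extend past it.
m = re.search(r'[^X]*X', 'abcXdefXg')
print(m)  # matches 'abcX', span (0, 4)
```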

Nongreedy means it'll prefer the next X, but it has to be open to checking others.

    >>> re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz")
    <re.Match object; span=(3, 16), match='abbbXcccXdddY'>

The X between bbb and ccc won't result in a match, so the .*? has to capture more.

The sub-thread extended a ways, looking into deeper aspects of matching "up to the next 'string'", which shows the complications that can arise—complete with mistaken "solutions". As can be seen, regular expressions are a powerful mechanism, but they are complex, can be used inappropriately, and are prone to bugs of various sorts; they are also difficult to test fully and to debug when problems are encountered. But they have become ubiquitous in today's computing landscape. Peters said:

As to why regexps prevailed, traction! They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations. Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)

Matthew Barnett ("MRAB"), who is the developer of regex, agreed with that assessment:

Regexes were simple to start with, so only a few metacharacters were needed, the remaining characters being treated as literals.

As new features were added, the existing metacharacters were used in new ways that had been illegal until then in order to remain backwards-compatible.

Add to that that there are multiple implementations with differing (and sometimes only slightly differing) features and behaviours.

It's a good example of evolution: often messy, and resulting in clunky designs.

Back to timeouts and regex

Langston "quite enjoyed reading" the thread that resulted from his post, but wanted to see if there was support "for adding a timeout feature to the Python re library". He said that he would be investigating regex but still thought re could benefit from a way to stop runaway regular expressions. Angelico was opposed to the idea, suggesting that some of the other matching techniques explored in the thread should be pursued instead. "It would add overhead to common cases in order to put a shield around pathological ones, and it's difficult to impossible to usefully define the cutoff." As he noted in his first reply, though, Peters thinks that no work is likely to be done on features for re:

Buried in the fun discussion was my guess: no way. Python's re is effectively dead legacy code, with no current "owner". Its commit history shows very little activity for some years already. Most commits are due to generic "code cleanup" crusades that have nothing specific to do with the algorithms. None required non-trivial knowledge of the implementation.
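Pending any built-in support, one blunt workaround for code that must stay on re is to run the match in a separate process and kill it if it runs too long. The sketch below assumes a Unix-like platform (where multiprocessing uses fork); the per-match process overhead is substantial, so it would be unsuitable for a high-throughput case like Langston's, but it does bound the damage a runaway regular expression can do:

```python
import re
import multiprocessing as mp

def _worker(pattern, string, queue):
    # Report the matched text (or None for no match) back to the parent.
    m = re.search(pattern, string)
    queue.put(m.group() if m else None)

def search_with_timeout(pattern, string, timeout):
    """Run re.search() in a child process; raise TimeoutError on overrun."""
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(pattern, string, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()     # kill the runaway match
        proc.join()
        raise TimeoutError(f'regex took longer than {timeout}s')
    return queue.get()

if __name__ == '__main__':
    print(search_with_timeout(r'[^X]*X', 'abcXdefXg', 1.0))
    try:
        # A pathological pattern that backtracks exponentially.
        search_with_timeout(r'(a+)+$', 'a' * 40 + 'b', 0.5)
    except TimeoutError as exc:
        print(exc)
```

The function name and structure here are purely illustrative, not anything proposed in the thread; regex's own timeout parameter avoids all of this overhead by checking the deadline inside the matching engine itself.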

Peters said that the code that makes up the regex module was originally targeted at becoming part of core Python back in 2008 by its original author, Jeffrey C. Jacobs; the work was picked up and carried forward by Barnett, and eventually turned into regex. The request to switch re over to using it was closed in 2021 because of the existence of regex in PyPI; "If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first."

The problem is that regex has "_dozens_ of features that would be valuable to have in the standard library", Peters said, so:

[N]o core dev I know of is going to devote their limited time to reproducing a tiny subset of regex's many improvements in Python's legacy engine. In fact, "install regex!" is such an obvious choice at this point that I wouldn't even give time to just reviewing a patch that added timeouts.

Barnett said that he eventually decided against having the regex code added to the standard library, at least in part because "that would tie fixes and additions to Python's release cycle". Python is known for being "batteries included", "but not nuclear reactors", so having regex in PyPI makes more sense, he said. Peters thought that some features in regex might have been nuclear reactors back in 2008, but are being used more commonly today:

[...] Python's re module is frozen in an ever-receding past. Nobody wants to work on it because, well, "regex already does that! In fact, it's been doing it for 15 years already".

Your module is _too_ successful for Python's good ;-)

Peters also pointed out that trying out regex may be easier than Langston realized. In fact, because of the version 0 compatibility mode, he could perhaps add a simple import to give it a whirl:

    import regex as re

In another message, Peters further described what he meant by that:

What I wrote here is more elaboration on that _trying_ this is easier than they might be thinking: They don't have to, e.g., rewrite their regexps, or invoke different function or method names, or worry that they'll get different results. The packages are highly compatible in syntax and semantics and APIs so long as you stick to the things re does. That in no way suggests they _should_ stick to what re does. It's assuring them that _getting started_ is close to trivial. Their current code should continue to work unchanged, apart from just changing "re" to "regex".

It turns out that Langston's program already uses regex "via some transitive dependency", but he was not impressed with its performance when compared to re. A test data set of 700MB was processed in 77 seconds with re but it took 92 seconds with regex. Peters was "mildly surprised" by that, since: "Most times people report that regex is at least modestly faster than re".

Whither re?

The performance of regex seems like something that might need attention, especially if it is becoming the de facto regular-expression module for Python. Peters's analysis of the status of re is somewhat disheartening; it is apparently permanently stuck in the past. That may be sufficient for many, but it seems suboptimal to have two separate pieces of the Python ecosystem that both support the more limited re subset; most enhancements are likely to go only into regex, so re falls further and further behind the state of the art.

Supplanting re with regex in the standard library (using version 0 by default) would seem attractive, though Barnett seems at least somewhat cautious about doing so. A similar thing occurred with the popular Requests HTTP module, which was considered as a possible addition to the standard library at the 2015 Python Language Summit. The conclusion was that it made more sense for Requests to stay out of the standard library because it moves faster than the normal Python release cadence (then 18 months, but now yearly), especially for security updates (which are done more frequently for the language, but not as quickly as Requests can move on its own).

The "batteries included" story for Python has been a major part of its success over the years, but it is starting to fray in various ways. For one thing, the large number of said batteries is straining the maintenance ability of the Python core developers. That has led to discussions, a Python Enhancement Proposal (PEP), and further discussion about removing some parts of the library over the years. Most of the modules in the standard library were added long ago and, once something is added, it is difficult to remove it—even if the reason for its inclusion and the existence of maintainers for it have gone away.

Meanwhile, regular expressions are clearly something that Python programmers use—a lot—so having the best support for them, in one place if possible, seems like the right approach. As Peters noted, they are "a tool with a hyper-concise notation, where a correct expression is pretty much indistinguishable from line noise, and a typo is rarely detectable as a syntax error". Adding risks of incompatibility (or differing performance) into the mix may not lead to much joy either. It is all a bit of a pickle, but not that pickle, of course.

On the other hand, re, which was originally developed by Fredrik Lundh (who sadly died back in November), does what it needs to do for lots of different use cases. Those who need timeouts, atomic groups, nested character sets, possessive quantifiers, and other advanced features have a place to turn. Barnett seems keen to maintain the compatibility with re, so it may turn out to be a situation, like with Requests, where alternatives to the standard library should be recommended, perhaps even in the Python documentation. There is no clear and obvious "right" solution here it seems.

