
Speeding up text processing in Python (is hard)

by Itamar Turner-Trauring. Last updated 23 Mar 2023, originally created 23 Mar 2023.

If you’re doing text or string manipulation in Python, what do you do if your code is too slow? Assuming your algorithm is reasonably efficient, the next step is usually to try a faster alternative to Python: a compiled extension.

Unfortunately, this is harder than it seems. Some options don’t offer an easy path to further optimization, and others are actually slower. To see this limitation in action, we’ll consider some alternatives:

  • Pure Python, with the default Python interpreter.
  • Cython.
  • mypyc.
  • Rust.
  • Pure Python, with the PyPy interpreter.

We’ll also consider what can be done if these options don’t help.

An example task: name matching

In order to have something to measure and discuss, we will consider a concrete example: matching people’s names.

You are a volunteer with a local group pushing for the construction of more protected bike lanes; bikes are vastly cheaper and more sustainable than driving, and they’re often just as fast as cars in dense cities. The key bottleneck to adoption: safer infrastructure.

Having gathered signatures for a petition, you want to match them to the names of voters in the city’s voter registration database. This means you need to match names in two formats:

  • The names you have collected for your petition are a single field, for example "Itamar Turner-Trauring", manually typed by a human; as a result, spacing and capitalization might be inconsistent.
  • The names in the local voter database are stored in two fields (first and last name), in upper case, for example "ITAMAR" and "TURNER-TRAURING".

Names are complicated

How do you convert a single string containing a full name into two strings containing first and last names?

  1. If someone wrote down only two parts to their name, you split that in two and you’re done.
  2. Three parts is less common, and the middle part has three different ways it might be reflected in the voter database:
    1. Omitted: "John Q. Public" becomes ("JOHN", "PUBLIC").
    2. Part of the first name: "Marie Louise Mignot" becomes ("MARIE LOUISE", "MIGNOT").
    3. Part of the last name: "Shakira Mebarak Ripoll" becomes ("SHAKIRA", "MEBARAK RIPOLL").
  3. Finally, one part and four or more parts are rare enough where you live that you decide to skip them.

To normalize names we come up with the following pure Python implementation:

def single_name_to_first_last_names(
    name: str,
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    if len(parts) == 2:
        return [tuple(parts)]
    elif len(parts) == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

Given a name, it gives us a list of (first, last) pairs we can try to look up in the voter database.
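
For example, a three-part name produces all three candidate interpretations:

>>> single_name_to_first_last_names("Marie Louise Mignot")
[('MARIE', 'MIGNOT'), ('MARIE', 'LOUISE MIGNOT'), ('MARIE LOUISE', 'MIGNOT')]
>>> single_name_to_first_last_names("Itamar Turner-Trauring")
[('ITAMAR', 'TURNER-TRAURING')]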

Additional implementations

We can run the above code on the normal Python interpreter (aka CPython) and the sometimes much faster PyPy. But we can also build some alternatives.

Cython

Cython allows mixing Python and C code, but it can also compile most normal Python code unchanged. So we can recompile a slightly tweaked version of the above with Cython:

def single_name_to_first_last_names(
        name: str
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    cdef int length = len(parts)  # C variable!
    if length == 2:
        return [tuple(parts)]
    elif length == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

In practice that change doesn’t seem to make a significant difference to speed, but it does show how you can mix the two languages.

Given how heavily this function leans on Python string APIs, it’s not clear how to switch to faster C code beyond that single variable. We could in theory use C or C++ string processing APIs, but given the memory-safety issues with those languages it’s probably best to avoid that.
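
If you haven’t built a Cython extension before, here’s a minimal setup.py sketch; names_cy.pyx is a hypothetical filename for the code above:

from setuptools import setup
from Cython.Build import cythonize

# Compile the Cython source into an importable extension module.
# "names_cy.pyx" is a hypothetical filename for the code shown above.
setup(ext_modules=cythonize("names_cy.pyx"))

Running python setup.py build_ext --inplace then produces an extension you can import like any other module.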

mypyc

mypyc is an offshoot of the mypy Python type checker. It understands Python type annotations and attempts to use them to optimize Python code; like Cython, it emits a compiled extension. The pure Python code above has a type error when compiled with mypyc (tuple(parts) is typed as tuple[str, ...], which doesn’t match the declared tuple[str, str]), so we use a tweaked version:

def single_name_to_first_last_names(
    name: str,
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    if len(parts) == 2:
        return [(parts[0], parts[1])]  # Make mypyc happy
    elif len(parts) == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []
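
Compilation works much the same way as with Cython: mypyc also emits a C extension. Here’s a minimal setup.py sketch, assuming the code above is saved in a hypothetical names_mypyc.py:

from setuptools import setup
from mypyc.build import mypycify

# Compile the type-annotated Python module with mypyc.
# "names_mypyc.py" is a hypothetical filename for the code shown above.
setup(ext_modules=mypycify(["names_mypyc.py"]))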

Rust

Rust is a fast, memory-safe language, and you can build Python extensions in it using the PyO3 library. Here’s the function implementation, pretty much matching the Python version’s strategy; the string processing uses Rust’s string APIs.

#[pyfunction]
fn single_name_to_first_last_names(
        py: Python<'_>, name: String
) -> &PyAny {
    let name = name.to_uppercase();
    let parts: Vec<&str> = name.split(' ')
        .filter(|s| s.len() > 0)
        .collect();
    if parts.len() == 2 {
        PyList::new(py, [(parts[0], parts[1])])
    } else if parts.len() == 3 {
        let (a, b, c) = (parts[0], parts[1], parts[2]);
        PyList::new(py, [
            (a, c),
            (a, &[b, c].join(" ")),
            (&[a, b].join(" "), c)
        ])
    } else {
        PyList::empty(py)
    }
}
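
Once built (for example with maturin), the extension is called from Python like any other function; names_rs below is a hypothetical module name:

# "names_rs" is a hypothetical name for the compiled PyO3 extension.
from names_rs import single_name_to_first_last_names

print(single_name_to_first_last_names("Shakira Mebarak Ripoll"))
# [('SHAKIRA', 'RIPOLL'), ('SHAKIRA', 'MEBARAK RIPOLL'), ('SHAKIRA MEBARAK', 'RIPOLL')]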

Performance results, and some observations

Let’s compare the speed of all five options from slowest to fastest:

Version       Interpreter    Names/second
Rust          CPython 3.11   3.6 million
Pure Python   CPython 3.11   5.3 million
Cython        CPython 3.11   5.3 million
mypyc         CPython 3.11   6.6 million
Pure Python   PyPy 7.3       6.9 million

Disappointingly, we didn’t get a huge speedup anywhere; the Rust version is actually slower. Why? Let’s discuss each alternative in turn.

Why is Rust slower?

Given Rust is a fast, compiled language, which in some cases can run 20×-100× as fast as Python code, why is the Rust implementation so much slower in this case? (Yes, I am using the release profile.)

I haven’t done any profiling, but here are some theories:

  • There is overhead from converting between Python and Rust objects. CPython’s in-memory representation of strings is complex, while Rust uses UTF-8: converting between the two takes work.
  • Python’s string processing APIs are heavily optimized, and in some cases may actually beat Rust’s. For example, since CPython chooses a string’s internal representation based on its contents, it can use implementations specialized for ASCII (see the sketch after this list).
  • Given the function is so short, we’re mostly running CPython API code anyway.
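
For the curious, here’s a quick sketch of CPython’s flexible string representation at work; exact sizes vary by version, but the per-character cost differs:

import sys

# CPython stores ASCII-only and latin-1 strings using 1 byte per
# character, and strings containing wider characters using 2 or 4
# bytes per character, letting it pick faster code paths for narrow
# strings.
print(sys.getsizeof("a" * 1000))       # ~1,000 bytes plus fixed overhead
print(sys.getsizeof("\u4e16" * 1000))  # ~2,000 bytes plus fixed overhead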

I did write a slightly faster version by omitting the Vec and using a fixed-size array instead, but it still only processed 4 million names/second, slower than CPython.

Why is Cython no faster, while mypyc is faster?

Most of the code is just calls into CPython’s C APIs, so we’re mostly running the same code as the pure Python version. mypyc might be taking better advantage of type information for built-in methods, something Cython can’t do, but that’s just a theory.

In some earlier iterations of the implementation Cython was the same speed as mypyc.

Why is PyPy faster?

PyPy has a just-in-time compiler: it notices code that is called repeatedly with the same types and generates a more optimized machine code version on the fly. No ahead-of-time compilation step needed!

Can we do better?

The best we’ve managed is a 30% speedup: not bad, but also not very impressive. Can we do better? Maybe.

Right now we loop over names in Python, something like:

for name in petition_names:
    variants = single_name_to_first_last_names(name)
    for first, last in variants:
        if in_voter_database(first, last):
            # We found a match!
            # ... do something with it ...

We could instead pass the full list of petition names into the function we are optimizing. If we wrote that in Rust, it might be slower, or no faster, but we could release the GIL for much of the processing, which would let us take advantage of multiple cores. So while it wouldn’t save CPU time, it might save wall-clock time by using multiple cores at once.
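
Here’s a sketch of what the calling side might look like; match_names_bulk() is a hypothetical Rust-backed function that processes a whole chunk of names and releases the GIL while it works:

from concurrent.futures import ThreadPoolExecutor

def match_all(petition_names, n_threads=4):
    # One chunk of names per thread:
    chunk_size = (len(petition_names) + n_threads - 1) // n_threads
    chunks = [
        petition_names[i:i + chunk_size]
        for i in range(0, len(petition_names), chunk_size)
    ]
    # Because the hypothetical match_names_bulk() releases the GIL,
    # the threads can actually run in parallel on multiple cores:
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(match_names_bulk, chunks)
    return [match for chunk in results for match in chunk]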

Or we could switch from Python data structures to something like a Polars DataFrame, which might give us similar benefits while letting us do the task without writing any custom Rust code.
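
As a sketch of what that might look like, here’s the two-part-name case in Polars; the exact expressions are illustrative, but all the per-name work runs in Polars’ compiled engine instead of a Python loop:

import polars as pl

petitions = pl.DataFrame({"name": ["Itamar Turner-Trauring", "John Q. Public"]})

# Normalize and split every name in bulk, inside Polars' Rust engine:
split = petitions.with_columns(
    pl.col("name").str.to_uppercase().str.split(" ").alias("parts")
)

# The simple two-part case; three-part names would get similar handling
# to produce all three candidate interpretations:
two_part = split.filter(pl.col("parts").list.len() == 2).with_columns(
    first=pl.col("parts").list.get(0),
    last=pl.col("parts").list.get(1),
)

# Candidate matches could then come from a join against a voter
# DataFrame: two_part.join(voters, on=["first", "last"])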

More broadly, this function is just one part of the whole processing algorithm, and our time might be better spent optimizing it in other ways or at a higher level of abstraction.

How should you optimize your string processing code?

So how should you optimize your Python text processing code?

  1. Simply running your program with PyPy may give you a nice boost without any changes at all… assuming your dependencies work on PyPy.
  2. Cython and mypyc are also worth trying; you’re adding a bit of packaging and setup overhead due to the need to compile extensions, but you don’t really need to rewrite your code.
  3. For text, a compiled language like Rust might not help if you’re just converting a small function. Restructuring your code to use bulk operations might change that, or at least enable parallelism in cases where optimization fails.
