Python performance: it’s not just the interpreter

I have a particular view of Python performance from my experience on the Pyston project, and since this view is somewhat nonstandard I wanted to take some time to explain it and give a motivating example.

A common explanation for Python's low performance is that it is an interpreted language. In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark. Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.

tldr

In this particular microbenchmark, the time goes to:

31% argument passing
31% "real work"
13% int boxing
11% interpreter overhead
8% recursion-limit checking
6% dynamic type checking

[I've excluded some "easily avoidable" overheads from this list.]

I picked a particular microbenchmark that I knew had inefficient argument passing, but it's still interesting to see that that cost is almost 3x the cost of using an interpreter. I was also surprised that recursion-limit checking played such a role since it is not often mentioned.

I'm calling argument passing a "dynamic feature" because in this case it is completely avoidable. Python argument passing can be expensive because it requires packing up arguments into a tuple allocated for this purpose, as well as interpretation of a format string to determine how to parse the arguments on the callee side.

The microbenchmark

I've collected these numbers from a simple microbenchmark which I pulled from a real-world use case (django templating); I'm not claiming that it's particularly representative, just that it's illustrative and an interesting data point.

Typically a program's performance is analyzed by running a profiler to see what parts of the program take the most time. I'm going to take a bit of a different approach: I'm going to iteratively optimize the benchmark and see how much each optimization helps. I believe this gives a much clearer view of the cost of each feature that we optimize away, as well as providing hints as to what it would take to do these optimizations automatically.

The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into. The motivation for this is to imagine that you have a template with many variables in it, all of which need to be converted to strings for rendering.

Here's the code :

def main():
 for j in range(20):
   for i in range(1000000):
     str(i)
main()

(You can find all the versions on my github )

The doubly-nested loops will become handy later.

In terms of methodology, I'm simply running all of these benchmarks on my development machine, doing three runs of each benchmark and reporting the median. The performance differences are pretty stark so we don't need extremely precise or accurate benchmarking, so I kept it simple. Unless mentioned otherwise, I'm running all benchmarks via my Ubuntu-provided python3 binary, which is Python 3.7.3.

Results

On my machine, the benchmark takes 2.33s

We can do a simple optimization : referencing str every iteration of the loop forces the interpreter to do a somewhat-expensive global variable lookup, which is slower than a local variable lookup. We can cache the str object into a local variable, and this brings the benchmark down to 2.07s .

The next goal is to move the for loop out of Python and into C, and in this case we can do that with the map() function. The benchmark is now

for j in range(20):
    list(map(str, range(1000000)))

This version does more work since it is creating a list of all the strings. Perhaps surprisingly, the wins from removing the interpreter almost outweigh this extra work, and this new version comes in at 2.11s

As a point of reference, if we create the same list via a list comprehension , the benchmark takes 2.34s . The optimization from 2.34s to 2.11s by using map lets us calculate the "interpreter overhead" as 10% of the program's execution. 10% is large in many situations, but is not nearly enough to explain Python's reputation for being slow.

To proceed further, we're going to need to move into C extension territory. I ran Cython (a Python->C converter) on the previous benchmark, and it runs in exactly the same amount of time: 2.11s . I wrote a simplified C extension in 36 lines compared to Cython's 3600, and it too runs in 2.11s . [The variance in the benchmark runs is about 0.03s, so getting the same exact median time seems lucky.]

For reference here's the C code .

for (int i = 0; i < 20; i++) {
    PyObject* a = PyTuple_Pack(1, PyLong_FromLong(1000000));
    PyObject* r = PyObject_Call((PyObject*)&PyRange_Type, a, NULL);
    Py_DECREF(a);

    a = PyTuple_Pack(2, (PyObject*)&PyUnicode_Type, r);
    Py_DECREF(r);
    PyObject* m = PyObject_Call((PyObject*)&PyMap_Type, a, NULL);
    Py_DECREF(a);

    a = PyTuple_Pack(1, m);
    Py_DECREF(m);
    PyObject* l = PyObject_Call((PyObject*)&PyList_Type, a, NULL);
    Py_DECREF(a);

    Py_DECREF(l);
}

The next thing I did was to eliminate the list again, and simply iterate through the map object. This brings the time down to 1.86s

Then I included and inlined the map iteration function . This didn't affect the runtime, which is now 1.87s .

The next thing to optimize is the feature that map() can take a variable number of arguments, which slows down its processing slightly. Hard-coding that this map has exactly one argument reduces the time slightly to 1.82s .

The next thing I did is to hoist some of the map iteration code out of the loop: map_next() typically has to reread the relevant variables from memory every iteration, so I thought doing that once outside the loop would help. Surprisingly the runtime is the same: 1.82s .

Next I copied and inlined _PyObject_FastCallDict and rangeiter_next . Surprisingly, switching to the copied version of _PyObject_FastCallDict slowed the program down noticeably, to 1.88s . My hypothesis is that this is because my extension module doesn't have PGO enabled, whereas I believe the Ubuntu Python build does have PGO enabled, so the copied code now has fewer optimizations.

Next I optimized _PyObject_FastCallDict for this particular case, removing some checks that are constant once we know the function we are calling. This simulates static typing, or speculative type inference in the case of a JIT. This didn't make much of a difference: 1.87s .

Now we're getting to the heart of the problem. str() is slow because it's not a function: it's a type. And types in Python have complex calling behavior since they execute Python's two-phase construction, allowing overrides at each step. This is rather unfortunate for common type conversions, since they do not do much work but invoke fairly expensive constructor behavior. I inlined type_call as well as unicode_new , and the performance is roughly the same: 1.86s .

Now that we have all the code in one place, we can see both the argument packing and parsing+unpacking. Python has a traditional calling convention, and a faster newer calling convention. Unfortunately calling a type falls back to the slower calling convention, requiring the allocation of an args tuple, and parsing it on the receiving side. I optimized the current benchmark by hard-coding the argument parsing. This had a large impact: the runtime is now 1.46s .

I inlined PyObject_Str , and this again slowed performance down: 1.52s . Again, I think this is due to the lack of PGO.

Next I optimized based on the assumption that str(int) returns a string object. This reduced the runtime to 1.40s

Now that we've removed enough code we can finally make another big optimization: no longer allocating the args tuple This cuts runtime down to 1.15s .

Next I hoisted the recursion-checking code out of the loop: 0.98s

Including and inlining long_to_decimal_string : 0.94s

Converting the range iteration to a C for loop : 0.91s

Removing the unicode conversion (the real work) : 0.27s

Not allocating a Python int for each iteration of the loop : 0.0s

So that's everything

For reference: NodeJS takes 1.38s to run an equivalent benchmark . And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.

To be continued...

I believe that these findings point towards a certain direction for Python optimization. I have some ideas in this space and hopefully you'll hear more about them on this blog!

tldr

The microbenchmark

Results

To be continued...

Recommend

Microsoft Project Reunion

Preview minimal CSS frameworks with this drop-in switcher

从 Macintosh 到 Mac OS：Mac 操作系统视觉史

人大代表杨元庆提六项建议，发展智慧经济推动产业跃迁

触控板玩的溜真能比鼠标效率高？

国际标准化组织发展史 | 戈吉

Traefik 2.0 两层Traffic应用实践 - konakona

C语言之指针 - TechZone

GitHub - axipo/mini-status: 小程序miniStatus的后端项目，欢迎star

美国不太愿意与穷国分享新冠疫苗知识产权

About Joyk