Python 3.11 will boost performance at the cost of a little more memory: speed gains appear to be between 10% and 60%


In an effort to improve the performance of the Python programming language, Microsoft is funding the Faster CPython project. Its members include Python inventor Guido van Rossum, Microsoft Senior Software Engineer Eric Snow, and Mark Shannon, who is under contract with Microsoft as the project’s technical lead. Python is widely known for being slow. “While Python will never match the performance of low-level languages like C, Fortran, or even Java, we would like it to be competitive with fast implementations of scripting languages, like V8 for JavaScript,” says Mark Shannon.

To be efficient, virtual machines for dynamic languages must specialize the code they execute based on the types and values of the program they are running. This specialization is often associated with just-in-time (JIT) compilers, but it is also beneficial without generating machine code.

Specialization improves performance, while adaptation lets the interpreter change quickly when the usage pattern of a program changes, limiting the amount of extra work caused by mis-specialization.

A session at EuroPython, Europe’s largest Python conference, to be held in Dublin in July, will focus on some of the changes that make this speedup possible. Shannon will describe the specializing adaptive interpreter of Python 3.11, which is specified in PEP (Python Enhancement Proposal) 659. The technique, called specialization, is usually applied in the context of a compiler, Shannon explains, but research shows that specializing inside an interpreter can also increase performance significantly.

This PEP proposes to use an adaptive interpreter that specializes code dynamically, but over a very small region, and that is able to adapt to bad specialization quickly and inexpensively.

“The addition of a specializing, adaptive interpreter to CPython will bring significant performance improvements. It is difficult to give meaningful figures, because they depend heavily on the benchmarks and on work that has not yet been done. Extensive experimentation suggests speedups of up to 50%. Even if the speedup were only 25%, it would still be a worthwhile improvement,” says Shannon.

“Specifically, we want to achieve these performance goals with CPython for the benefit of all Python users, including those who cannot use PyPy or other alternative virtual machines,” he adds. When Devclass spoke to Pablo Galindo, Python Steering Council member and lead developer of the new Memray memory profiler, he described how the Python team is using Microsoft’s work in version 3.11.

“One of the things we’re doing is making the interpreter faster,” says Galindo. “But it will also use a little more memory, just a little, because most of these optimizations have some sort of memory cost: either we have to store things for later use, or we have an optimized version but someone may sometimes need to request the non-optimized version for debugging, so we need to store both.”

“Achieving these performance goals is a long way off and will require a great deal of engineering effort, but we can take a significant step towards them by speeding up the interpreter. Academic research and practical implementations have shown that a fast interpreter is a key component of a fast virtual machine,” Shannon said.

Virtual machine acceleration

Typical optimizations for virtual machines are expensive, so a long “warm-up” time is needed to be confident that the cost of optimization is justified. To achieve speedups quickly, without noticeable warm-up times, the VM should speculate that specialization is justified even after only a few executions of a function. To do this, the interpreter must be able to optimize and de-optimize continually and very cheaply. By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, the Python team obtained a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.

“There are many practical ways to speed up a virtual machine for a dynamic language. However, specialization is the most important, both in itself and as an enabler of other optimizations. So it makes sense to focus our efforts on specialization first if we want to improve CPython’s performance,” says the Faster CPython project team. Specialization is usually done in the context of a JIT compiler, but research shows that specialization in an interpreter can improve performance significantly and even outperform a naive compiler.

Several methods have been proposed in the academic literature, but most attempt to optimize regions larger than a single bytecode. Using regions larger than a single instruction requires code to handle deoptimization in the middle of a region. Specialization at the level of individual bytecodes makes deoptimization trivial, as it cannot happen in the middle of a region.

By speculatively specializing individual bytecodes, we can achieve significant performance improvements with nothing more than the most local de-optimizations, which are trivial to implement. The closest approach to this PEP in the literature is “Inline Caching meets Quickening”. This PEP has the benefits of inline caching, but adds the ability to de-optimize quickly, making performance more robust in cases where specialization fails or is unstable.

The speedup due to specialization is difficult to determine, as many specializations depend on other optimizations. Speed gains appear to be between 10% and 60%. Most of them come directly from specialization. The biggest contributors are speedups to attribute lookup, global variable access, and calls.


Adaptive instructions

Each instruction that would benefit from specialization is replaced by an adaptive version during quickening. For example, the LOAD_ATTR instruction is replaced by LOAD_ATTR_ADAPTIVE. Each adaptive instruction periodically attempts to specialize itself.
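On an interpreter that implements this scheme, the replacement can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.11 or later, where `dis.get_instructions` accepts an `adaptive=True` argument. The `Point` class and the `read_x` and `opnames_after_warmup` functions are hypothetical examples, and the exact instruction names reported differ between CPython versions.

```python
import dis
import sys


class Point:
    """Hypothetical example class with one instance attribute."""

    def __init__(self, x):
        self.x = x


def read_x(p):
    # The attribute access below compiles to a LOAD_ATTR instruction.
    return p.x


def opnames_after_warmup(func, arg, calls=100):
    """Call func repeatedly, then list the instruction names it now contains."""
    for _ in range(calls):
        func(arg)
    if sys.version_info >= (3, 11):
        # adaptive=True asks dis to show adaptive/specialized forms, if any.
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        # Older interpreters have no adaptive view; show the plain bytecode.
        instructions = dis.get_instructions(func)
    return [instr.opname for instr in instructions]


if __name__ == "__main__":
    for name in opnames_after_warmup(read_x, Point(1)):
        print(name)
```

On a specializing interpreter, the attribute load is typically reported under a name that starts with `LOAD_ATTR` but carries a suffix; on older versions it remains plain `LOAD_ATTR`.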


The CPython bytecode contains many instructions that represent high-level operations and would benefit from specialization. Examples include CALL, LOAD_ATTR, LOAD_GLOBAL, and BINARY_ADD.
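To get a feel for how common these instructions are, the sketch below counts them in a small function using the standard `dis` module. It is only an approximation: opcode names vary between CPython versions (for example, `BINARY_ADD` was later folded into `BINARY_OP`), so the code matches on name prefixes. The `total_x` function and the `specialization_candidates` helper are hypothetical names introduced for this illustration.

```python
import dis
from collections import Counter

# Prefixes of the instruction families named in the text. Prefix matching is
# an assumption made here to cope with opcode renames across versions.
FAMILY_PREFIXES = ("CALL", "LOAD_ATTR", "LOAD_GLOBAL", "BINARY_")


def total_x(points):
    """Hypothetical example: uses a global, an attribute, a call and an add."""
    total = 0
    for p in points:
        total = total + abs(p.x)
    return total


def specialization_candidates(func):
    """Count the instructions of func that fall into one of the families."""
    counts = Counter()
    for instr in dis.get_instructions(func):
        for prefix in FAMILY_PREFIXES:
            if instr.opname.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


if __name__ == "__main__":
    for prefix, count in specialization_candidates(total_x).items():
        print(prefix, count)
```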

Introducing a family of specialized instructions for each of these instructions allows for effective specialization, since each new instruction is specialized to a single task. Each family will include an “adaptive” instruction, which holds a counter and attempts to specialize itself when that counter reaches zero. Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster, provided their inputs are as expected.

Each specialized instruction maintains a saturating counter, which is incremented whenever the inputs are as expected. If the inputs are not as expected, the counter is decremented and the generic operation is performed. If the counter reaches its minimum value, the instruction is deoptimized simply by replacing its opcode with the adaptive version.
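The life cycle described above can be modelled in a few lines of Python. This is a toy simulation, not CPython's implementation: the opcode names `ADD_ADAPTIVE` and `ADD_INT`, the counter constants, and the `Instruction`, `execute` and `trace` names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# All constants and opcode names below are hypothetical.
ADAPTIVE_DELAY = 2   # executions before the adaptive form tries to specialize
COUNTER_START = 2    # counter given to a freshly specialized instruction
COUNTER_MAX = 3      # saturation ceiling
COUNTER_MIN = 0      # reaching this value triggers deoptimization


@dataclass
class Instruction:
    opcode: str = "ADD_ADAPTIVE"
    counter: int = ADAPTIVE_DELAY


def generic_add(a, b):
    """The slow, fully general operation."""
    return a + b


def execute(instr, a, b):
    """Run one instruction, updating its opcode and counter as a side effect."""
    if instr.opcode == "ADD_ADAPTIVE":
        if instr.counter == 0:
            # Counter hit zero: try to specialize for the types seen now.
            if type(a) is int and type(b) is int:
                instr.opcode, instr.counter = "ADD_INT", COUNTER_START
            else:
                instr.counter = ADAPTIVE_DELAY
        else:
            instr.counter -= 1
        return generic_add(a, b)

    # Specialized form ADD_INT: guard on the expected input types.
    if type(a) is int and type(b) is int:
        instr.counter = min(instr.counter + 1, COUNTER_MAX)  # saturate
        return int.__add__(a, b)
    # Guard failed: count the miss and fall back to the generic operation.
    instr.counter -= 1
    if instr.counter <= COUNTER_MIN:
        # Deoptimize by swapping the opcode back to the adaptive form.
        instr.opcode, instr.counter = "ADD_ADAPTIVE", ADAPTIVE_DELAY
    return generic_add(a, b)


def trace(instr, inputs):
    """Return the opcode held by instr after each execution."""
    history = []
    for a, b in inputs:
        execute(instr, a, b)
        history.append(instr.opcode)
    return history
```

Feeding the instruction a few integer pairs followed by string pairs shows it specializing to `ADD_INT` and then falling back to `ADD_ADAPTIVE` once the misses exhaust the counter. Because only the opcode of a single instruction changes, there is no larger region to unwind.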

Auxiliary data

Most families of specialized instructions require more information than an 8-bit operand can hold. To provide it, a number of 16-bit entries immediately following the instruction are used to store this data. This is a form of inline cache, an “inline data cache”. Unspecialized, or adaptive, instructions use the first entry of this cache as a counter and simply skip over the others.
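A rough picture of this layout can be built with an array of 16-bit units. The sketch below is an illustration under stated assumptions, not CPython's actual encoding: the opcode numbers, the number of cache entries per instruction, the bit positions of opcode and operand, and the `pack`, `assemble` and `walk` helpers are all hypothetical.

```python
from array import array

# Hypothetical opcode numbers and cache sizes, for illustration only.
LOAD_FAST, LOAD_ATTR_ADAPTIVE, RETURN_VALUE = 1, 2, 3
CACHE_ENTRIES = {LOAD_FAST: 0, LOAD_ATTR_ADAPTIVE: 4, RETURN_VALUE: 0}


def pack(opcode, oparg):
    """Pack an 8-bit opcode and an 8-bit operand into one 16-bit unit."""
    if not (0 <= opcode < 256 and 0 <= oparg < 256):
        raise ValueError("opcode and oparg must each fit in 8 bits")
    return (opcode << 8) | oparg


def assemble(instructions):
    """Lay out each instruction followed by its zeroed 16-bit cache entries."""
    code = array("H")  # unsigned 16-bit units
    for opcode, oparg in instructions:
        code.append(pack(opcode, oparg))
        code.extend([0] * CACHE_ENTRIES[opcode])
    return code


def walk(code):
    """Yield (offset, opcode, oparg, cache) and skip over the cache entries."""
    offset = 0
    while offset < len(code):
        unit = code[offset]
        opcode, oparg = unit >> 8, unit & 0xFF
        size = CACHE_ENTRIES[opcode]
        cache = list(code[offset + 1:offset + 1 + size])
        yield offset, opcode, oparg, cache
        offset += 1 + size


if __name__ == "__main__":
    code = assemble([(LOAD_FAST, 0), (LOAD_ATTR_ADAPTIVE, 1), (RETURN_VALUE, 0)])
    code[2] = 8  # first cache entry of the adaptive instruction: its counter
    for row in walk(code):
        print(row)
```

Keeping the cache directly after the instruction means the interpreter finds the data at a fixed offset from the instruction it is executing, and an instruction that does not need the data only has to step over it.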


Memory usage

An obvious concern with any system that does some sort of caching is: how much extra memory is it using?

Memory usage comparison with 3.10

CPython 3.10 used 2 bytes per instruction until the execution count reached ~2000, at which point it allocated another byte per instruction, plus 32 bytes for each instruction with a cache (LOAD_GLOBAL and LOAD_ATTR).

The following table shows the additional bytes per instruction needed to support the 3.10 opcache or the proposed adaptive interpreter, on a 64-bit machine.

“3.10 cold” is the state before the code has reached the ~2000 limit. “3.10 hot” shows the cache usage once the threshold has been reached. Relative memory usage depends on how much of the code is hot enough to trigger creation of the cache in 3.10. The break-even point, where the memory used by 3.10 is the same as for 3.11, is ~70%. It should also be noted that the actual bytecode is only part of a code object. Code objects also include names, constants, and quite a lot of debugging information.
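The 3.10 side of this comparison can be worked through with the figures quoted in the text: 2 bytes per instruction while cold, then one extra byte per instruction and 32 bytes per cached instruction once hot. The sketch below covers only those 3.10 figures, since the per-instruction cost for 3.11 is not given in the text. The function name `bytes_used_310` and the example share of cached instructions (150 out of 1000) are assumptions made for illustration.

```python
def bytes_used_310(n_instructions, n_cached, hot):
    """Bytes used by a 3.10 code body, using the figures quoted in the text.

    n_cached is the number of instructions that get a cache entry
    (LOAD_GLOBAL and LOAD_ATTR) once the code becomes hot.
    """
    if not 0 <= n_cached <= n_instructions:
        raise ValueError("n_cached must be between 0 and n_instructions")
    cold = 2 * n_instructions            # 2 bytes per instruction
    if not hot:
        return cold
    # Hot code: one extra byte per instruction, 32 bytes per cached one.
    return cold + n_instructions + 32 * n_cached


if __name__ == "__main__":
    # Hypothetical function body: 1000 instructions, 150 of them cached.
    print(bytes_used_310(1000, 150, hot=False))  # 2000
    print(bytes_used_310(1000, 150, hot=True))   # 7800
```

With these example numbers, hot code in 3.10 costs 7.8 bytes per instruction against 2 bytes for cold code, which is why the comparison with 3.11 depends so strongly on how much of a program ever becomes hot.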

In summary, for most applications, where many functions are relatively unused, version 3.11 will consume more memory than version 3.10.

Source: Python

And you?

What is your opinion on the subject?

What do you think of the Faster CPython project?

Version 3.11 will consume more memory than version 3.10. What do you think?

In your opinion, is it interesting to increase performance at the cost of a little more memory?

