PEP 703 – Making the Global Interpreter Lock Optional in CPython
source link: https://peps.python.org/pep-0703/
Specification
Build Configuration Changes
The global interpreter lock will remain the default for CPython builds and python.org downloads. A new build configuration flag, --without-gil, will be added to the configure script to build CPython without the global interpreter lock.
When built with --without-gil, CPython will define the Py_NOGIL macro in Python/patchlevel.h. The ABI tag will include the letter “n” (for “nogil”).
Overview of CPython Changes
Removing the global interpreter lock requires substantial changes to CPython internals, but relatively few changes to the public Python and C APIs. This section describes the required changes to the CPython implementation followed by the proposed API changes.
The implementation changes can be grouped into the following four categories:
- Reference counting
- Memory management
- Container thread-safety
- Locking and atomic APIs
Reference Counting
Removing the GIL requires changes to CPython’s reference counting implementation to make it thread-safe. Furthermore, it needs to have low execution overhead and allow for efficient scaling with multiple threads. This PEP proposes a combination of three techniques to address these constraints. The first is a switch from plain non-atomic reference counting to biased reference counting, which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting. The other two techniques are immortalization and a limited form of deferred reference counting; they address some of the multi-threaded scalability issues with reference counting by avoiding some reference count modifications.
Biased reference counting (BRC) is a technique first described in 2018 by Jiho Choi, Thomas Shull, and Josep Torrellas [3]. It is based on the observation that most objects are only accessed by a single thread, even in multi-threaded programs. Each object is associated with an owning thread (the thread that created it). Reference counting operations from the owning thread use non-atomic instructions to modify a “local” reference count. Other threads use atomic instructions to modify a “shared” reference count. This design avoids many atomic read-modify-write operations that are expensive on contemporary processors.
The implementation of BRC proposed in this PEP largely matches the original description of biased reference counting, but differs in details like the size of reference counting fields and special bits in those fields. BRC requires storing three pieces of information in each object’s header: the “local” reference count, the “shared” reference count, and the identifier of the owning thread. The BRC paper packs these three things into a single 64-bit field. This PEP proposes using three separate pointer-sized fields (i.e., three 64-bit fields on 64-bit platforms) in each object’s header to avoid potential issues due to reference count overflow.
The proposed PyObject struct (also called struct _object) is below:
struct _object {
    _PyObject_HEAD_EXTRA
    uintptr_t ob_tid;
    Py_ssize_t ob_ref_local;
    Py_ssize_t ob_ref_shared;
    PyTypeObject *ob_type;
};
The details of the new fields are described in the following sections.
Immortalization
Some objects, such as interned strings, small integers, statically allocated PyTypeObjects, and the True, False, and None objects stay alive for the lifetime of the program. These objects are marked as immortal by setting the local reference count field (ob_ref_local) to -1 and the thread id (ob_tid) to the unsigned equivalent (UINTPTR_MAX). It’s sufficient to check either of these fields to determine if an object is immortal, which enables slightly more efficient Py_INCREF and Py_DECREF implementations.
The Py_INCREF and Py_DECREF macros are no-ops for immortal objects. This avoids contention on the reference count fields of these objects when multiple threads access them concurrently.
This proposed immortalization scheme is very similar to PEP 683, but with slightly different bit representation in the reference count fields for immortal objects in order to work with biased reference counting and deferred reference counting.
Biased Reference Counting
Biased reference counting has a fast path for objects “owned” by the current thread and a slow path for other objects. Ownership is indicated by the ob_tid field. Determining the thread id requires platform-specific code [5]. Two special values for ob_tid are -1 and 0. A value of -1 indicates the object is immortal (see Immortalization) and a value of 0 indicates that the object is not owned by any thread.
Threads must give up ownership of an object before that object can be destroyed. Ownership is most commonly given up when the local reference count reaches zero, but can also be requested by other threads. Threads give up ownership by setting ob_tid to zero and adding the local reference count to the shared reference count. If the combined reference count is zero, the object can be deallocated. Otherwise, only the shared reference count field is used from that point onwards.
The ob_ref_local field stores the local reference count and two flags. The two most significant bits are used to indicate the object is immortal or uses deferred reference counting (see Deferred Reference Counting).
The ob_ref_shared field stores the shared reference count and two flags. The two least significant bits are used to indicate if the object is “merged” or “queued.” The shared reference count is therefore shifted left by two. The ob_ref_shared field uses the least significant bits because the shared reference count can be temporarily negative; increfs and decrefs may not be balanced between threads.
If ob_ref_shared becomes negative, the current thread requests that the owning thread merge the two fields. It atomically pushes the object to the owning thread’s queue of objects to be merged and sets the “queued” bit on ob_ref_shared (to prevent duplicate queueings). The owning thread is notified via the eval_breaker mechanism. In practice, this operation is rare. Most objects are only accessed by a single thread, and those objects accessed by multiple threads rarely have negative shared reference counts.
The proposed Py_INCREF and Py_DECREF operations should behave as follows (using C-like pseudo-code):
// low two bits of "ob_ref_shared" are used for flags
#define _Py_SHARED_SHIFT 2

void Py_INCREF(PyObject *op)
{
    Py_ssize_t new_local = op->ob_ref_local + 1;
    if (new_local == 0)
        return;  // object is immortal
    if (op->ob_tid == _Py_ThreadId())
        op->ob_ref_local = new_local;
    else
        atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
}

void Py_DECREF(PyObject *op)
{
    if (op->ob_tid == -1) {
        return;  // object is immortal
    }
    if (op->ob_tid == _Py_ThreadId()) {
        op->ob_ref_local -= 1;
        if (op->ob_ref_local == 0) {
            _Py_MergeZeroRefcount();  // merge refcount
        }
    }
    else {
        _Py_DecRefShared();  // slow path
    }
}
The reference implementation [17] contains implementations of _Py_MergeZeroRefcount and _Py_DecRefShared.
Note that the above is pseudocode: in practice, the implementation should use “relaxed atomics” to access ob_tid and ob_ref_local to avoid undefined behavior in C and C++.
Deferred Reference Counting
A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently. These objects don’t necessarily live for the lifetime of the program, so immortalization is not a good fit. This PEP proposes a limited form of deferred reference counting to avoid contention on these objects’ reference count fields in multi-threaded programs.
Typically, the interpreter modifies objects’ reference counts as they are pushed to and popped from the interpreter’s stack. The interpreter skips these reference counting operations for objects that use deferred reference counting. Objects that support deferred reference counting are marked by setting the second-most significant bit in the local reference count field to one.
Because some reference counting operations are skipped, the reference count fields no longer reflect the true number of references to these objects. The true reference count is the sum of the reference count fields plus any skipped references from each thread’s interpreter stack. The true reference count can only be safely computed when all threads are paused during cyclic garbage collection. Consequently, objects that use deferred reference counting can only be deallocated during garbage collection cycles.
Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.
Garbage Collector Modifications for Deferred Reference Counting
The tracing garbage collector finds and deallocates unreferenced objects. Currently, the tracing garbage collector only finds unreferenced objects that are part of a reference cycle. With deferred reference counting, the tracing garbage collector will also find and collect some unreferenced objects that may not be part of any reference cycle, but whose collection has been delayed due to deferred reference counting. This requires that all objects that support deferred reference counting also have a corresponding type object that supports tracing garbage collection (through the Py_TPFLAGS_HAVE_GC flag). Additionally, the garbage collector will need to traverse each thread’s stack to add references to the GC reference count at the start of each collection.
Reference Counting Type Objects
Type objects (PyTypeObject) use a mix of reference counting techniques. Statically allocated type objects are immortalized because the objects already live for the lifetime of the program. Heap type objects use deferred reference counting in combination with per-thread reference counting. Deferred reference counting is not sufficient to address the multi-threaded scaling bottlenecks with heap types because most references to heap types are from object instances, not references on the interpreter stack.
To address this, heap type reference counts are partially stored in a distributed manner in per-thread arrays. Every thread stores an array of local reference counts for each heap type object. Heap type objects are assigned a unique number that determines their position in the local reference count arrays. A heap type’s true reference count is the sum of its entries in the per-thread arrays, plus the reference count on the PyTypeObject, plus any deferred references in the interpreter stack.
Threads may grow their own type reference count arrays as needed when incrementing or decrementing the local reference count of a type object.
Use of the per-thread reference count arrays is limited to a few places:
- PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems): Increments the current thread’s local reference count for type, if it is a heap type.
- subtype_dealloc(PyObject *self): Decrements the current thread’s local reference count for self->ob_type, if the type is a heap type.
- gcmodule.c: Adds each thread’s local reference counts to the gc_refs count for the corresponding heap type object.
Additionally, when a thread terminates, it adds any non-zero local reference counts to each type object’s own reference count field.
Memory Management
CPython currently uses an internal allocator, pymalloc, which is optimized for small object allocation. The pymalloc implementation is not thread-safe without the GIL. This PEP proposes replacing pymalloc with mimalloc, a general-purpose thread-safe allocator with good performance, including for small allocations.
Using mimalloc, with some modifications, also addresses two other issues related to removing the GIL. First, traversing the internal mimalloc structures allows the garbage collector to find all Python objects without maintaining a linked list. This is described in more detail in the garbage collection section. Second, mimalloc heaps and allocations based on size class enable collections like dict to generally avoid acquiring locks during read-only operations. This is described in more detail in the collection thread-safety section.
CPython already requires that objects that support garbage collection use the GC allocator APIs (typically indirectly by calling PyType_GenericAlloc). This PEP would add additional requirements to the use of the Python allocator APIs. First, Python objects must be allocated through object allocation APIs, such as PyType_GenericAlloc, PyObject_Malloc, or other Python APIs that wrap those calls. Python objects should not be allocated through other APIs, such as raw calls to C’s malloc or the C++ new operator. Additionally, PyObject_Malloc should be used only for allocating Python objects; it should not be used for allocating buffers, storages, or other data structures that are not PyObjects.
This PEP also imposes restrictions on the pluggable allocator API (PyMem_SetAllocator). When compiling without the GIL, allocators set using this API must eventually delegate the allocation to the corresponding underlying allocator, such as PyObject_Malloc, for Python object allocations. This allows for allocators that “wrap” underlying allocators, such as Python’s tracemalloc and debug allocator, but not for wholly replacing the allocator.
CPython Free Lists
CPython makes use of free lists to speed up the allocation of small, frequently allocated objects like tuples and numbers. These free lists are not thread-safe and will need to be disabled when building Python in the --without-gil mode.
Garbage Collection (Cycle Collection)
The CPython garbage collector requires the following changes to work with this proposal:
- Use of “stop-the-world” to provide thread-safety guarantees that were previously provided by the GIL.
- Elimination of generational garbage collection in favor of a non-generational collector.
- Integration with deferred reference counting and biased reference counting.
Stop-the-World
The CPython cycle garbage collector currently relies on the global interpreter lock to prevent other threads from accessing Python objects while the collector finds cycles. The GIL is never released during the cycle-finding routine, so the collector can rely on stable (i.e., unchanging) reference counts and references for the duration of that routine. However, following cycle detection, the GIL may be temporarily released while calling objects’ finalizers and clear (tp_clear) functions, allowing other threads to run in an interleaved fashion.
When running without the GIL, the implementation needs a way to ensure that reference counts remain stable during cycle detection. Threads running Python code must be paused to ensure that references and reference counts remain stable. Once the cycles are identified, other threads are resumed.
The current CPython cyclic garbage collector involves two cycle-detection passes during each garbage collection cycle. Consequently, this requires two stop-the-world pauses when running the garbage collector without the GIL. The first cycle-detection pass identifies cyclic trash. The second pass runs after finalizers to identify which objects still remain unreachable. Note that other threads are resumed before finalizers and tp_clear functions are called to avoid introducing potential deadlocks that are not present in the current CPython behavior.
Thread States
To support pausing threads for garbage collection, the PyThreadState gets a new “status” field. Like the other fields in PyThreadState, the status field is not part of the public CPython API. The status field may be in one of three states:
- ATTACHED
- DETACHED
- GC
The ATTACHED and DETACHED states correspond closely to acquiring and releasing the global interpreter lock. When compiling without the GIL, functions that previously acquired the GIL instead transition the thread state to ATTACHED, and functions that previously released the GIL transition the thread state to DETACHED. Just as threads previously needed to acquire the GIL before accessing or modifying Python objects, they now must be in the ATTACHED state before accessing or modifying Python objects. Since the same public C-API functions “attach” the thread as previously acquired the GIL (e.g., PyEval_RestoreThread), the requirements for thread initialization in extensions remain the same. The substantial difference is that multiple threads can be in the attached state simultaneously, while previously only one thread could acquire the GIL at a time.
During stop-the-world pauses, the thread performing garbage collection needs to ensure that no other thread is accessing or modifying Python objects. All other threads must be in the “GC” state. The garbage collection thread can transition other threads from the DETACHED state to the GC state using an atomic compare-and-swap operation on the status field. Threads in the ATTACHED state are requested to pause themselves and set their status to “GC”, using the existing “eval breaker” mechanism. At the end of the stop-the-world pause, all threads in the “GC” state are set to DETACHED and woken up if they are paused. Threads that were previously attached (i.e., executing Python bytecode) can re-attach (set their thread states to ATTACHED) and resume executing Python code. Threads that were previously DETACHED ignore the notification.
Generations
The existing Python garbage collector uses three generations. When compiling without the GIL, the garbage collector will only use a single generation (i.e., it will be non-generational). The primary reason for this change is to reduce the impact of the stop-the-world pauses in multithreaded applications. Frequent stop-the-world pauses for collecting the young generation would have more of an impact on multi-threaded applications than less frequent collections.
Integration With Deferred and Biased Reference Counting
To find unreferenced objects, the cyclic garbage collector computes the difference between the number of incoming references and the object’s reference count. This difference is called gc_refs and is stored in the _gc_prev field. If gc_refs is greater than zero, then the object is guaranteed to be alive (i.e., not cyclic trash). If gc_refs is zero, then the object is only alive if it is transitively referenced by another live object. When computing this difference, the collector should traverse each thread’s stack, and for every deferred reference, increment the gc_refs for the referred object. Since generator objects also have stacks with deferred references, the same procedure is applied to each generator’s stack.
Python unit tests commonly use gc.collect() to ensure that any unreferenced objects are destructed and their finalizers run. Since biased reference counting can delay the destruction of some objects that are referenced by multiple threads, it’s convenient to ensure that those objects are destructed during garbage collection, even though they may not be part of any reference cycles. While other threads are paused, the garbage collector thread should merge the reference counts for any queued objects, but not call any destructors even if the combined reference count is zero. (Calling destructors while other threads are paused risks introducing deadlocks.) Once other threads are resumed, the GC thread should call _Py_Dealloc on those objects with a zero merged reference count.
Container Thread-Safety
In CPython, the global interpreter lock protects against corruption of internal interpreter states when multiple threads concurrently access or modify Python objects. For example, if multiple threads concurrently modify the same list, the GIL ensures that the length of the list (ob_size) accurately matches the number of elements, and that the reference counts of each element accurately reflect the number of references to those elements. Without the GIL — and absent other changes — concurrent modifications would corrupt those fields and likely lead to program crashes.
The GIL does not necessarily ensure that operations are atomic or remain correct when multiple operations occur concurrently. For example, list.extend(iterable) may not appear atomic if the iterable has an iterator implemented in Python (or releases the GIL internally). Similarly, list.remove(x) can remove the wrong object if it overlaps with another operation that modifies the list, depending on the implementation of the equality operator. Still, the GIL ensures that some operations are effectively atomic. For example, the constructor list(set) atomically copies the items of the set to a new list, and some code relies on that copy being atomic (i.e., having a snapshot of the items in the set). This PEP preserves that property.
This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated (lightweight) lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.
Not all Python objects require locks. For example, immutable objects like tuples, strings, and numbers do not require a lock.
Per-object locks with critical sections provide weaker protections than the GIL. Because the GIL doesn’t necessarily ensure that concurrent operations are atomic or correct, the per-object locking scheme also cannot ensure that concurrent operations are atomic or correct. Instead, per-object locking aims for similar protections as the GIL, but with mutual exclusion limited to individual objects.
Most operations on an instance of a container type require locking that object. For example:
- list.append, list.insert, list.repeat, PyList_SetItem
- dict.__setitem__, PyDict_SetItem
- list.clear, dict.clear
- list.__repr__, dict.__repr__, etc.
- list.extend(iterable)
- setiter_iternext
Some operations operate directly on two container objects, with knowledge about both containers’ internal structure. For example, there are internal specializations of list.extend(iterable) for specific iterable types, like set. These operations need to lock both container objects because they access the internals of both objects simultaneously. Note that the generic implementation of list.extend only needs to lock one object (the list) because the other object is accessed indirectly through the thread-safe iterator API. Operations that lock two containers are:
- list.extend(list), list.extend(set), list.extend(dictitems), and other specializations where the implementation is specialized for the argument type
- list.concat(list)
- list.__eq__(list), dict.__eq__(dict)
Some simple operations can be implemented directly with atomic accesses and do not need locks because they only access a single field. These operations include:
- len(list), i.e., list_length(PyListObject *a)
- len(dict)
- len(set)
A select few operations optimistically avoid locking to improve performance. These require special implementations and cooperation from the memory allocator:
- list[idx] (list_subscript)
- dict[key] (dict_subscript)
- listiter_next, dictiter_iternextkey/value/item
- list.contains
Borrowed References
Per-object locking provides many of the important protections that the GIL provides, but there are a few cases where it’s not sufficient. For example, code that relies on upgrading a borrowed reference to an “owned” reference may be unsafe in certain circumstances:
PyObject *item = PyList_GetItem(list, idx);
Py_INCREF(item);
The GIL ensures that no other thread can modify the list in between the access and the Py_INCREF call. Without the GIL – even with per-object locking – another thread might modify the list, leading to item being freed between the access and the Py_INCREF call.
The problematic borrowed reference APIs are supplemented with functions that return “new references” but are otherwise equivalent:
- PyList_FetchItem(list, idx) for PyList_GetItem
- PyDict_FetchItem(dict, key) for PyDict_GetItem
- PyWeakref_FetchObject for PyWeakref_GetObject
Note that some APIs that return borrowed references, such as PyTuple_GetItem, are not problematic because tuples are immutable. Similarly, not all uses of the above APIs are problematic. For example, PyDict_GetItem is often used for parsing keyword argument dictionaries in function calls; those keyword argument dictionaries are effectively private (not accessible by other threads).
Python Critical Sections
Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.
This PEP proposes a scheme called “Python critical sections” to implicitly release per-object locks to avoid deadlocks. To understand the scheme, we first introduce a general approach to avoid deadlocks, and then propose a refinement of that approach with better performance.
One way to avoid deadlocks is to allow threads to hold only the lock (or locks) for a single operation at a time (typically a single lock, but some operations involve two locks as described above). When a thread begins a nested operation it should suspend the locks for any outer operation: before beginning the nested operation, the locks for the outer operation are released and when the nested operation completes, the locks for the outer operation are reacquired.
Additionally, the locks for any active operation should be suspended around potentially blocking operations, such as I/O (i.e., operations that would have released the GIL). This is because the interaction between locks and blocking operations can lead to deadlocks in the same way as the interaction between multiple locks.
To improve performance, this PEP proposes a variation of the above scheme that still avoids deadlocks. Instead of immediately suspending locks any time a nested operation begins, locks are only suspended if the thread would block (i.e., would have released the GIL). This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks.
The proposed API for Python critical sections is the following four macros. These are intended to be public (usable by C-API extensions), but not part of the limited API:
- Py_BEGIN_CRITICAL_SECTION(PyMutex m): Begins a critical section by acquiring the mutex m. If m is already locked, then locks for any outstanding critical sections are released before this thread waits for m to be unlocked.
- Py_END_CRITICAL_SECTION(PyMutex m): Ends the most recent operation, unlocking the mutex m. The most recent previous critical section (if any) is resumed if it is currently suspended.
- Py_BEGIN_CRITICAL_SECTION2(PyMutex m1, PyMutex m2): Begins a critical section by acquiring the mutexes m1 and m2. To ensure consistent lock ordering, the order of acquisition is determined by memory address (i.e., the mutex with the lower memory address is acquired first). If either mutex is already locked, then locks for any outstanding critical sections are released before this thread waits for m1 or m2 to be unlocked.
- Py_END_CRITICAL_SECTION2(PyMutex m1, PyMutex m2): Behaves the same as Py_END_CRITICAL_SECTION but unlocks two mutexes, m1 and m2.
Additionally, when a thread transitions from the ATTACHED state to the DETACHED state, it should suspend any active critical sections. When transitioning from DETACHED to ATTACHED, the most recent suspended critical section, if any, should be resumed.
Optimistically Avoiding Locking
A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary’s or list’s lock when another thread is concurrently modifying that container.
The operations with an optimistic fast path are:
- PyDict_FetchItem/GetItem and dict.__getitem__
- PyList_FetchItem/GetItem and list.__getitem__
Additionally, iterators for dict and list use the above functions, so they also optimistically avoid locking when returning the next item.
There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.
The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accesses to individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
This section describes the challenges with implementing dictionary and list accesses without locking followed by a description of this PEP’s changes to the Python interpreter required to address those challenges.
The main challenge is that retrieving an item from a list or dictionary and incrementing the reference count of that item is not an atomic operation. In between the time the item is retrieved and the reference count is incremented, another thread may modify the list or dictionary, possibly freeing the memory for the previously retrieved item.
A partial attempt at addressing this issue would be to convert the reference count increment to a conditional increment, only incrementing the reference count if it’s not zero. This change is not sufficient because when a Python object’s reference count reaches zero, the object’s destructor is called and the memory storing the object may be re-used for other data structures or returned to the operating system. Instead, this PEP proposes a technique to ensure that the reference count fields remain valid for the duration of the access, so that the conditional reference count increment is safe. This technique requires cooperation from the memory allocator (mimalloc) as well as changes to the list and dictionary objects. The proposed technique is similar to read-copy update (RCU) [6], a synchronization mechanism widely used in the Linux kernel.
The current implementation of list_item (the C function implementing list.__getitem__) is the following:
Py_INCREF(a->ob_item[i]);
return a->ob_item[i];
The proposed implementation uses the conditional increment (_Py_TRY_INCREF) and has additional checks:
PyObject **ob_item = atomic_load(&a->ob_item);
PyObject *item = atomic_load(&ob_item[i]);
if (!item || !_Py_TRY_INCREF(item)) goto retry;
if (item != atomic_load(&ob_item[i])) {
    Py_DECREF(item);
    goto retry;
}
if (ob_item != atomic_load(&a->ob_item)) {
    Py_DECREF(item);
    goto retry;
}
return item;
The “retry” subroutine implements the locked fallback path when concurrent modifications to the list cause the above fast, non-locking path to fail:
retry:
    PyObject *item;
    Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
    item = a->ob_item[i];
    Py_INCREF(item);
    Py_END_CRITICAL_SECTION(a->ob_mutex);
    return item;
The modifications to the dict implementation are similar, because the relevant parts of both list and dictionary retrieval involve loading an item/value from an array at a known index.
The additional checks following the conditional increment are necessary because the scheme allows immediate re-use of memory, including the memory that previously held a PyObject structure or list or dict array. Without these extra checks, the function might return a Python object that was never in the list, if the memory occupied by the Python object previously held a different PyObject whose memory previously stored an item in the list.
Mimalloc Changes for Optimistic list and dict Access
The implementation requires additional constraints on the memory allocator, including some changes to the mimalloc code. Some background on mimalloc’s implementation is helpful to understand the required changes. Individual allocations from mimalloc are called “blocks.” Mimalloc “pages” contain consecutive blocks that are all the same size. A mimalloc “page” is similar to a “superblock” in other allocators; it is NOT an operating system page. A mimalloc “heap” contains pages of various size classes; each page belongs to a single heap. If none of the blocks of a page are allocated, then mimalloc may re-use the page for a different size class or different heap (i.e., it might reinitialize the page).
The list and dictionary access scheme works by partially restricting re-use of mimalloc pages so that reference count fields remain valid for the duration of the access. The restricted re-use of mimalloc pages is enforced by having separate heaps for Python objects [7]. This ensures that even if an item is freed during access and the memory reused for a new object, the new object’s reference count field is placed at the same location in memory. The reference count field remains valid (or zero) across allocations.
Python objects that support cyclic garbage collection have two extra fields preceding the PyObject header, so their reference count fields are at a different offset from the start of their allocations. There are therefore two mimalloc heaps for Python objects, one for objects that support GC and one for objects that do not.
The backing arrays for lists and the PyDictKeysObject [8] for dictionaries face hazards similar to those of Python objects. Lists and dictionaries may be resized concurrently with accesses, reallocating the backing array or keys object. Thus, there are two additional mimalloc heaps: one for list arrays and one for dictionary keys objects. In total, there are five mimalloc heaps: two for Python objects (GC and non-GC), one for list arrays, one for dictionary keys, and the default mimalloc heap used for other allocations.
Mimalloc Page Reuse
It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.
This PEP proposes a system that lies between those two extremes based on FreeBSD’s “GUS” [9]. It uses a combination of global and per-thread counters (or “sequence numbers”) to coordinate the determination of when it is safe to reuse an empty mimalloc page for a different heap or for a different size class, or to return it to the operating system:
- There is a global write sequence number that monotonically increases.
- When a mimalloc page is empty, it’s tagged with the current write sequence number. The thread may also atomically increment the global write sequence number.
- Each thread has a local read sequence number that records the most recent write sequence number it has observed.
- Threads may observe the write sequence number whenever they are not in a list or dictionary access. The reference implementation does this in mimalloc’s slow-path allocation function. This is called regularly enough to be useful, but not so frequently as to introduce significant overhead.
- There is a global read sequence number that stores the minimum of all active threads’ read sequence numbers. A thread may update the global read sequence number by scanning each thread’s local read sequence number. The reference implementation does this before allocating a fresh mimalloc page if there are restricted pages that could possibly be reused.
- An empty mimalloc page may be reused for a different heap or size class when the global read sequence number is larger than the page’s tag number.
The condition that the global read sequence number is larger than the page’s tag is sufficient because it ensures that any thread that had a concurrent optimistic list or dictionary access is finished with that access. In other words, there are no threads accessing the empty blocks in the freed page, so the page can be used for any other purpose or even returned to the operating system.
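The sequence-number scheme described above can be sketched in a few lines of C11. This is a simplified model, not CPython's implementation: the names (`page_set_empty`, `thread_observe`, `page_may_reuse`) are invented, the thread count is fixed, and real code would compute the global read sequence number lazily rather than on every query.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 4   /* fixed for the sketch; real code tracks live threads */

/* Global write sequence number; monotonically increasing. Starts at 1 so
 * that a zero read sequence means "thread has not yet observed anything". */
static atomic_uint_fast64_t write_seq = 1;
/* Per-thread read sequence numbers: the latest write_seq each observed. */
static atomic_uint_fast64_t read_seqs[MAX_THREADS];

typedef struct { uint64_t tag; } page_t;

/* An empty page is tagged with the current write sequence number, and
 * the global write sequence number is advanced. */
static void page_set_empty(page_t *p)
{
    p->tag = atomic_fetch_add(&write_seq, 1);
}

/* Threads record the latest write sequence whenever they are outside any
 * optimistic list/dict access (e.g. in the allocator slow path). */
static void thread_observe(int tid)
{
    atomic_store(&read_seqs[tid], atomic_load(&write_seq));
}

/* The global read sequence number is the minimum over all threads. */
static uint64_t global_read_seq(void)
{
    uint64_t min = UINT64_MAX;
    for (int i = 0; i < MAX_THREADS; i++) {
        uint64_t s = atomic_load(&read_seqs[i]);
        if (s < min)
            min = s;
    }
    return min;
}

/* The page may be reused for another heap or size class only once every
 * thread has observed a write sequence number past the page's tag. */
static bool page_may_reuse(const page_t *p)
{
    return global_read_seq() > p->tag;
}
```

A thread whose read sequence number is still at or below the page's tag may have begun an optimistic access before the page became empty, which is exactly why the strict inequality is required.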
Optimistic dict and list Access Summary
This PEP proposes a technique for thread-safe list and dictionary accesses that typically avoids acquiring locks. This reduces execution overhead and avoids some multi-threaded scaling bottlenecks in common operations, like calling functions and methods. The scheme works by placing temporary restrictions on mimalloc page reuse to ensure that objects’ reference count fields remain valid after objects are freed so that conditional reference count increment operations are safe. The restrictions are placed on mimalloc pages instead of on individual objects to improve opportunities for memory reuse. The restrictions are lifted as soon as the system can determine that there are no outstanding accesses involving the empty mimalloc page. To determine this, the system uses a combination of lightweight per-thread sequence counters and also tags pages when they are empty. Once each thread’s local counter is larger than the page’s tag, it can be reused for any purpose or returned to the operating system. The restrictions are also lifted whenever the cyclic garbage collector runs because the stop-the-world pause ensures that threads do not have any outstanding references to empty mimalloc pages.