PEP 703 – Making the Global Interpreter Lock Optional in CPython

source link: https://peps.python.org/pep-0703/

Specification

Build Configuration Changes

The global interpreter lock will remain the default for CPython builds and python.org downloads. A new build configuration flag, --without-gil, will be added to the configure script to build CPython without the global interpreter lock.

When built with --without-gil, CPython will define the Py_NOGIL macro in Python/patchlevel.h. The ABI tag will include the letter “n” (for “nogil”).

Overview of CPython Changes

Removing the global interpreter lock requires substantial changes to CPython internals, but relatively few changes to the public Python and C APIs. This section describes the required changes to the CPython implementation followed by the proposed API changes.

The implementation changes can be grouped into the following four categories:

  • Reference counting
  • Memory management
  • Container thread-safety
  • Locking and atomic APIs

Reference Counting

Removing the GIL requires changes to CPython’s reference counting implementation to make it thread-safe. Furthermore, it needs to have low execution overhead and allow for efficient scaling with multiple threads. This PEP proposes a combination of three techniques to address these constraints. The first is a switch from plain non-atomic reference counting to biased reference counting, which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting. The other two techniques are immortalization and a limited form of deferred reference counting; they address some of the multi-threaded scalability issues with reference counting by avoiding some reference count modifications.

Biased reference counting (BRC) is a technique first described in 2018 by Jiho Choi, Thomas Shull, and Josep Torrellas [3]. It is based on the observation that most objects are only accessed by a single thread, even in multi-threaded programs. Each object is associated with an owning thread (the thread that created it). Reference counting operations from the owning thread use non-atomic instructions to modify a “local” reference count. Other threads use atomic instructions to modify a “shared” reference count. This design avoids many atomic read-modify-write operations that are expensive on contemporary processors.

The implementation of BRC proposed in this PEP largely matches the original description of biased reference counting, but differs in details like the size of reference counting fields and special bits in those fields. BRC requires storing three pieces of information in each object’s header: the “local” reference count, the “shared” reference count, and the identifier of the owning thread. The BRC paper packs these three things into a single 64-bit field. This PEP proposes using three separate pointer-sized fields (i.e., three 64-bit fields on 64-bit platforms) in each object’s header to avoid potential issues due to reference count overflow.

The proposed PyObject struct (also called struct _object) is below:

struct _object {
  _PyObject_HEAD_EXTRA
  uintptr_t ob_tid;
  Py_ssize_t ob_ref_local;
  Py_ssize_t ob_ref_shared;
  PyTypeObject *ob_type;
};

The details of the new fields are described in the following sections.

Immortalization

Some objects, such as interned strings, small integers, statically allocated PyTypeObjects, and the True, False, and None objects, stay alive for the lifetime of the program. These objects are marked as immortal by setting the local reference count field (ob_ref_local) to -1 and the thread id (ob_tid) to the unsigned equivalent (UINTPTR_MAX). It’s sufficient to check either of these fields to determine if an object is immortal, which enables slightly more efficient Py_INCREF and Py_DECREF implementations.

The Py_INCREF and Py_DECREF macros are no-ops for immortal objects. This avoids contention on the reference count fields of these objects when multiple threads access them concurrently.

This proposed immortalization scheme is very similar to PEP 683, but with slightly different bit representation in the reference count fields for immortal objects in order to work with biased reference counting and deferred reference counting.
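The check described above can be sketched in C; the struct below is a stand-in for the relevant PyObject header fields, and the helper names are illustrative rather than the PEP's actual implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal model of the two header fields involved in immortalization.
   Field names mirror the PEP; this is not the full PyObject layout. */
typedef struct {
    uintptr_t ob_tid;       /* owning thread id; UINTPTR_MAX if immortal */
    ptrdiff_t ob_ref_local; /* local refcount; -1 if immortal */
} obj_header;

/* Mark an object immortal by setting both fields, per the PEP. */
void make_immortal(obj_header *op) {
    op->ob_ref_local = -1;
    op->ob_tid = UINTPTR_MAX;  /* unsigned equivalent of -1 */
}

/* Checking either field suffices; here we test the local count,
   which keeps the immortality check to a single load and compare. */
int is_immortal(const obj_header *op) {
    return op->ob_ref_local == -1;
}
```

Because an immortal object's reference count fields never change, this check is all that Py_INCREF and Py_DECREF need on their fast paths.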

Biased Reference Counting

Biased reference counting has a fast-path for objects “owned” by the current thread and a slow-path for other objects. Ownership is indicated by the ob_tid field. Determining the thread id requires platform-specific code [5]. Two special values for ob_tid are -1 and 0. A value of -1 indicates the object is immortal (see Immortalization) and a value of 0 indicates that the object is not owned by any thread.

Threads must give up ownership of an object before that object can be destroyed. Ownership is most commonly given up when the local reference count reaches zero, but also can be requested by other threads. Threads give up ownership by setting ob_tid to zero, and adding the local reference count to the shared reference count. If the combined reference count is zero, the object can be deallocated. Otherwise, only the shared reference count field is used from that point onwards.

The ob_ref_local field stores the local reference count and two flags. The two most significant bits are used to indicate the object is immortal or uses deferred reference counting (see Deferred reference counting).

The ob_ref_shared field stores the shared reference count and two flags. The two least significant bits are used to indicate if the object is “merged” or “queued.” The shared reference count is therefore shifted left by two. The ob_ref_shared field uses the least significant bits because the shared reference count can be temporarily negative; increfs and decrefs may not be balanced between threads.

If ob_ref_shared becomes negative, the current thread requests that the owning thread merge the two fields. It atomically pushes the object to the owning thread’s queue of objects to be merged and sets the “queued” bit on ob_ref_shared (to prevent duplicate queueings). The owning thread is notified via the eval_breaker mechanism. In practice, this operation is rare. Most objects are only accessed by a single thread and those objects accessed by multiple threads rarely have negative shared reference counts.
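The packing of the shared field described above can be sketched as follows. The flag values and helper names are assumptions for illustration; the PEP specifies only that the two least significant bits hold the “merged” and “queued” flags and that the count is shifted left by two.

```c
#include <stdint.h>
#include <stddef.h>

#define _Py_SHARED_SHIFT 2
#define _Py_REF_MERGED   0x1  /* hypothetical flag values */
#define _Py_REF_QUEUED   0x2

/* Pack a (possibly negative) shared count plus flags into one field.
   Shifting via uintptr_t avoids undefined behavior on negative counts. */
ptrdiff_t shared_pack(ptrdiff_t count, int flags) {
    return (ptrdiff_t)(((uintptr_t)count << _Py_SHARED_SHIFT)
                       | (uintptr_t)flags);
}

/* Recover the signed count; relies on arithmetic right shift, which
   mainstream compilers provide (implementation-defined in strict C). */
ptrdiff_t shared_count(ptrdiff_t field) {
    return field >> _Py_SHARED_SHIFT;
}

int shared_flags(ptrdiff_t field) {
    return (int)(field & 0x3);
}
```

Keeping the flags in the low bits is what allows the count itself to go temporarily negative without disturbing the flag bits.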

The proposed Py_INCREF and Py_DECREF operations should behave as follows (using C-like pseudo-code):

// low two bits of "ob_ref_shared" are used for flags
#define _Py_SHARED_SHIFT 2

void Py_INCREF(PyObject *op)
{
  Py_ssize_t new_local = op->ob_ref_local + 1;
  if (new_local == 0) {
    return;  // object is immortal (ob_ref_local was -1)
  }
  if (op->ob_tid == _Py_ThreadId()) {
    op->ob_ref_local = new_local;  // owning thread: non-atomic store
  }
  else {
    atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
  }
}

void Py_DECREF(PyObject *op)
{
  if (op->ob_tid == -1) {
    return;  // object is immortal (ob_tid is UINTPTR_MAX)
  }
  if (op->ob_tid == _Py_ThreadId()) {
    op->ob_ref_local -= 1;
    if (op->ob_ref_local == 0) {
      _Py_MergeZeroRefcount(); // merge refcount
    }
  }
  else {
    _Py_DecRefShared(); // slow path
  }
}

The reference implementation [17] contains implementations of _Py_MergeZeroRefcount and _Py_DecRefShared.

Note that the above is pseudocode: in practice, the implementation should use “relaxed atomics” to access ob_tid and ob_ref_local to avoid undefined behavior in C and C++.
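As a rough illustration of the ownership give-up and merge step described earlier, the following non-atomic sketch folds the local count into the shared field and marks it merged. A real implementation would operate atomically on ob_ref_shared; the function name and flag value here are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define _Py_SHARED_SHIFT 2
#define _Py_REF_MERGED   0x1  /* hypothetical flag value */

typedef struct {
    uintptr_t ob_tid;
    ptrdiff_t ob_ref_local;
    ptrdiff_t ob_ref_shared;  /* count shifted left by 2, flags in low bits */
} obj_header;

/* Give up ownership: fold the local count into the shared field,
   clear ob_tid, and set the "merged" flag. Returns the combined
   reference count; if it is zero, the object can be deallocated. */
ptrdiff_t merge_refcounts(obj_header *op) {
    ptrdiff_t combined = op->ob_ref_local
        + (op->ob_ref_shared >> _Py_SHARED_SHIFT);
    op->ob_ref_local = 0;
    op->ob_tid = 0;
    op->ob_ref_shared = (ptrdiff_t)(((uintptr_t)combined << _Py_SHARED_SHIFT)
                                    | _Py_REF_MERGED);
    return combined;
}
```

After the merge, only the shared field is used, so all threads fall back to the atomic slow path for this object.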

Deferred Reference Counting

A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently. These objects don’t necessarily live for the lifetime of the program, so immortalization is not a good fit. This PEP proposes a limited form of deferred reference counting to avoid contention on these objects’ reference count fields in multi-threaded programs.

Typically, the interpreter modifies objects’ reference counts as they are pushed to and popped from the interpreter’s stack. The interpreter skips these reference counting operations for objects that use deferred reference counting. Objects that support deferred reference counting are marked by setting the second-most significant bit in the local reference count field to one.

Because some reference counting operations are skipped, the reference count fields no longer reflect the true number of references to these objects. The true reference count is the sum of the reference count fields plus any skipped references from each thread’s interpreter stack. The true reference count can only be safely computed when all threads are paused during cyclic garbage collection. Consequently, objects that use deferred reference counting can only be deallocated during garbage collection cycles.

Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.
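The deferred-reference-counting mark described above amounts to a bit test on the local reference count field. The sketch below assumes the two top bits of ob_ref_local hold the immortal and deferred flags, per the PEP's description; the macro names and exact bit assignment are illustrative.

```c
#include <stdint.h>

/* Hypothetical flag bits: the most significant bit of ob_ref_local
   marks immortal objects, the second-most significant bit marks
   objects using deferred reference counting. */
#define REF_IMMORTAL ((uintptr_t)1 << (sizeof(uintptr_t) * 8 - 1))
#define REF_DEFERRED ((uintptr_t)1 << (sizeof(uintptr_t) * 8 - 2))

uintptr_t set_deferred(uintptr_t ob_ref_local) {
    return ob_ref_local | REF_DEFERRED;
}

int uses_deferred_rc(uintptr_t ob_ref_local) {
    return (ob_ref_local & REF_DEFERRED) != 0;
}
```

Note that an immortal object (local field set to -1, i.e., all bits set) trivially passes this test too, which is harmless since its reference counting operations are no-ops anyway.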

Garbage Collector Modifications for Deferred Reference Counting

The tracing garbage collector finds and deallocates unreferenced objects. Currently, the tracing garbage collector only finds unreferenced objects that are part of a reference cycle. With deferred reference counting, the tracing garbage collector will also find and collect some unreferenced objects that may not be part of any reference cycle, but whose collection has been delayed due to deferred reference counting. This requires that all objects that support deferred reference counting also have a corresponding type object that supports tracing garbage collection (through the Py_TPFLAGS_HAVE_GC flag). Additionally, the garbage collector will need to traverse each thread’s stack to add references to the GC reference count at the start of each collection.

Reference Counting Type Objects

Type objects (PyTypeObject) use a mix of reference counting techniques. Statically allocated type objects are immortalized because the objects already live for the lifetime of the program. Heap type objects use deferred reference counting in combination with per-thread reference counting. Deferred reference counting is not sufficient to address the multi-threaded scaling bottlenecks with heap types because most references to heap types are from object instances, not references on the interpreter stack.

To address this, heap type reference counts are partially stored in a distributed manner in per-thread arrays. Every thread stores an array of local reference counts, with one entry per heap type object. Each heap type object is assigned a unique number that determines its position in the local reference count arrays. A heap type’s true reference count is the sum of its entries in the per-thread arrays, plus the reference count on the PyTypeObject, plus any deferred references in the interpreter stack.

Threads may grow their own type reference count arrays as needed when incrementing or decrementing the local reference count of a type object.

Use of the per-thread reference count arrays is limited to a few places:

  • PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems): Increments the current thread’s local reference count for type, if it is a heap type.
  • subtype_dealloc(PyObject *self): Decrements the current thread’s local reference count for self->ob_type, if the type is a heap type.
  • gcmodule.c: Adds each thread’s local reference counts to the gc_refs count for the corresponding heap type object.

Additionally, when a thread terminates, it adds any non-zero local reference counts to each type object’s own reference count field.
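A simplified, single-threaded model of the per-thread arrays is sketched below. The names are hypothetical and error handling is omitted; the point is the grow-on-demand array and the summation that recovers the true count.

```c
#include <stddef.h>
#include <stdlib.h>

/* One thread's local reference counts, indexed by the heap type's
   unique number. */
typedef struct {
    ptrdiff_t *counts;
    size_t size;
} thread_refcounts;

/* Increment this thread's local count for a type, growing the array
   on demand (realloc error handling omitted for brevity). */
void local_incref(thread_refcounts *t, size_t type_index) {
    if (type_index >= t->size) {
        size_t new_size = type_index + 1;
        t->counts = realloc(t->counts, new_size * sizeof(ptrdiff_t));
        for (size_t i = t->size; i < new_size; i++)
            t->counts[i] = 0;
        t->size = new_size;
    }
    t->counts[type_index] += 1;
}

/* The true count is the sum of every thread's entry plus the count
   stored on the PyTypeObject itself (deferred stack references,
   handled by the GC, are omitted here). */
ptrdiff_t true_refcount(thread_refcounts *threads, size_t nthreads,
                        size_t type_index, ptrdiff_t type_ob_refcnt) {
    ptrdiff_t total = type_ob_refcnt;
    for (size_t i = 0; i < nthreads; i++)
        if (type_index < threads[i].size)
            total += threads[i].counts[type_index];
    return total;
}
```

Distributing the count this way means instance creation and destruction touch only thread-local memory, avoiding cache-line contention on hot type objects.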

Memory Management

CPython currently uses an internal allocator, pymalloc, which is optimized for small object allocation. The pymalloc implementation is not thread-safe without the GIL. This PEP proposes replacing pymalloc with mimalloc, a general-purpose thread-safe allocator with good performance, including for small allocations.

Using mimalloc, with some modifications, also addresses two other issues related to removing the GIL. First, traversing the internal mimalloc structures allows the garbage collector to find all Python objects without maintaining a linked list. This is described in more detail in the garbage collection section. Second, mimalloc heaps and allocations based on size class enable collections like dict to generally avoid acquiring locks during read-only operations. This is described in more detail in the collection thread-safety section.

CPython already requires that objects that support garbage collection use the GC allocator APIs (typically indirectly by calling PyType_GenericAlloc). This PEP would add additional requirements to the use of the Python allocator APIs. First, Python objects must be allocated through object allocation APIs, such as PyType_GenericAlloc, PyObject_Malloc, or other Python APIs that wrap those calls. Python objects should not be allocated through other APIs, such as raw calls to C’s malloc or the C++ new operator. Additionally, PyObject_Malloc should be used only for allocating Python objects; it should not be used for allocating buffers, storages, or other data structures that are not PyObjects.

This PEP also imposes restrictions on the pluggable allocator API (PyMem_SetAllocator). When compiling without the GIL, allocators set using this API must eventually delegate the allocation to the corresponding underlying allocator, such as PyObject_Malloc, for Python object allocations. This allows for allocators that “wrap” underlying allocators, such as Python’s tracemalloc and debug allocator, but not for wholly replacing the allocator.

CPython Free Lists

CPython makes use of free lists to speed up the allocation of small, frequently allocated objects like tuples and numbers. These free lists are not thread-safe and will need to be disabled when building Python in the --without-gil mode.

Garbage Collection (Cycle Collection)

The CPython garbage collector requires the following changes to work with this proposal:

  • Use of “stop-the-world” to provide thread-safety guarantees that were previously provided by the GIL.
  • Elimination of generational garbage collection in favor of a non-generational collector.
  • Integration with deferred reference counting and biased reference counting.

Stop-the-World

The CPython cycle garbage collector currently relies on the global interpreter lock to prevent other threads from accessing Python objects while the collector finds cycles. The GIL is never released during the cycle-finding routine, so the collector can rely on stable (i.e., unchanging) reference counts and references for the duration of that routine. However, following cycle detection, the GIL may be temporarily released while calling objects’ finalizers and clear (tp_clear) functions, allowing other threads to run in an interleaved fashion.

When running without the GIL, the implementation needs a way to ensure that reference counts remain stable during cycle detection. Threads running Python code must be paused to ensure that references and reference counts remain stable. Once the cycles are identified, other threads are resumed.

The current CPython cyclic garbage collector involves two cycle-detection passes during each garbage collection cycle. Consequently, this requires two stop-the-world pauses when running the garbage collector without the GIL. The first cycle-detection pass identifies cyclic trash. The second pass runs after finalizers to identify which objects still remain unreachable. Note that other threads are resumed before finalizers and tp_clear functions are called to avoid introducing potential deadlocks that are not present in the current CPython behavior.

Thread States

To support pausing threads for garbage collection, the PyThreadState gets a new “status” field. Like the other fields in PyThreadState, the status field is not part of the public CPython API. The status field may be in one of three states:

  • ATTACHED
  • DETACHED
  • GC

The ATTACHED and DETACHED states correspond closely to acquiring and releasing the global interpreter lock. When compiling without the GIL, functions that previously acquired the GIL instead transition the thread state to ATTACHED, and functions that previously released the GIL transition the thread state to DETACHED. Just as threads previously needed to acquire the GIL before accessing or modifying Python objects, they now must be in the ATTACHED state before accessing or modifying Python objects. Since the same public C-API functions “attach” the thread as previously acquired the GIL (e.g., PyEval_RestoreThread), the requirements for thread initialization in extensions remain the same. The substantial difference is that multiple threads can be in the attached state simultaneously, while previously only one thread could acquire the GIL at a time.

During stop-the-world pauses, the thread performing garbage collection needs to ensure that no other thread is accessing or modifying Python objects. All other threads must be in the “GC” state. The garbage collection thread can transition other threads from the DETACHED state to the GC state using an atomic compare-and-swap operation on the status field. Threads in the ATTACHED state are requested to pause themselves and set their status to “GC”, using the existing “eval breaker” mechanism. At the end of the stop-the-world pause, all threads in the “GC” state are set to DETACHED and woken up if they are paused. Threads that were previously attached (i.e., executing Python bytecode) can re-attach (set their thread states to ATTACHED) and resume executing Python code. Threads that were previously DETACHED ignore the notification.
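The DETACHED-to-GC transition described above can be sketched with a C11 compare-and-swap. The enum and function names here are illustrative, not CPython's; the key property is that the CAS only succeeds against a DETACHED thread, so an ATTACHED thread can never be forcibly claimed and must pause itself via the eval breaker.

```c
#include <stdatomic.h>

/* Hypothetical model of the new PyThreadState status field. */
typedef enum { DETACHED, ATTACHED, GC } thread_status;

/* The GC thread tries to move a target thread from DETACHED to GC.
   Returns 1 on success; fails (returns 0) if the thread is ATTACHED
   and must instead be asked to pause itself. */
int gc_try_claim(_Atomic thread_status *status) {
    thread_status expected = DETACHED;
    return atomic_compare_exchange_strong(status, &expected, GC);
}
```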

Generations

The existing Python garbage collector uses three generations. When compiling without the GIL, the garbage collector will only use a single generation (i.e., it will be non-generational). The primary reason for this change is to reduce the impact of the stop-the-world pauses in multithreaded applications. Frequent stop-the-world pauses for collecting the young generation would have more of an impact on multi-threaded applications than less frequent collections.

Integration With Deferred and Biased Reference Counting

To find unreferenced objects, the cyclic garbage collector computes the difference between the number of incoming references and the object’s reference count. This difference is called gc_refs and is stored in the _gc_prev field. If gc_refs is greater than zero, then the object is guaranteed to be alive (i.e., not cyclic trash). If gc_refs is zero, then the object is only alive if it is transitively referenced by another live object. When computing this difference, the collector should traverse each thread’s stack, and for every deferred reference, increment the gc_refs for the referred object. Since generator objects also have stacks with deferred references, the same procedure is applied to each generator’s stack.

Python unit tests commonly use gc.collect() to ensure that any unreferenced objects are destructed and their finalizers run. Since biased reference counting can delay the destruction of some objects that are referenced by multiple threads, it’s convenient to ensure that those objects are destructed during garbage collection, even though they may not be part of any reference cycles. While other threads are paused, the garbage collector thread should merge the reference counts for any queued objects, but not call any destructors even if the combined reference count is zero. (Calling destructors while other threads are paused risks introducing deadlocks.) Once other threads are resumed, the GC thread should call _Py_Dealloc on those objects with a zero merged reference count.

Container Thread-Safety

In CPython, the global interpreter lock protects against corruption of internal interpreter states when multiple threads concurrently access or modify Python objects. For example, if multiple threads concurrently modify the same list, the GIL ensures that the length of the list (ob_size) accurately matches the number of elements, and that the reference counts of each element accurately reflect the number of references to those elements. Without the GIL — and absent other changes — concurrent modifications would corrupt those fields and likely lead to program crashes.

The GIL does not necessarily ensure that operations are atomic or remain correct when multiple operations occur concurrently. For example, list.extend(iterable) may not appear atomic if the iterable has an iterator implemented in Python (or releases the GIL internally). Similarly, list.remove(x) can remove the wrong object if it overlaps with another operation that modifies the list, depending on the implementation of the equality operator. Still, the GIL ensures that some operations are effectively atomic. For example, the constructor list(set) atomically copies the items of the set to a new list, and some code relies on that copy being atomic (i.e., having a snapshot of the items in the set). This PEP preserves that property.

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated (lightweight) lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.

Not all Python objects require locks. For example, immutable objects like tuples, strings, and numbers do not require a lock.

Per-object locks with critical sections provide weaker protections than the GIL. Because the GIL doesn’t necessarily ensure that concurrent operations are atomic or correct, the per-object locking scheme also cannot ensure that concurrent operations are atomic or correct. Instead, per-object locking aims for similar protections as the GIL, but with mutual exclusion limited to individual objects.

Most operations on an instance of a container type require locking that object. For example:

  • list.append, list.insert, list.repeat, PyList_SetItem
  • dict.__setitem__, PyDict_SetItem
  • list.clear, dict.clear
  • list.__repr__, dict.__repr__, etc.
  • list.extend(iterable)
  • setiter_iternext

Some operations operate directly on two container objects, with knowledge about both containers’ internal structure. For example, there are internal specializations of list.extend(iterable) for specific iterable types, like set. These operations need to lock both container objects because they access the internals of both objects simultaneously. Note that the generic implementation of list.extend only needs to lock one object (the list) because the other object is accessed indirectly through the thread-safe iterator API. Operations that lock two containers are:

  • list.extend(list), list.extend(set), list.extend(dictitems), and other specializations where the implementation is specialized for the argument type.
  • list.concat(list)
  • list.__eq__(list), dict.__eq__(dict)

Some simple operations can be implemented directly with atomic accesses and do not need locks because they only access a single field. These operations include:

  • len(list), i.e., list_length(PyListObject *a)
  • len(dict)
  • len(set)

A select few operations optimistically avoid locking to improve performance. These require special implementations and cooperation from the memory allocator:

  • list[idx] (list_subscript)
  • dict[key] (dict_subscript)
  • listiter_next, dictiter_iternextkey/value/item
  • list.contains

Borrowed References

Per-object locking provides many of the important protections that the GIL provides, but there are a few cases where it’s not sufficient. For example, code that relies on upgrading a borrowed reference to an “owned” reference may be unsafe in certain circumstances:

PyObject *item = PyList_GetItem(list, idx);
Py_INCREF(item);

The GIL ensures that no other thread can modify the list in between the access and the Py_INCREF call. Without the GIL – even with per-object locking – another thread might modify the list, leading to item being freed between the access and the Py_INCREF call.

The problematic borrowed reference APIs are supplemented with functions that return “new references” but are otherwise equivalent:

  • PyList_FetchItem(list, idx) for PyList_GetItem
  • PyDict_FetchItem(dict, key) for PyDict_GetItem
  • PyWeakref_FetchObject for PyWeakref_GetObject

Note that some APIs that return borrowed references, such as PyTuple_GetItem, are not problematic because tuples are immutable. Similarly, not all uses of the above APIs are problematic. For example, PyDict_GetItem is often used for parsing keyword argument dictionaries in function calls; those keyword argument dictionaries are effectively private (not accessible by other threads).

Python Critical Sections

Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.

This PEP proposes a scheme called “Python critical sections” to implicitly release per-object locks to avoid deadlocks. To understand the scheme, we first introduce a general approach to avoid deadlocks, and then propose a refinement of that approach with better performance.

One way to avoid deadlocks is to allow threads to hold only the lock (or locks) for a single operation at a time (typically a single lock, but some operations involve two locks as described above). When a thread begins a nested operation it should suspend the locks for any outer operation: before beginning the nested operation, the locks for the outer operation are released and when the nested operation completes, the locks for the outer operation are reacquired.

Additionally, the locks for any active operation should be suspended around potentially blocking operations, such as I/O (i.e., operations that would have released the GIL). This is because the interaction between locks and blocking operations can lead to deadlocks in the same way as the interaction between multiple locks.

To improve performance, this PEP proposes a variation of the above scheme that still avoids deadlocks. Instead of immediately suspending locks any time a nested operation begins, locks are only suspended if the thread would block (i.e., would have released the GIL). This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks.

The proposed API for Python critical sections consists of the following four macros. These are intended to be public (usable by C-API extensions), but not part of the limited API:

  • Py_BEGIN_CRITICAL_SECTION(PyMutex m);: Begins a critical section by acquiring the mutex m. If m is already locked, then locks for any outstanding critical sections are released before this thread waits for m to be unlocked.
  • Py_END_CRITICAL_SECTION(PyMutex m);: Ends the most recent operation, unlocking the mutex m. The most recent previous critical section (if any) is resumed if it is currently suspended.
  • Py_BEGIN_CRITICAL_SECTION2(PyMutex m1, PyMutex m2);: Begins a critical section by acquiring the mutexes m1 and m2. To ensure consistent lock ordering, the order of acquisition is determined by memory address (i.e., the mutex with lower memory address is acquired first). If either mutex is already locked, then locks for any outstanding critical sections are released before this thread waits for m1 or m2 to be unlocked.
  • Py_END_CRITICAL_SECTION2(PyMutex m1, PyMutex m2);: Behaves the same as Py_END_CRITICAL_SECTION but unlocks two mutexes m1 and m2.

Additionally, when a thread transitions from the ATTACHED state to the DETACHED state, it should suspend any active critical sections. When transitioning from DETACHED to ATTACHED, the most recent suspended critical section, if any, should be resumed.
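The address-based acquisition order used by Py_BEGIN_CRITICAL_SECTION2 can be sketched as follows. PyMutex here is a stand-in type and the helper is hypothetical; the property that matters is that any two threads locking the same pair of mutexes agree on the order, so they cannot deadlock on that pair.

```c
#include <stdint.h>

typedef struct { int locked; } PyMutex;  /* stand-in for the real type */

/* Determine the consistent acquisition order for two mutexes: the one
   at the lower memory address is always locked first. */
void lock_pair_order(PyMutex *m1, PyMutex *m2,
                     PyMutex **first, PyMutex **second) {
    if ((uintptr_t)m1 <= (uintptr_t)m2) {
        *first = m1;
        *second = m2;
    } else {
        *first = m2;
        *second = m1;
    }
}
```

Ordering by address is a classic deadlock-avoidance technique: it imposes a global total order on locks without any central registry.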

Optimistically Avoiding Locking

A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary’s or list’s lock when another thread is concurrently modifying that container.

The operations with an optimistic fast path are:

  • PyDict_FetchItem/GetItem and dict.__getitem__
  • PyList_FetchItem/GetItem and list.__getitem__

Additionally, iterators for dict and list use the above functions so they also optimistically avoid locking when returning the next item.

There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).

This section describes the challenges with implementing dictionary and list accesses without locking followed by a description of this PEP’s changes to the Python interpreter required to address those challenges.

The main challenge is that retrieving an item from a list or dictionary and incrementing the reference count of that item is not an atomic operation. In between the time the item is retrieved and the reference count is incremented, another thread may modify the list or dictionary, possibly freeing the memory for the previously retrieved item.

A partial attempt at addressing this issue would be to convert the reference count increment to a conditional increment, only incrementing the reference count if it’s not zero. This change is not sufficient because when a Python object’s reference count reaches zero, the object’s destructor is called and the memory storing the object may be re-used for other data structures or returned to the operating system. Instead, this PEP proposes a technique to ensure that the reference count fields remain valid for the duration of the access, so that the conditional reference count increment is safe. This technique requires cooperation from the memory allocator (mimalloc) as well as changes to the list and dictionary objects. The proposed technique is similar to read-copy update (RCU) [6], a synchronization mechanism widely used in the Linux kernel.

The current implementation of list_item (the C function implementing list.__getitem__) is the following:

Py_INCREF(a->ob_item[i]);
return a->ob_item[i];

The proposed implementation uses the conditional increment (_Py_TRY_INCREF) and has additional checks:

PyObject **ob_item = atomic_load(&a->ob_item);
PyObject *item = atomic_load(&ob_item[i]);
if (!item || !_Py_TRY_INCREF(item)) goto retry;
if (item != atomic_load(&ob_item[i])) {
  Py_DECREF(item);
  goto retry;
}
if (ob_item != atomic_load(&a->ob_item)) {
  Py_DECREF(item);
  goto retry;
}
return item;

The “retry” subroutine implements the locked fallback path when concurrent modifications to the list cause the above fast, non-locking path to fail:

retry:
  PyObject *item;
  Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
  item = a->ob_item[i];
  Py_INCREF(item);
  Py_END_CRITICAL_SECTION(a->ob_mutex);
  return item;

The modifications to the dict implementation are similar, because the relevant parts of both list and dictionary retrieval involve loading an item/value from an array at a known index.

The additional checks following the conditional increment are necessary because the scheme allows immediate re-use of memory, including memory that previously held a PyObject structure or a list or dict array. Without these extra checks, the function might return a Python object that was never in the list: the memory that once stored a list item may since have been re-used for an unrelated PyObject at the same address.
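The conditional increment that these checks build on can be sketched with C11 atomics. The following is a simplified stand-in (the names obj_t and try_incref are hypothetical; CPython's actual _Py_TRY_INCREF must also cooperate with biased reference counting, immortalization, and deferred reference counting):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a PyObject header: only the refcount. */
typedef struct {
    _Atomic uint32_t ob_ref;
} obj_t;

/* Conditionally increment the reference count: succeed only while the
 * count is non-zero. If another thread has already dropped the last
 * reference, the object's memory may be re-used at any moment, so the
 * increment must not happen. */
static bool try_incref(obj_t *op)
{
    uint32_t cnt = atomic_load(&op->ob_ref);
    while (cnt != 0) {
        /* On failure, the CAS reloads cnt and the loop retries. */
        if (atomic_compare_exchange_weak(&op->ob_ref, &cnt, cnt + 1)) {
            return true;
        }
    }
    return false;
}
```

The compare-and-swap loop is what makes the increment conditional: a plain atomic increment could resurrect an object whose count had already reached zero.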

Mimalloc Changes for Optimistic list and dict Access

The implementation imposes additional constraints on the memory allocator, including some changes to the mimalloc code. Some background on mimalloc’s implementation is helpful for understanding the required changes. Individual allocations from mimalloc are called “blocks.” Mimalloc “pages” contain consecutive blocks that are all the same size. A mimalloc “page” is similar to a “superblock” in other allocators; it is NOT an operating system page. A mimalloc “heap” contains pages of various size classes; each page belongs to a single heap. If none of the blocks of a page are allocated, then mimalloc may re-use the page for a different size class or a different heap (i.e., it might reinitialize the page).

The list and dictionary access scheme works by partially restricting re-use of mimalloc pages so that reference count fields remain valid for the duration of the access. The restricted re-use of mimalloc pages is enforced by having separate heaps for Python objects [7]. This ensures that even if an item is freed during access and the memory reused for a new object, the new object’s reference count field is placed at the same location in memory. The reference count field remains valid (or zero) across allocations.

Python objects that support cyclic garbage collection have two extra fields preceding the PyObject header, so their reference count fields are at a different offset from the start of their allocations. There are therefore two mimalloc heaps for Python objects, one for objects that support GC and one for objects that do not.

The backing arrays for lists and the PyDictKeysObject [8] for dictionaries face hazards similar to those of Python objects. Lists and dictionaries may be resized concurrently with accesses, reallocating the backing array or keys object. Thus, there are two additional mimalloc heaps: one for list arrays and one for dictionary keys objects. In total, there are five mimalloc heaps: two for Python objects (GC and non-GC), one for list arrays, one for dictionary keys, and the default mimalloc heap used for other allocations.
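The reason two separate object heaps are needed can be illustrated with a small layout sketch (the names and offsets below are a hypothetical simplification, not CPython's actual object layout): because GC-supporting objects carry two extra fields before the PyObject header, their reference count sits at a different offset from the start of the allocation, and mixing the two kinds in one heap would break the "refcount at a fixed offset" invariant.

```c
#include <stddef.h>

/* Hypothetical illustration: GC-supporting objects carry two extra
 * pointer-sized fields (the GC header) before the PyObject header. */
enum { GC_HEADER_WORDS = 2 };

/* Offset of the reference count field from the start of the block.
 * Within one heap this offset must be the same for every block, so
 * GC and non-GC objects cannot share a heap. */
static size_t refcount_offset(int supports_gc)
{
    return supports_gc ? GC_HEADER_WORDS * sizeof(void *) : 0;
}
```

With this invariant, a conditional increment on a freed-and-reused block still reads a real (or zero) reference count rather than arbitrary bytes.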

Mimalloc Page Reuse

It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.

This PEP proposes a system that lies between those two extremes based on FreeBSD’s “GUS” [9]. It uses a combination of global and per-thread counters (or “sequence numbers”) to coordinate the determination of when it is safe to reuse an empty mimalloc page for a different heap or for a different size class, or to return it to the operating system:

  • There is a global write sequence number that monotonically increases.
  • When a mimalloc page is empty, it’s tagged with the current write sequence number. The thread may also atomically increment the global write sequence number.
  • Each thread has a local read sequence number that records the most recent write sequence number it has observed.
  • Threads may observe the write sequence number whenever they are not in a list or dictionary access. The reference implementation does this in mimalloc’s slow-path allocation function. This is called regularly enough to be useful, but not so frequently as to introduce significant overhead.
  • There is a global read sequence number that stores the minimum of all active threads’ read sequence numbers. A thread may update the global read sequence number by scanning each thread’s local read sequence number. The reference implementation does this before allocating a fresh mimalloc page if there are restricted pages that could possibly be reused.
  • An empty mimalloc page may be reused for a different heap or size class when the global read sequence number is larger than the page’s tag number.

The condition that the global read sequence number is larger than the page’s tag is sufficient because it ensures that any thread that had a concurrent optimistic list or dictionary access is finished with that access. In other words, there are no threads accessing the empty blocks in the freed page, so the page can be used for any other purpose or even returned to the operating system.
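The sequence-number bookkeeping can be sketched in a few lines of C (all names below are hypothetical; in the reference implementation this logic lives inside mimalloc's allocation paths and the counters are per-thread-state):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 4

/* Global write sequence number, monotonically increasing. */
static _Atomic uint64_t write_seq = 1;

/* Each thread's most recently observed write sequence number. */
static _Atomic uint64_t read_seq[MAX_THREADS];

/* Called when a mimalloc page becomes empty: tag the page with the
 * current write sequence number and advance the global counter. */
static uint64_t tag_empty_page(void)
{
    return atomic_fetch_add(&write_seq, 1);
}

/* Called by a thread whenever it is NOT inside a list or dictionary
 * access (e.g. in the slow-path allocation function). */
static void observe(int tid)
{
    atomic_store(&read_seq[tid], atomic_load(&write_seq));
}

/* The global read sequence number: the minimum over all threads. */
static uint64_t global_read_seq(void)
{
    uint64_t min = atomic_load(&read_seq[0]);
    for (int t = 1; t < MAX_THREADS; t++) {
        uint64_t r = atomic_load(&read_seq[t]);
        if (r < min) min = r;
    }
    return min;
}

/* An empty page may be reused for a different heap or size class
 * only once every thread has observed a newer write sequence. */
static bool page_reusable(uint64_t page_tag)
{
    return global_read_seq() > page_tag;
}
```

Once global_read_seq() exceeds a page's tag, every thread has passed through a point outside any optimistic access since the page became empty, so no thread can still hold a pointer into its blocks.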

Optimistic dict and list Access Summary

This PEP proposes a technique for thread-safe list and dictionary accesses that typically avoids acquiring locks. This reduces execution overhead and avoids some multi-threaded scaling bottlenecks in common operations, like calling functions and methods. The scheme works by placing temporary restrictions on mimalloc page reuse to ensure that objects’ reference count fields remain valid after objects are freed, so that conditional reference count increment operations are safe. The restrictions are placed on mimalloc pages instead of on individual objects to improve opportunities for memory reuse. The restrictions are lifted as soon as the system can determine that there are no outstanding accesses involving the empty mimalloc page. To determine this, the system combines lightweight per-thread sequence counters with tags on empty pages: once every thread’s local counter is larger than a page’s tag, the page can be reused for any purpose or returned to the operating system. The restrictions are also lifted whenever the cyclic garbage collector runs, because its stop-the-world pause ensures that threads do not have any outstanding references to empty mimalloc pages.
