| Locked Counters (aka ``QemuLockCnt``) |
| ===================================== |
| |
| QEMU often uses reference counts to track data structures that are being |
| accessed and should not be freed. For example, a loop that invoke |
| callbacks like this is not safe:: |
| |
| QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { |
| if (ioh->revents & G_IO_OUT) { |
| ioh->fd_write(ioh->opaque); |
| } |
| } |
| |
| ``QLIST_FOREACH_SAFE`` protects against deletion of the current node (``ioh``) |
| by stashing away its ``next`` pointer. However, ``ioh->fd_write`` could |
| actually delete the next node from the list. The simplest way to |
| avoid this is to mark the node as deleted, and remove it from the |
| list in the above loop:: |
| |
| QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { |
| if (ioh->deleted) { |
| QLIST_REMOVE(ioh, next); |
| g_free(ioh); |
| } else { |
| if (ioh->revents & G_IO_OUT) { |
| ioh->fd_write(ioh->opaque); |
| } |
| } |
| } |
| |
| If however this loop must also be reentrant, i.e. it is possible that |
| ``ioh->fd_write`` invokes the loop again, some kind of counting is needed:: |
| |
| walking_handlers++; |
| QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { |
| if (ioh->deleted) { |
| if (walking_handlers == 1) { |
| QLIST_REMOVE(ioh, next); |
| g_free(ioh); |
| } |
| } else { |
| if (ioh->revents & G_IO_OUT) { |
| ioh->fd_write(ioh->opaque); |
| } |
| } |
| } |
| walking_handlers--; |
| |
| One may think of using the RCU primitives, ``rcu_read_lock()`` and |
| ``rcu_read_unlock()``; effectively, the RCU nesting count would take |
| the place of the walking_handlers global variable. Indeed, |
| reference counting and RCU have similar purposes, but their usage in |
| general is complementary: |
| |
| - reference counting is fine-grained and limited to a single data |
| structure; RCU delays reclamation of *all* RCU-protected data |
| structures; |
| |
| - reference counting works even in the presence of code that keeps |
| a reference for a long time; RCU critical sections in principle |
| should be kept short; |
| |
| - reference counting is often applied to code that is not thread-safe |
| but is reentrant; in fact, usage of reference counting in QEMU predates |
| the introduction of threads by many years. RCU is generally used to |
| protect readers from other threads freeing memory after concurrent |
| modifications to a data structure. |
| |
| - reclaiming data can be done by a separate thread in the case of RCU; |
| this can improve performance, but also delay reclamation undesirably. |
| With reference counting, reclamation is deterministic. |
| |
| This file documents ``QemuLockCnt``, an abstraction for using reference |
| counting in code that has to be both thread-safe and reentrant. |
| |
| |
| ``QemuLockCnt`` concepts |
| ------------------------ |
| |
| A ``QemuLockCnt`` comprises both a counter and a mutex; it has primitives |
| to increment and decrement the counter, and to take and release the |
| mutex. The counter notes how many visits to the data structures are |
| taking place (the visits could be from different threads, or there could |
| be multiple reentrant visits from the same thread). The basic rules |
| governing the counter/mutex pair then are the following: |
| |
| - Data protected by the QemuLockCnt must not be freed unless the |
| counter is zero and the mutex is taken. |
| |
| - A new visit cannot be started while the counter is zero and the |
| mutex is taken. |
| |
| Most of the time, the mutex protects all writes to the data structure, |
| not just frees, though there could be cases where this is not necessary. |
| |
| Reads, instead, can be done without taking the mutex, as long as the |
| readers and writers use the same macros that are used for RCU, for |
| example ``qatomic_rcu_read``, ``qatomic_rcu_set``, ``QLIST_FOREACH_RCU``, |
| etc. This is because the reads are done outside a lock and a set |
| or ``QLIST_INSERT_HEAD`` |
| can happen concurrently with the read. The RCU API ensures that the |
| processor and the compiler see all required memory barriers. |
| |
| This could be implemented simply by protecting the counter with the |
| mutex, for example:: |
| |
| // (1) |
| qemu_mutex_lock(&walking_handlers_mutex); |
| walking_handlers++; |
| qemu_mutex_unlock(&walking_handlers_mutex); |
| |
| ... |
| |
| // (2) |
| qemu_mutex_lock(&walking_handlers_mutex); |
| if (--walking_handlers == 0) { |
| QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { |
| if (ioh->deleted) { |
| QLIST_REMOVE(ioh, next); |
| g_free(ioh); |
| } |
| } |
| } |
| qemu_mutex_unlock(&walking_handlers_mutex); |
| |
| Here, no frees can happen in the code represented by the ellipsis. |
| If another thread is executing critical section (2), that part of |
| the code cannot be entered, because the thread will not be able |
| to increment the ``walking_handlers`` variable. And of course |
| during the visit any other thread will see a nonzero value for |
| ``walking_handlers``, as in the single-threaded code. |
| |
| Note that it is possible for multiple concurrent accesses to delay |
| the cleanup arbitrarily; in other words, for the ``walking_handlers`` |
| counter to never become zero. For this reason, this technique is |
| more easily applicable if concurrent access to the structure is rare. |
| |
| However, critical sections are easy to forget since you have to do |
| them for each modification of the counter. ``QemuLockCnt`` ensures that |
| all modifications of the counter take the lock appropriately, and it |
| can also be more efficient in two ways: |
| |
| - it avoids taking the lock for many operations (for example |
| incrementing the counter while it is non-zero); |
| |
| - on some platforms, one can implement ``QemuLockCnt`` to hold the lock |
| and the mutex in a single word, making the fast path no more expensive |
| than simply managing a counter using atomic operations (see |
| :doc:`atomics`). This can be very helpful if concurrent access to |
| the data structure is expected to be rare. |
| |
| |
| Using the same mutex for frees and writes can still incur some small |
| inefficiencies; for example, a visit can never start if the counter is |
| zero and the mutex is taken -- even if the mutex is taken by a write, |
| which in principle need not block a visit of the data structure. |
| However, these are usually not a problem if any of the following |
| assumptions are valid: |
| |
| - concurrent access is possible but rare |
| |
| - writes are rare |
| |
| - writes are frequent, but this kind of write (e.g. appending to a |
| list) has a very small critical section. |
| |
| For example, QEMU uses ``QemuLockCnt`` to manage an ``AioContext``'s list of |
| bottom halves and file descriptor handlers. Modifications to the list |
| of file descriptor handlers are rare. Creation of a new bottom half is |
| frequent and can happen on a fast path; however: 1) it is almost never |
| concurrent with a visit to the list of bottom halves; 2) it only has |
| three instructions in the critical path, two assignments and a ``smp_wmb()``. |
| |
| |
| ``QemuLockCnt`` API |
| ------------------- |
| |
| .. kernel-doc:: include/qemu/lockcnt.h |
| |
| |
| ``QemuLockCnt`` usage |
| --------------------- |
| |
| This section explains the typical usage patterns for ``QemuLockCnt`` functions. |
| |
| Setting a variable to a non-NULL value can be done between |
| ``qemu_lockcnt_lock`` and ``qemu_lockcnt_unlock``:: |
| |
| qemu_lockcnt_lock(&xyz_lockcnt); |
| if (!xyz) { |
| new_xyz = g_new(XYZ, 1); |
| ... |
| qatomic_rcu_set(&xyz, new_xyz); |
| } |
| qemu_lockcnt_unlock(&xyz_lockcnt); |
| |
| Accessing the value can be done between ``qemu_lockcnt_inc`` and |
| ``qemu_lockcnt_dec``:: |
| |
| qemu_lockcnt_inc(&xyz_lockcnt); |
| if (xyz) { |
| XYZ *p = qatomic_rcu_read(&xyz); |
| ... |
| /* Accesses can now be done through "p". */ |
| } |
| qemu_lockcnt_dec(&xyz_lockcnt); |
| |
| Freeing the object can similarly use ``qemu_lockcnt_lock`` and |
| ``qemu_lockcnt_unlock``, but you also need to ensure that the count |
| is zero (i.e. there is no concurrent visit). Because ``qemu_lockcnt_inc`` |
| takes the ``QemuLockCnt``'s lock, the count cannot become non-zero while |
| the object is being freed. Freeing an object looks like this:: |
| |
| qemu_lockcnt_lock(&xyz_lockcnt); |
| if (!qemu_lockcnt_count(&xyz_lockcnt)) { |
| g_free(xyz); |
| xyz = NULL; |
| } |
| qemu_lockcnt_unlock(&xyz_lockcnt); |
| |
| If an object has to be freed right after a visit, you can combine |
| the decrement, the locking and the check on count as follows:: |
| |
| qemu_lockcnt_inc(&xyz_lockcnt); |
| if (xyz) { |
| XYZ *p = qatomic_rcu_read(&xyz); |
| ... |
| /* Accesses can now be done through "p". */ |
| } |
| if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) { |
| g_free(xyz); |
| xyz = NULL; |
| qemu_lockcnt_unlock(&xyz_lockcnt); |
| } |
| |
| ``QemuLockCnt`` can also be used to access a list as follows:: |
| |
| qemu_lockcnt_inc(&io_handlers_lockcnt); |
| QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) { |
| if (ioh->revents & G_IO_OUT) { |
| ioh->fd_write(ioh->opaque); |
| } |
| } |
| |
| if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) { |
| QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { |
| if (ioh->deleted) { |
| QLIST_REMOVE(ioh, next); |
| g_free(ioh); |
| } |
| } |
| qemu_lockcnt_unlock(&io_handlers_lockcnt); |
| } |
| |
| Again, the RCU primitives are used because new items can be added to the |
| list during the walk. ``QLIST_FOREACH_RCU`` ensures that the processor and |
| the compiler see the appropriate memory barriers. |
| |
| An alternative pattern uses ``qemu_lockcnt_dec_if_lock``:: |
| |
| qemu_lockcnt_inc(&io_handlers_lockcnt); |
| QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) { |
| if (ioh->deleted) { |
| if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) { |
| QLIST_REMOVE(ioh, next); |
| g_free(ioh); |
| qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt); |
| } |
| } else { |
| if (ioh->revents & G_IO_OUT) { |
| ioh->fd_write(ioh->opaque); |
| } |
| } |
| } |
| qemu_lockcnt_dec(&io_handlers_lockcnt); |
| |
| Here you can use ``qemu_lockcnt_dec`` instead of ``qemu_lockcnt_dec_and_lock``, |
| because there is no special task to do if the count goes from 1 to 0. |