| Using RCU (Read-Copy-Update) for synchronization |
| ================================================ |
| |
| Read-copy update (RCU) is a synchronization mechanism that is used to |
| protect read-mostly data structures. RCU is very efficient and scalable |
| on the read side (it is wait-free), and thus can make the read paths |
| extremely fast. |
| |
| RCU supports concurrency between a single writer and multiple readers, |
| thus it is not used alone. Typically, the write-side will use a lock to |
| serialize multiple updates, but other approaches are possible (e.g., |
| restricting updates to a single task). In QEMU, when a lock is used, |
| this will often be the "iothread mutex", also known as the "big QEMU |
| lock" (BQL). Also, restricting updates to a single task is done in |
| QEMU using the "bottom half" API. |
| |
RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
sections of code as "critical sections", and the update side will wait
| for the execution of all *currently running* critical sections before |
| proceeding, or before asynchronously executing a callback. |
| |
| The key point here is that only the currently running critical sections |
| are waited for; critical sections that are started **after** the beginning |
| of the wait do not extend the wait, despite running concurrently with |
| the updater. This is the reason why RCU is more scalable than, |
| for example, reader-writer locks. It is so much more scalable that |
| the system will have a single instance of the RCU mechanism; a single |
| mechanism can be used for an arbitrary number of "things", without |
| having to worry about things such as contention or deadlocks. |
| |
How is this possible? The basic idea is to split updates into two
phases, "removal" and "reclamation". During removal, we ensure that
subsequent readers cannot get a reference to the old data. After
removal has completed, a critical section will not be able to access
the old data. Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the old data, which
can now be safely reclaimed (e.g., freed or unref'ed).
| |
| Here is a picture:: |
| |
| thread 1 thread 2 thread 3 |
| ------------------- ------------------------ ------------------- |
| enter RCU crit.sec. |
| | finish removal phase |
| | begin wait |
| | | enter RCU crit.sec. |
| exit RCU crit.sec | | |
| complete wait | |
| begin reclamation phase | |
| exit RCU crit.sec. |
| |
| |
Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data. This is possible because the old version of the
data structure was not accessible at the time thread 3 began executing
its critical section.
| |
| |
| RCU API |
| ------- |
| |
| The core RCU API is small: |
| |
| ``void rcu_read_lock(void);`` |
| Used by a reader to inform the reclaimer that the reader is |
| entering an RCU read-side critical section. |
| |
| ``void rcu_read_unlock(void);`` |
| Used by a reader to inform the reclaimer that the reader is |
| exiting an RCU read-side critical section. Note that RCU |
| read-side critical sections may be nested and/or overlapping. |
| |
| ``void synchronize_rcu(void);`` |
| Blocks until all pre-existing RCU read-side critical sections |
| on all threads have completed. This marks the end of the removal |
| phase and the beginning of reclamation phase. |
| |
| Note that it would be valid for another update to come while |
| ``synchronize_rcu`` is running. Because of this, it is better that |
| the updater releases any locks it may hold before calling |
| ``synchronize_rcu``. If this is not possible (for example, because |
| the updater is protected by the BQL), you can use ``call_rcu``. |
| |
| ``void call_rcu1(struct rcu_head * head, void (*func)(struct rcu_head *head));`` |
This function invokes ``func(head)`` after all pre-existing RCU
read-side critical sections on all threads have completed. This
marks the end of the removal phase, with ``func`` taking care of
the reclamation phase asynchronously.
| |
| The ``foo`` struct needs to have an ``rcu_head`` structure added, |
| perhaps as follows:: |
| |
| struct foo { |
| struct rcu_head rcu; |
| int a; |
| char b; |
| long c; |
| }; |
| |
| so that the reclaimer function can fetch the ``struct foo`` address |
| and free it:: |
| |
| call_rcu1(&foo.rcu, foo_reclaim); |
| |
| void foo_reclaim(struct rcu_head *rp) |
| { |
| struct foo *fp = container_of(rp, struct foo, rcu); |
| g_free(fp); |
| } |
| |
| ``call_rcu1`` is typically used via either the ``call_rcu`` or |
| ``g_free_rcu`` macros, which handle the common case where the |
| ``rcu_head`` member is the first of the struct. |
| |
| ``void call_rcu(T *p, void (*func)(T *p), field-name);`` |
| If the ``struct rcu_head`` is the first field in the struct, you can |
| use this macro instead of ``call_rcu1``. |
| |
| ``void g_free_rcu(T *p, field-name);`` |
| This is a special-case version of ``call_rcu`` where the callback |
| function is ``g_free``. |
| In the example given in ``call_rcu1``, one could have written simply:: |
| |
| g_free_rcu(&foo, rcu); |
| |
| ``typeof(*p) qatomic_rcu_read(p);`` |
| ``qatomic_rcu_read()`` is similar to ``qatomic_load_acquire()``, but |
| it makes some assumptions on the code that calls it. This allows a |
| more optimized implementation. |
| |
| ``qatomic_rcu_read`` assumes that whenever a single RCU critical |
| section reads multiple shared data, these reads are either |
| data-dependent or need no ordering. This is almost always the |
| case when using RCU, because read-side critical sections typically |
| navigate one or more pointers (the pointers that are changed on |
| every update) until reaching a data structure of interest, |
| and then read from there. |
| |
| RCU read-side critical sections must use ``qatomic_rcu_read()`` to |
| read data, unless concurrent writes are prevented by another |
| synchronization mechanism. |
| |
| Furthermore, RCU read-side critical sections should traverse the |
| data structure in a single direction, opposite to the direction |
| in which the updater initializes it. |
| |
| ``void qatomic_rcu_set(p, typeof(*p) v);`` |
| ``qatomic_rcu_set()`` is similar to ``qatomic_store_release()``, |
| though it also makes assumptions on the code that calls it in |
| order to allow a more optimized implementation. |
| |
| In particular, ``qatomic_rcu_set()`` suffices for synchronization |
| with readers, if the updater never mutates a field within a |
| data item that is already accessible to readers. This is the |
| case when initializing a new copy of the RCU-protected data |
| structure; just ensure that initialization of ``*p`` is carried out |
| before ``qatomic_rcu_set()`` makes the data item visible to readers. |
If this rule is observed, writes will happen in the opposite order
as reads in the RCU read-side critical sections (or there will be
just one update), and there will be no need for any other
synchronization mechanism to coordinate the accesses.
| |
| The following APIs must be used before RCU is used in a thread: |
| |
| ``void rcu_register_thread(void);`` |
| Mark a thread as taking part in the RCU mechanism. Such a thread |
| will have to report quiescent points regularly, either manually |
| or through the ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs. |
| |
| ``void rcu_unregister_thread(void);`` |
| Mark a thread as not taking part anymore in the RCU mechanism. |
| It is not a problem if such a thread reports quiescent points, |
| either manually or by using the |
| ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs. |
| |
| Note that these APIs are relatively heavyweight, and should **not** be |
| nested. |
| |
| Convenience macros |
| ------------------ |
| |
| Two macros are provided that automatically release the read lock at the |
| end of the scope. |
| |
| ``RCU_READ_LOCK_GUARD()`` |
| Takes the lock and will release it at the end of the block it's |
| used in. |
| |
| ``WITH_RCU_READ_LOCK_GUARD() { code }`` |
| Is used at the head of a block to protect the code within the block. |
| |
| Note that a ``goto`` out of the guarded block will also drop the lock. |
| |
| Differences with Linux |
| ---------------------- |
| |
| - Waiting on a mutex is possible, though discouraged, within an RCU critical |
| section. This is because spinlocks are rarely (if ever) used in userspace |
| programming; not allowing this would prevent upgrading an RCU read-side |
| critical section to become an updater. |
| |
| - ``qatomic_rcu_read`` and ``qatomic_rcu_set`` replace ``rcu_dereference`` and |
| ``rcu_assign_pointer``. They take a **pointer** to the variable being accessed. |
| |
| - ``call_rcu`` is a macro that has an extra argument (the name of the first |
| field in the struct, which must be a struct ``rcu_head``), and expects the |
| type of the callback's argument to be the type of the first argument. |
| ``call_rcu1`` is the same as Linux's ``call_rcu``. |
| |
| |
| RCU Patterns |
| ------------ |
| |
Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.
| |
| In general, RCU can be used whenever it is possible to create a new |
| "version" of a data structure every time the updater runs. This may |
| sound like a very strict restriction, however: |
| |
- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step". See the
  resizable array example below.
| |
| - in some cases, creating a new version of a data structure may actually |
| be very cheap. For example, modifying the "next" pointer of a singly |
| linked list is effectively creating a new version of the list. |
| |
| Here are some frequently-used RCU idioms that are worth noting. |
| |
| |
| RCU list processing |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| TBD (not yet used in QEMU) |
| |
| |
| RCU reference counting |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Because grace periods are not allowed to complete while there is an RCU |
| read-side critical section in progress, the RCU read-side primitives |
| may be used as a restricted reference-counting mechanism. For example, |
| consider the following code fragment:: |
| |
| rcu_read_lock(); |
| p = qatomic_rcu_read(&foo); |
| /* do something with p. */ |
| rcu_read_unlock(); |
| |
| The RCU read-side critical section ensures that the value of ``p`` remains |
| valid until after the ``rcu_read_unlock()``. In some sense, it is acquiring |
| a reference to ``p`` that is later released when the critical section ends. |
| The write side looks simply like this (with appropriate locking):: |
| |
| qemu_mutex_lock(&foo_mutex); |
| old = foo; |
| qatomic_rcu_set(&foo, new); |
| qemu_mutex_unlock(&foo_mutex); |
| synchronize_rcu(); |
| free(old); |
| |
| If the processing cannot be done purely within the critical section, it |
| is possible to combine this idiom with a "real" reference count:: |
| |
| rcu_read_lock(); |
| p = qatomic_rcu_read(&foo); |
| foo_ref(p); |
| rcu_read_unlock(); |
| /* do something with p. */ |
| foo_unref(p); |
| |
| The write side can be like this:: |
| |
| qemu_mutex_lock(&foo_mutex); |
| old = foo; |
| qatomic_rcu_set(&foo, new); |
| qemu_mutex_unlock(&foo_mutex); |
| synchronize_rcu(); |
| foo_unref(old); |
| |
| or with ``call_rcu``:: |
| |
| qemu_mutex_lock(&foo_mutex); |
| old = foo; |
| qatomic_rcu_set(&foo, new); |
| qemu_mutex_unlock(&foo_mutex); |
    call_rcu(old, foo_unref, rcu);
| |
| In both cases, the write side only performs removal. Reclamation |
| happens when the last reference to a ``foo`` object is dropped. |
| Using ``synchronize_rcu()`` is undesirably expensive, because the |
| last reference may be dropped on the read side. Hence you can |
| use ``call_rcu()`` instead:: |
| |
    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }
| |
| |
| Note that the same idioms would be possible with reader/writer |
| locks:: |
| |
    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);
| |
| ------------------------------------------------------------------ |
| |
| read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock); |
| p = foo; old = foo; |
| foo_ref(p); foo = new; |
| read_unlock(&foo_rwlock); foo_unref(old); |
| /* do something with p. */ write_mutex_unlock(&foo_rwlock); |
| read_lock(&foo_rwlock); |
| foo_unref(p); |
| read_unlock(&foo_rwlock); |
| |
| ``foo_unref`` could use a mechanism such as bottom halves to move deallocation |
| out of the write-side critical section. |
| |
| |
| RCU resizable arrays |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| Resizable arrays can be used with RCU. The expensive RCU synchronization |
| (or ``call_rcu``) only needs to take place when the array is resized. |
| The two items to take care of are: |
| |
| - ensuring that the old version of the array is available between removal |
| and reclamation; |
| |
| - avoiding mismatches in the read side between the array data and the |
| array size. |
| |
| The first problem is avoided simply by not using ``realloc``. Instead, |
| each resize will allocate a new array and copy the old data into it. |
| The second problem would arise if the size and the data pointers were |
| two members of a larger struct:: |
| |
| struct mystuff { |
| ... |
| int data_size; |
| int data_alloc; |
| T *data; |
| ... |
| }; |
| |
| Instead, we store the size of the array with the array itself:: |
| |
| struct arr { |
| int size; |
| int alloc; |
| T data[]; |
| }; |
| struct arr *global_array; |
| |
| read side: |
| rcu_read_lock(); |
| struct arr *array = qatomic_rcu_read(&global_array); |
| x = i < array->size ? array->data[i] : -1; |
| rcu_read_unlock(); |
| return x; |
| |
| write side (running under a lock): |
| if (global_array->size == global_array->alloc) { |
| /* Creating a new version. */ |
| new_array = g_malloc(sizeof(struct arr) + |
| global_array->alloc * 2 * sizeof(T)); |
| new_array->size = global_array->size; |
| new_array->alloc = global_array->alloc * 2; |
| memcpy(new_array->data, global_array->data, |
| global_array->alloc * sizeof(T)); |
| |
| /* Removal phase. */ |
| old_array = global_array; |
| qatomic_rcu_set(&global_array, new_array); |
| synchronize_rcu(); |
| |
| /* Reclamation phase. */ |
| free(old_array); |
| } |
| |
| |
| References |
| ---------- |
| |
| * The `Linux kernel RCU documentation <https://docs.kernel.org/RCU/>`__ |