Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures.  RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone.  Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task).  In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL).  Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
sections of code with "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started **after** the beginning
of the wait do not extend the wait, despite running concurrently with
the updater.  This is the reason why RCU is more scalable than,
for example, reader-writer locks.  It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible?  The basic idea is to split updates in two phases,
"removal" and "reclamation".  During removal, we ensure that subsequent
readers will not be able to get a reference to the old data.  After
removal has completed, a critical section will not be able to access
the old data.  Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and the old data can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture::

    thread 1                  thread 2                  thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                    enter RCU crit.sec.
    exit RCU crit.sec.            |                           |
                            complete wait                     |
                            begin reclamation phase           |
                                                       exit RCU crit.sec.


Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data.  This is possible, because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.


RCU API
-------

The core RCU API is small:

``void rcu_read_lock(void);``
    Used by a reader to inform the reclaimer that the reader is
    entering an RCU read-side critical section.

``void rcu_read_unlock(void);``
    Used by a reader to inform the reclaimer that the reader is
    exiting an RCU read-side critical section.  Note that RCU
    read-side critical sections may be nested and/or overlapping.

``void synchronize_rcu(void);``
    Blocks until all pre-existing RCU read-side critical sections
    on all threads have completed.  This marks the end of the removal
    phase and the beginning of the reclamation phase.

    Note that it would be valid for another update to come while
    ``synchronize_rcu`` is running.  Because of this, it is better that
    the updater releases any locks it may hold before calling
    ``synchronize_rcu``.  If this is not possible (for example, because
    the updater is protected by the BQL), you can use ``call_rcu``.

``void call_rcu1(struct rcu_head *head, void (*func)(struct rcu_head *head));``
    This function invokes ``func(head)`` after all pre-existing RCU
    read-side critical sections on all threads have completed.  This
    marks the end of the removal phase, with ``func`` taking care of the
    reclamation phase asynchronously.

    The ``foo`` struct needs to have an ``rcu_head`` structure added,
    perhaps as follows::

        struct foo {
            struct rcu_head rcu;
            int a;
            char b;
            long c;
        };

    so that the reclaimer function can fetch the ``struct foo`` address
    and free it::

        call_rcu1(&foo.rcu, foo_reclaim);

        void foo_reclaim(struct rcu_head *rp)
        {
            struct foo *fp = container_of(rp, struct foo, rcu);
            g_free(fp);
        }

    ``call_rcu1`` is typically used via either the ``call_rcu`` or
    ``g_free_rcu`` macros, which handle the common case where the
    ``rcu_head`` member is the first of the struct.

``void call_rcu(T *p, void (*func)(T *p), field-name);``
    If the ``struct rcu_head`` is the first field in the struct, you can
    use this macro instead of ``call_rcu1``.

``void g_free_rcu(T *p, field-name);``
    This is a special-case version of ``call_rcu`` where the callback
    function is ``g_free``.
    In the example given in ``call_rcu1``, one could have written simply::

        g_free_rcu(&foo, rcu);

``typeof(*p) qatomic_rcu_read(p);``
    ``qatomic_rcu_read()`` is similar to ``qatomic_load_acquire()``, but
    it makes some assumptions on the code that calls it.  This allows a
    more optimized implementation.

    ``qatomic_rcu_read`` assumes that whenever a single RCU critical
    section reads multiple shared data, these reads are either
    data-dependent or need no ordering.  This is almost always the
    case when using RCU, because read-side critical sections typically
    navigate one or more pointers (the pointers that are changed on
    every update) until reaching a data structure of interest,
    and then read from there.

    RCU read-side critical sections must use ``qatomic_rcu_read()`` to
    read data, unless concurrent writes are prevented by another
    synchronization mechanism.

    Furthermore, RCU read-side critical sections should traverse the
    data structure in a single direction, opposite to the direction
    in which the updater initializes it.

``void qatomic_rcu_set(p, typeof(*p) v);``
    ``qatomic_rcu_set()`` is similar to ``qatomic_store_release()``,
    though it also makes assumptions on the code that calls it in
    order to allow a more optimized implementation.

    In particular, ``qatomic_rcu_set()`` suffices for synchronization
    with readers, if the updater never mutates a field within a
    data item that is already accessible to readers.  This is the
    case when initializing a new copy of the RCU-protected data
    structure; just ensure that initialization of ``*p`` is carried out
    before ``qatomic_rcu_set()`` makes the data item visible to readers.
    If this rule is observed, writes will happen in the opposite
    order as reads in the RCU read-side critical sections (or if
    there is just one update), and there will be no need for another
    synchronization mechanism to coordinate the accesses.

The following APIs must be used before RCU is used in a thread:

``void rcu_register_thread(void);``
    Mark a thread as taking part in the RCU mechanism.  Such a thread
    will have to report quiescent points regularly, either manually
    or through the ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

``void rcu_unregister_thread(void);``
    Mark a thread as not taking part anymore in the RCU mechanism.
    It is not a problem if such a thread reports quiescent points,
    either manually or by using the
    ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

Note that these APIs are relatively heavyweight, and should **not** be
nested.

Convenience macros
------------------

Two macros are provided that automatically release the read lock at the
end of the scope.

``RCU_READ_LOCK_GUARD()``
    Takes the lock and will release it at the end of the block it's
    used in.

``WITH_RCU_READ_LOCK_GUARD()  { code }``
    Is used at the head of a block to protect the code within the block.

Note that a ``goto`` out of the guarded block will also drop the lock.

Differences with Linux
----------------------

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section.  This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- ``qatomic_rcu_read`` and ``qatomic_rcu_set`` replace ``rcu_dereference`` and
  ``rcu_assign_pointer``.  They take a **pointer** to the variable being
  accessed.

- ``call_rcu`` is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct ``rcu_head``), and expects the
  type of the callback's argument to be the type of the first argument.
  ``call_rcu1`` is the same as Linux's ``call_rcu``.

RCU Patterns
------------

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs.  This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step".  See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap.  For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list.

Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
^^^^^^^^^^^^^^^^^^^

TBD (not yet used in QEMU)


RCU reference counting
^^^^^^^^^^^^^^^^^^^^^^

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism.  For example,
consider the following code fragment::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of ``p`` remains
valid until after the ``rcu_read_unlock()``.  In some sense, it is acquiring
a reference to ``p`` that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking)::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with ``call_rcu``::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal.  Reclamation
happens when the last reference to a ``foo`` object is dropped.
Using ``synchronize_rcu()`` is undesirably expensive, because the
last reference may be dropped on the read side.  Hence you can
use ``call_rcu()`` instead::

    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }


Note that the same idioms would be possible with reader/writer
locks::

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

``foo_unref`` could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.


RCU resizable arrays
^^^^^^^^^^^^^^^^^^^^

Resizable arrays can be used with RCU.  The expensive RCU synchronization
(or ``call_rcu``) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using ``realloc``.  Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct::

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T   *data;
        ...
    };

Instead, we store the size of the array with the array itself::

    struct arr {
        int size;
        int alloc;
        T   data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = qatomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version.  */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase.  */
            old_array = global_array;
            qatomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase.  */
            free(old_array);
        }


References
----------

* The `Linux kernel RCU documentation <https://docs.kernel.org/RCU/>`__