| Using RCU (Read-Copy-Update) for synchronization | 
 | ================================================ | 
 |  | 
 | Read-copy update (RCU) is a synchronization mechanism that is used to | 
 | protect read-mostly data structures.  RCU is very efficient and scalable | 
 | on the read side (it is wait-free), and thus can make the read paths | 
 | extremely fast. | 
 |  | 
RCU supports concurrency between a single writer and multiple readers;
it is therefore not used alone, because something else must serialize
concurrent updates.  Typically, the write side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task).  In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL).  Restricting updates to a single task, instead, is done in
QEMU using the "bottom half" API.
 |  | 
RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
sections of code as "critical sections", and the update side waits for
the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.
 |  | 
 | The key point here is that only the currently running critical sections | 
 | are waited for; critical sections that are started _after_ the beginning | 
 | of the wait do not extend the wait, despite running concurrently with | 
 | the updater.  This is the reason why RCU is more scalable than, | 
 | for example, reader-writer locks.  It is so much more scalable that | 
 | the system will have a single instance of the RCU mechanism; a single | 
 | mechanism can be used for an arbitrary number of "things", without | 
 | having to worry about things such as contention or deadlocks. | 
 |  | 
How is this possible?  The basic idea is to split updates into two
phases, "removal" and "reclamation".  During removal, we ensure that
subsequent readers will not be able to get a reference to the old data.
After removal has completed, a critical section will not be able to
access the old data.  Therefore, critical sections that begin after
removal do not matter; as soon as all previously running critical
sections have finished, there cannot be any readers still holding
references to the old data, which can now be safely reclaimed (e.g.,
freed or unref'ed).
 |  | 
 | Here is a picture: | 
 |  | 
 |         thread 1                  thread 2                  thread 3 | 
 |     -------------------    ------------------------    ------------------- | 
 |     enter RCU crit.sec. | 
 |            |                finish removal phase | 
 |            |                begin wait | 
 |            |                      |                    enter RCU crit.sec. | 
    exit RCU crit.sec.            |                           |
 |                             complete wait                     | | 
 |                             begin reclamation phase           | | 
 |                                                        exit RCU crit.sec. | 
 |  | 
 |  | 
 | Note how thread 3 is still executing its critical section when thread 2 | 
 | starts reclaiming data.  This is possible, because the old version of the | 
 | data structure was not accessible at the time thread 3 began executing | 
 | that critical section. | 
 |  | 
 |  | 
 | RCU API | 
 | ======= | 
 |  | 
 | The core RCU API is small: | 
 |  | 
 |      void rcu_read_lock(void); | 
 |  | 
 |         Used by a reader to inform the reclaimer that the reader is | 
 |         entering an RCU read-side critical section. | 
 |  | 
 |      void rcu_read_unlock(void); | 
 |  | 
 |         Used by a reader to inform the reclaimer that the reader is | 
 |         exiting an RCU read-side critical section.  Note that RCU | 
 |         read-side critical sections may be nested and/or overlapping. | 
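
        For example, the following nesting is valid, and the enclosing
        read-side critical section only ends at the outermost
        rcu_read_unlock():

            rcu_read_lock();
            rcu_read_lock();      /* nesting is allowed */
            /* read RCU-protected data */
            rcu_read_unlock();
            rcu_read_unlock();    /* critical section ends here */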
 |  | 
 |      void synchronize_rcu(void); | 
 |  | 
 |         Blocks until all pre-existing RCU read-side critical sections | 
 |         on all threads have completed.  This marks the end of the removal | 
 |         phase and the beginning of reclamation phase. | 
 |  | 
        Note that it is valid for another update to begin while
        synchronize_rcu is running.  Because of this, it is better for
        the updater to release any locks it may hold before calling
        synchronize_rcu.  If this is not possible (for example, because
        the updater is protected by the BQL), you can use call_rcu.
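
        For example (a sketch, using the same "foo" pointer and
        foo_mutex as the patterns below, and assuming foo was allocated
        with g_malloc):

            qemu_mutex_lock(&foo_mutex);
            old = foo;
            qatomic_rcu_set(&foo, new);
            qemu_mutex_unlock(&foo_mutex);  /* drop the lock first... */
            synchronize_rcu();              /* ...then wait for readers */
            g_free(old);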
 |  | 
     void call_rcu1(struct rcu_head *head,
                    void (*func)(struct rcu_head *head));
 |  | 
 |         This function invokes func(head) after all pre-existing RCU | 
 |         read-side critical sections on all threads have completed.  This | 
 |         marks the end of the removal phase, with func taking care | 
 |         asynchronously of the reclamation phase. | 
 |  | 
 |         The foo struct needs to have an rcu_head structure added, | 
 |         perhaps as follows: | 
 |  | 
 |             struct foo { | 
 |                 struct rcu_head rcu; | 
 |                 int a; | 
 |                 char b; | 
 |                 long c; | 
 |             }; | 
 |  | 
        so that the reclaimer function can fetch the struct foo address
        and free it (here foo is a pointer to a heap-allocated
        struct foo):

            call_rcu1(&foo->rcu, foo_reclaim);
 |  | 
 |             void foo_reclaim(struct rcu_head *rp) | 
 |             { | 
 |                 struct foo *fp = container_of(rp, struct foo, rcu); | 
 |                 g_free(fp); | 
 |             } | 
 |  | 
 |         For the common case where the rcu_head member is the first of the | 
 |         struct, you can use the following macro. | 
 |  | 
 |      void call_rcu(T *p, | 
 |                    void (*func)(T *p), | 
 |                    field-name); | 
 |      void g_free_rcu(T *p, | 
 |                      field-name); | 
 |  | 
        call_rcu1 is typically used through these macros, in the common case
 |         where the "struct rcu_head" is the first field in the struct.  If | 
 |         the callback function is g_free, in particular, g_free_rcu can be | 
 |         used.  In the above case, one could have written simply: | 
 |  | 
            g_free_rcu(foo, rcu);
 |  | 
 |      typeof(*p) qatomic_rcu_read(p); | 
 |  | 
 |         qatomic_rcu_read() is similar to qatomic_load_acquire(), but it makes | 
 |         some assumptions on the code that calls it.  This allows a more | 
 |         optimized implementation. | 
 |  | 
 |         qatomic_rcu_read assumes that whenever a single RCU critical | 
 |         section reads multiple shared data, these reads are either | 
 |         data-dependent or need no ordering.  This is almost always the | 
 |         case when using RCU, because read-side critical sections typically | 
 |         navigate one or more pointers (the pointers that are changed on | 
 |         every update) until reaching a data structure of interest, | 
 |         and then read from there. | 
 |  | 
 |         RCU read-side critical sections must use qatomic_rcu_read() to | 
 |         read data, unless concurrent writes are prevented by another | 
 |         synchronization mechanism. | 
 |  | 
 |         Furthermore, RCU read-side critical sections should traverse the | 
 |         data structure in a single direction, opposite to the direction | 
 |         in which the updater initializes it. | 
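
        For example, a reader typically loads an RCU-protected pointer
        once with qatomic_rcu_read() and then dereferences it; the
        dereference is data-dependent on the load, so no further
        ordering is needed (a sketch; global_foo is a made-up pointer
        that updaters publish with qatomic_rcu_set()):

            struct foo *p;

            rcu_read_lock();
            p = qatomic_rcu_read(&global_foo);
            if (p) {
                /* Reads of p->a, p->b, ... depend on the load of p
                 * and thus cannot be reordered before it.  */
            }
            rcu_read_unlock();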
 |  | 
 |      void qatomic_rcu_set(p, typeof(*p) v); | 
 |  | 
 |         qatomic_rcu_set() is similar to qatomic_store_release(), though it also | 
 |         makes assumptions on the code that calls it in order to allow a more | 
 |         optimized implementation. | 
 |  | 
 |         In particular, qatomic_rcu_set() suffices for synchronization | 
 |         with readers, if the updater never mutates a field within a | 
 |         data item that is already accessible to readers.  This is the | 
 |         case when initializing a new copy of the RCU-protected data | 
 |         structure; just ensure that initialization of *p is carried out | 
 |         before qatomic_rcu_set() makes the data item visible to readers. | 
        If this rule is observed, writes will happen in the opposite
        order to the reads in the RCU read-side critical sections (or
        there will be just a single update), and there will be no need
        for another synchronization mechanism to coordinate the
        accesses.
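
        For example, publishing a fully initialized item could look
        like this (a sketch, reusing the made-up global_foo pointer
        from the previous example):

            struct foo *new = g_new0(struct foo, 1);

            new->a = 1;                         /* initialize first... */
            new->b = 'x';
            qatomic_rcu_set(&global_foo, new);  /* ...then publish */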
 |  | 
 | The following APIs must be used before RCU is used in a thread: | 
 |  | 
 |      void rcu_register_thread(void); | 
 |  | 
 |         Mark a thread as taking part in the RCU mechanism.  Such a thread | 
 |         will have to report quiescent points regularly, either manually | 
 |         or through the QemuCond/QemuSemaphore/QemuEvent APIs. | 
 |  | 
 |      void rcu_unregister_thread(void); | 
 |  | 
 |         Mark a thread as not taking part anymore in the RCU mechanism. | 
 |         It is not a problem if such a thread reports quiescent points, | 
 |         either manually or by using the QemuCond/QemuSemaphore/QemuEvent | 
 |         APIs. | 
 |  | 
 | Note that these APIs are relatively heavyweight, and should _not_ be | 
 | nested. | 
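
For example, a thread function could bracket its body with the two
calls (a sketch; my_thread_fn is a made-up name):

    static void *my_thread_fn(void *opaque)
    {
        rcu_register_thread();
        /* ... code that may use rcu_read_lock()/rcu_read_unlock() ... */
        rcu_unregister_thread();
        return NULL;
    }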
 |  | 
 | Convenience macros | 
 | ================== | 
 |  | 
 | Two macros are provided that automatically release the read lock at the | 
 | end of the scope. | 
 |  | 
 |       RCU_READ_LOCK_GUARD() | 
 |  | 
 |          Takes the lock and will release it at the end of the block it's | 
 |          used in. | 
 |  | 
 |       WITH_RCU_READ_LOCK_GUARD()  { code } | 
 |  | 
 |          Is used at the head of a block to protect the code within the block. | 
 |  | 
 | Note that 'goto'ing out of the guarded block will also drop the lock. | 
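
For example (a sketch; global_foo is a made-up RCU-protected pointer
to a struct with an "a" field):

    static int get_a(void)
    {
        RCU_READ_LOCK_GUARD();            /* released on return */
        return qatomic_rcu_read(&global_foo)->a;
    }

    static void print_a(void)
    {
        WITH_RCU_READ_LOCK_GUARD() {
            printf("a = %d\n", qatomic_rcu_read(&global_foo)->a);
        }                                 /* released here */
    }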
 |  | 
 | DIFFERENCES WITH LINUX | 
 | ====================== | 
 |  | 
 | - Waiting on a mutex is possible, though discouraged, within an RCU critical | 
 |   section.  This is because spinlocks are rarely (if ever) used in userspace | 
 |   programming; not allowing this would prevent upgrading an RCU read-side | 
 |   critical section to become an updater. | 
 |  | 
 | - qatomic_rcu_read and qatomic_rcu_set replace rcu_dereference and | 
 |   rcu_assign_pointer.  They take a _pointer_ to the variable being accessed. | 
 |  | 
- call_rcu is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct rcu_head), and expects the
  type of the callback's argument to be the type of the first argument.
  call_rcu1 is the same as Linux's call_rcu.  See the example below.
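
For example, the same deferred free would be spelled as follows in the
two APIs (a sketch; foo points to a struct foo whose first field is a
struct rcu_head named rcu):

    /* Linux: the callback receives the struct rcu_head pointer
     * and uses container_of to recover the enclosing object.  */
    call_rcu(&foo->rcu, foo_reclaim);

    /* QEMU: the callback receives the struct foo pointer directly.  */
    call_rcu(foo, foo_reclaim, rcu);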
 |  | 
 |  | 
 | RCU PATTERNS | 
 | ============ | 
 |  | 
Many patterns using reader-writer locks translate directly to RCU, with
 | the advantages of higher scalability and deadlock immunity. | 
 |  | 
 | In general, RCU can be used whenever it is possible to create a new | 
 | "version" of a data structure every time the updater runs.  This may | 
 | sound like a very strict restriction, however: | 
 |  | 
 | - the updater does not mean "everything that writes to a data structure", | 
 |   but rather "everything that involves a reclamation step".  See the | 
  array example below.
 |  | 
- in some cases, creating a new version of a data structure may actually
  be very cheap.  For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list; see
  the sketch below.
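
Concretely, unlinking an element from a singly linked list replaces one
"next" pointer, and only the unlinked node needs reclamation (a sketch;
the struct node type and node_free are made-up names, and the update is
assumed to run under the list's write-side lock):

    struct node {
        struct rcu_head rcu;   /* first field, so call_rcu can be used */
        struct node *next;
        int val;
    };

    static void node_free(struct node *n)
    {
        g_free(n);
    }

    /* Removal: unlink the node that follows "prev".  */
    struct node *dead = prev->next;
    qatomic_rcu_set(&prev->next, dead->next);

    /* Reclamation, deferred until all current readers are done.  */
    call_rcu(dead, node_free, rcu);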
 |  | 
 | Here are some frequently-used RCU idioms that are worth noting. | 
 |  | 
 |  | 
 | RCU list processing | 
 | ------------------- | 
 |  | 
 | TBD (not yet used in QEMU) | 
 |  | 
 |  | 
 | RCU reference counting | 
 | ---------------------- | 
 |  | 
 | Because grace periods are not allowed to complete while there is an RCU | 
 | read-side critical section in progress, the RCU read-side primitives | 
 | may be used as a restricted reference-counting mechanism.  For example, | 
 | consider the following code fragment: | 
 |  | 
 |     rcu_read_lock(); | 
 |     p = qatomic_rcu_read(&foo); | 
 |     /* do something with p. */ | 
 |     rcu_read_unlock(); | 
 |  | 
 | The RCU read-side critical section ensures that the value of "p" remains | 
 | valid until after the rcu_read_unlock().  In some sense, it is acquiring | 
 | a reference to p that is later released when the critical section ends. | 
 | The write side looks simply like this (with appropriate locking): | 
 |  | 
 |     qemu_mutex_lock(&foo_mutex); | 
 |     old = foo; | 
 |     qatomic_rcu_set(&foo, new); | 
 |     qemu_mutex_unlock(&foo_mutex); | 
 |     synchronize_rcu(); | 
 |     free(old); | 
 |  | 
 | If the processing cannot be done purely within the critical section, it | 
 | is possible to combine this idiom with a "real" reference count: | 
 |  | 
 |     rcu_read_lock(); | 
 |     p = qatomic_rcu_read(&foo); | 
 |     foo_ref(p); | 
 |     rcu_read_unlock(); | 
 |     /* do something with p. */ | 
 |     foo_unref(p); | 
 |  | 
 | The write side can be like this: | 
 |  | 
 |     qemu_mutex_lock(&foo_mutex); | 
 |     old = foo; | 
 |     qatomic_rcu_set(&foo, new); | 
 |     qemu_mutex_unlock(&foo_mutex); | 
 |     synchronize_rcu(); | 
 |     foo_unref(old); | 
 |  | 
 | or with call_rcu: | 
 |  | 
 |     qemu_mutex_lock(&foo_mutex); | 
 |     old = foo; | 
 |     qatomic_rcu_set(&foo, new); | 
 |     qemu_mutex_unlock(&foo_mutex); | 
    call_rcu(old, foo_unref, rcu);
 |  | 
 | In both cases, the write side only performs removal.  Reclamation | 
 | happens when the last reference to a "foo" object is dropped. | 
 | Using synchronize_rcu() is undesirably expensive, because the | 
 | last reference may be dropped on the read side.  Hence you can | 
 | use call_rcu() instead: | 
 |  | 
    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }
 |  | 
 |  | 
 | Note that the same idioms would be possible with reader/writer | 
 | locks: | 
 |  | 
    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);
 |  | 
 |     ------------------------------------------------------------------ | 
 |  | 
 |     read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock); | 
 |     p = foo;                        old = foo; | 
 |     foo_ref(p);                     foo = new; | 
 |     read_unlock(&foo_rwlock);       foo_unref(old); | 
 |     /* do something with p. */      write_mutex_unlock(&foo_rwlock); | 
 |     read_lock(&foo_rwlock); | 
 |     foo_unref(p); | 
 |     read_unlock(&foo_rwlock); | 
 |  | 
 | foo_unref could use a mechanism such as bottom halves to move deallocation | 
 | out of the write-side critical section. | 
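
For example, using the existing aio_bh_schedule_oneshot() API, the
rwlock-based foo_unref could defer the g_free to the main loop
(a sketch; foo_free_bh is a made-up name):

    static void foo_free_bh(void *opaque)
    {
        g_free(opaque);
    }

    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                    foo_free_bh, p);
        }
    }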
 |  | 
 |  | 
 | RCU resizable arrays | 
 | -------------------- | 
 |  | 
 | Resizable arrays can be used with RCU.  The expensive RCU synchronization | 
 | (or call_rcu) only needs to take place when the array is resized. | 
 | The two items to take care of are: | 
 |  | 
 | - ensuring that the old version of the array is available between removal | 
 |   and reclamation; | 
 |  | 
 | - avoiding mismatches in the read side between the array data and the | 
 |   array size. | 
 |  | 
 | The first problem is avoided simply by not using realloc.  Instead, | 
 | each resize will allocate a new array and copy the old data into it. | 
The second problem would arise if the size and the data pointer were
two separate members of a larger struct:
 |  | 
 |     struct mystuff { | 
 |         ... | 
 |         int data_size; | 
 |         int data_alloc; | 
 |         T   *data; | 
 |         ... | 
 |     }; | 
 |  | 
 | Instead, we store the size of the array with the array itself: | 
 |  | 
 |     struct arr { | 
 |         int size; | 
 |         int alloc; | 
 |         T   data[]; | 
 |     }; | 
 |     struct arr *global_array; | 
 |  | 
 |     read side: | 
 |         rcu_read_lock(); | 
 |         struct arr *array = qatomic_rcu_read(&global_array); | 
 |         x = i < array->size ? array->data[i] : -1; | 
 |         rcu_read_unlock(); | 
 |         return x; | 
 |  | 
 |     write side (running under a lock): | 
 |         if (global_array->size == global_array->alloc) { | 
 |             /* Creating a new version.  */ | 
 |             new_array = g_malloc(sizeof(struct arr) + | 
 |                                  global_array->alloc * 2 * sizeof(T)); | 
 |             new_array->size = global_array->size; | 
 |             new_array->alloc = global_array->alloc * 2; | 
 |             memcpy(new_array->data, global_array->data, | 
 |                    global_array->alloc * sizeof(T)); | 
 |  | 
 |             /* Removal phase.  */ | 
 |             old_array = global_array; | 
 |             qatomic_rcu_set(&global_array, new_array); | 
 |             synchronize_rcu(); | 
 |  | 
 |             /* Reclamation phase.  */ | 
            g_free(old_array);
 |         } | 
 |  | 
 |  | 
 | SOURCES | 
 | ======= | 
 |  | 
 | * Documentation/RCU/ from the Linux kernel |