Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures. RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone. Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task). In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL). Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
sections of code with "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started _after_ the beginning
of the wait do not extend the wait, despite running concurrently with
the updater. This is the reason why RCU is more scalable than,
for example, reader-writer locks. It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible? The basic idea is to split updates in two phases,
"removal" and "reclamation". During removal, we ensure that subsequent
readers will not be able to get a reference to the old data. After
removal has completed, a critical section will not be able to access
the old data. Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and these can now be safely reclaimed (e.g., freed or unref'ed).
Here is a picture:

     thread 1                  thread 2                 thread 3
 -------------------    ------------------------    -------------------
 enter RCU crit.sec.
        |                finish removal phase
        |                begin wait
        |                      |                    enter RCU crit.sec.
 exit RCU crit.sec.            |                           |
                        complete wait                      |
                        begin reclamation phase            |
                                                    exit RCU crit.sec.


Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data. This is possible, because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.


RCU API
=======

The core RCU API is small:

     void rcu_read_lock(void);

        Used by a reader to inform the reclaimer that the reader is
        entering an RCU read-side critical section.

     void rcu_read_unlock(void);

        Used by a reader to inform the reclaimer that the reader is
        exiting an RCU read-side critical section. Note that RCU
        read-side critical sections may be nested and/or overlapping.
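
        For example, a helper that brackets its reads with
        rcu_read_lock()/rcu_read_unlock() may safely be called both
        outside and inside another read-side critical section (the
        names below are made up purely for illustration):

            static int count_entries(void)
            {
                int n = 0;

                rcu_read_lock();
                /* ... walk an RCU-protected structure, counting ... */
                rcu_read_unlock();
                return n;
            }

            rcu_read_lock();
            n = count_entries();    /* nested critical section is fine */
            rcu_read_unlock();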

     void synchronize_rcu(void);

        Blocks until all pre-existing RCU read-side critical sections
        on all threads have completed. This marks the end of the removal
        phase and the beginning of the reclamation phase.

        Note that it would be valid for another update to come while
        synchronize_rcu is running. Because of this, it is better that
        the updater releases any locks it may hold before calling
        synchronize_rcu. If this is not possible (for example, because
        the updater is protected by the BQL), you can use call_rcu.

     void call_rcu1(struct rcu_head * head,
                    void (*func)(struct rcu_head *head));

        This function invokes func(head) after all pre-existing RCU
        read-side critical sections on all threads have completed. This
        marks the end of the removal phase, with func taking care
        asynchronously of the reclamation phase.

        The foo struct needs to have an rcu_head structure added,
        perhaps as follows:

            struct foo {
                struct rcu_head rcu;
                int a;
                char b;
                long c;
            };

        so that the reclaimer function can fetch the struct foo address
        and free it:

            call_rcu1(&foo.rcu, foo_reclaim);

            void foo_reclaim(struct rcu_head *rp)
            {
                struct foo *fp = container_of(rp, struct foo, rcu);
                g_free(fp);
            }
        For the common case where the rcu_head member is the first
        field of the struct, you can use the following macros.

     void call_rcu(T *p,
                   void (*func)(T *p),
                   field-name);
     void g_free_rcu(T *p,
                     field-name);

        call_rcu1 is typically used through these macros, in the common
        case where the "struct rcu_head" is the first field in the struct.
        If the callback function is g_free, in particular, g_free_rcu can
        be used. In the above case, one could have written simply:

            g_free_rcu(&foo, rcu);

     typeof(*p) atomic_rcu_read(p);

        atomic_rcu_read() is similar to atomic_mb_read(), but it makes
        some assumptions on the code that calls it. This allows a more
        optimized implementation.

        atomic_rcu_read assumes that whenever a single RCU critical
        section reads multiple shared data, these reads are either
        data-dependent or need no ordering. This is almost always the
        case when using RCU, because read-side critical sections typically
        navigate one or more pointers (the pointers that are changed on
        every update) until reaching a data structure of interest,
        and then read from there.

        RCU read-side critical sections must use atomic_rcu_read() to
        read data, unless concurrent writes are prevented by another
        synchronization mechanism.

        Furthermore, RCU read-side critical sections should traverse the
        data structure in a single direction, opposite to the direction
        in which the updater initializes it.
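
        As a sketch (the struct and global below are made-up names for
        illustration), a typical data-dependent access chain looks like
        this:

            struct state {
                int size;
                int *values;
            };
            struct state *state_ptr;         /* RCU-protected pointer  */

            rcu_read_lock();
            p = atomic_rcu_read(&state_ptr); /* fetch current version  */
            if (p->size > 0) {               /* these reads depend on p */
                x = p->values[0];
            }
            rcu_read_unlock();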

     void atomic_rcu_set(p, typeof(*p) v);

        atomic_rcu_set() is also similar to atomic_mb_set(), and it also
        makes assumptions on the code that calls it in order to allow a
        more optimized implementation.

        In particular, atomic_rcu_set() suffices for synchronization
        with readers, if the updater never mutates a field within a
        data item that is already accessible to readers. This is the
        case when initializing a new copy of the RCU-protected data
        structure; just ensure that initialization of *p is carried out
        before atomic_rcu_set() makes the data item visible to readers.
        If this rule is observed, writes will happen in the opposite
        order as reads in the RCU read-side critical sections (or there
        will be just a single update), and no other synchronization
        mechanism is needed to coordinate the accesses.
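
        For example (continuing the sketch above, with made-up names):

            struct state *new_state = g_new0(struct state, 1);

            /* Initialize the copy while it is still private. */
            new_state->size = n;
            new_state->values = g_new(int, n);

            /* Publish; readers now see either the old or the new copy. */
            atomic_rcu_set(&state_ptr, new_state);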

The following APIs must be used before RCU is used in a thread:

     void rcu_register_thread(void);

        Mark a thread as taking part in the RCU mechanism. Such a thread
        will have to report quiescent points regularly, either manually
        or through the QemuCond/QemuSemaphore/QemuEvent APIs.

     void rcu_unregister_thread(void);

        Mark a thread as not taking part anymore in the RCU mechanism.
        It is not a problem if such a thread reports quiescent points,
        either manually or by using the QemuCond/QemuSemaphore/QemuEvent
        APIs.

Note that these APIs are relatively heavyweight, and should _not_ be
nested.
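
A minimal sketch of a thread that takes part in RCU (the thread function
and its loop condition are hypothetical names used only for illustration):

    static void *rcu_example_thread(void *opaque)
    {
        rcu_register_thread();

        while (!should_exit) {
            rcu_read_lock();
            /* ... access RCU-protected data ... */
            rcu_read_unlock();
        }

        rcu_unregister_thread();
        return NULL;
    }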


DIFFERENCES WITH LINUX
======================

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section. This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
  rcu_assign_pointer. They take a _pointer_ to the variable being accessed.

- call_rcu is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct rcu_head), and expects the
  type of the callback's argument to be the type of the first argument.
  call_rcu1 is the same as Linux's call_rcu.
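
  For example, where Linux code would pass the embedded rcu_head and a
  callback taking a "struct rcu_head *":

      call_rcu(&fp->rcu, foo_reclaim);      /* Linux */

  the QEMU equivalents are (reusing the names from the call_rcu1 example
  above; foo_free is a hypothetical callback taking a "struct foo *"):

      call_rcu1(&fp->rcu, foo_reclaim);     /* same shape as Linux */
      call_rcu(fp, foo_free, rcu);          /* typed pointer, callback,
                                               and rcu_head field name */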


RCU PATTERNS
============

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs. This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step". See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap. For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list; a short
  sketch follows this list.
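
  As a sketch of the linked-list case (elem and prev are made-up names,
  and elem is assumed to embed a "struct rcu_head rcu" field):

      /* Updater, running under a lock. */
      atomic_rcu_set(&prev->next, elem->next);   /* removal */
      g_free_rcu(elem, rcu);                     /* reclamation */

  Readers that already reached elem keep seeing a consistent old version
  of the list until their critical sections end; readers that start
  later can no longer reach it.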

Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
-------------------

TBD (not yet used in QEMU)


RCU reference counting
----------------------

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism. For example,
consider the following code fragment:

    rcu_read_lock();
    p = atomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of "p" remains
valid until after the rcu_read_unlock(). In some sense, it is acquiring
a reference to p that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking):

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count:

    rcu_read_lock();
    p = atomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with call_rcu:

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    atomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal. Reclamation
happens when the last reference to a "foo" object is dropped.
Using synchronize_rcu() is undesirably expensive, because the
last reference may be dropped on the read side. Hence you can
use call_rcu() instead:

    void foo_unref(struct foo *p)
    {
        if (atomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }
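
Here, foo_destroy is assumed to be the function that actually frees the
object once no reader can reach it anymore, for example:

    void foo_destroy(struct foo *p)
    {
        g_free(p);
    }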


Note that the same idioms would be possible with reader/writer
locks:

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

foo_unref could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.


RCU resizable arrays
--------------------

Resizable arrays can be used with RCU. The expensive RCU synchronization
(or call_rcu) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using realloc. Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct:

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T *data;
        ...
    };

Instead, we store the size of the array with the array itself:

    struct arr {
        int size;
        int alloc;
        T data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = atomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version. */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase. */
            old_array = global_array;
            atomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase. */
            free(old_array);
        }


SOURCES
=======

* Documentation/RCU/ from the Linux kernel