| .. _atomics-ref: | 
 |  | 
 | ========================= | 
 | Atomic operations in QEMU | 
 | ========================= | 
 |  | 
CPUs perform independent memory operations effectively in random order,
 | but this can be a problem for CPU-CPU interaction (including interactions | 
 | between QEMU and the guest).  Multi-threaded programs use various tools | 
 | to instruct the compiler and the CPU to restrict the order to something | 
 | that is consistent with the expectations of the programmer. | 
 |  | 
 | The most basic tool is locking.  Mutexes, condition variables and | 
 | semaphores are used in QEMU, and should be the default approach to | 
 | synchronization.  Anything else is considerably harder, but it's | 
 | also justified more often than one would like; | 
 | the most performance-critical parts of QEMU in particular require | 
a very low-level approach to concurrency, involving memory barriers
 | and atomic operations.  The semantics of concurrent memory accesses are governed | 
 | by the C11 memory model. | 
 |  | 
 | QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to | 
 | provide better portability and a less verbose syntax.  ``qemu/atomic.h`` | 
provides macros that fall into three camps:
 |  | 
 | - compiler barriers: ``barrier()``; | 
 |  | 
 | - weak atomic access and manual memory barriers: ``qatomic_read()``, | 
 |   ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, | 
 |   ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``, | 
 |   ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``; | 
 |  | 
 | - sequentially consistent atomic access: everything else. | 
 |  | 
 | In general, use of ``qemu/atomic.h`` should be wrapped with more easily | 
 | used data structures (e.g. the lock-free singly-linked list operations | 
 | ``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization | 
 | primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of | 
 | atomic operations and memory barriers should be limited to inter-thread | 
 | checking of flags and documented thoroughly. | 
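
For instance, a "please stop" flag polled by a worker thread is a
legitimate use of bare atomic accesses.  A minimal sketch, with
hypothetical ``should_quit`` and ``do_work()``::

    static bool should_quit;

    /* requesting thread */
    qatomic_set(&should_quit, true);

    /* worker thread */
    while (!qatomic_read(&should_quit)) {
        do_work();
    }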
 |  | 
 |  | 
 |  | 
 | Compiler memory barrier | 
 | ======================= | 
 |  | 
 | ``barrier()`` prevents the compiler from moving the memory accesses on | 
 | either side of it to the other side.  The compiler barrier has no direct | 
 | effect on the CPU, which may then reorder things however it wishes. | 
 |  | 
 | ``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some | 
 | architectures, CPU guarantees are strong enough that blocking compiler | 
 | optimizations already ensures the correct order of execution.  In this | 
 | case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple | 
 | compiler barriers. | 
 |  | 
 | Still, ``barrier()`` can be useful when writing code that can be interrupted | 
 | by signal handlers. | 
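
A minimal sketch, assuming that ``data`` and ``data_ready`` are shared
only with a signal handler running on the same thread, so that compiler
reordering (not CPU reordering) is the only concern::

    /* interruptible code */
    data = compute();    /* hypothetical computation */
    barrier();           /* order the two stores in the generated code */
    data_ready = 1;

    /* signal handler, running on the same thread */
    if (data_ready) {
        consume(data);   /* hypothetical; sees the store to "data" */
    }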
 |  | 
 |  | 
 | Sequentially consistent atomic access | 
 | ===================================== | 
 |  | 
 | Most of the operations in the ``qemu/atomic.h`` header ensure *sequential | 
 | consistency*, where "the result of any execution is the same as if the | 
 | operations of all the processors were executed in some sequential order, | 
 | and the operations of each individual processor appear in this sequence | 
 | in the order specified by its program". | 
 |  | 
 | ``qemu/atomic.h`` provides the following set of atomic read-modify-write | 
 | operations:: | 
 |  | 
 |     void qatomic_inc(ptr) | 
 |     void qatomic_dec(ptr) | 
 |     void qatomic_add(ptr, val) | 
 |     void qatomic_sub(ptr, val) | 
 |     void qatomic_and(ptr, val) | 
 |     void qatomic_or(ptr, val) | 
 |  | 
 |     typeof(*ptr) qatomic_fetch_inc(ptr) | 
 |     typeof(*ptr) qatomic_fetch_dec(ptr) | 
 |     typeof(*ptr) qatomic_fetch_add(ptr, val) | 
 |     typeof(*ptr) qatomic_fetch_sub(ptr, val) | 
 |     typeof(*ptr) qatomic_fetch_and(ptr, val) | 
 |     typeof(*ptr) qatomic_fetch_or(ptr, val) | 
 |     typeof(*ptr) qatomic_fetch_xor(ptr, val) | 
 |     typeof(*ptr) qatomic_fetch_inc_nonzero(ptr) | 
 |     typeof(*ptr) qatomic_xchg(ptr, val) | 
 |     typeof(*ptr) qatomic_cmpxchg(ptr, old, new) | 
 |  | 
where the ``typeof(*ptr)`` operations return the old value of ``*ptr``
and the ``void`` ones return nothing.  These operations are polymorphic;
they operate on any type that is as wide as a pointer or smaller.
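
For example, a reference count can be dropped with ``qatomic_fetch_dec()``,
using the returned old value to detect the last reference.  A minimal
sketch, with hypothetical ``obj`` and ``object_free()``::

    if (qatomic_fetch_dec(&obj->refcount) == 1) {
        /* the old value was 1, so this was the last reference */
        object_free(obj);
    }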
 |  | 
 | Similar operations return the new value of ``*ptr``:: | 
 |  | 
 |     typeof(*ptr) qatomic_inc_fetch(ptr) | 
 |     typeof(*ptr) qatomic_dec_fetch(ptr) | 
 |     typeof(*ptr) qatomic_add_fetch(ptr, val) | 
 |     typeof(*ptr) qatomic_sub_fetch(ptr, val) | 
 |     typeof(*ptr) qatomic_and_fetch(ptr, val) | 
 |     typeof(*ptr) qatomic_or_fetch(ptr, val) | 
 |     typeof(*ptr) qatomic_xor_fetch(ptr, val) | 
 |  | 
 | ``qemu/atomic.h`` also provides an optimized shortcut for | 
 | ``qatomic_set`` followed by ``smp_mb``:: | 
 |  | 
 |     void         qatomic_set_mb(ptr, val) | 
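
In other words, the following two fragments should behave identically,
with ``qatomic_set_mb()`` possibly compiling to cheaper code::

    qatomic_set_mb(&a, 1);

    /* equivalent to: */
    qatomic_set(&a, 1);
    smp_mb();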
 |  | 
 |  | 
 | Weak atomic access and manual memory barriers | 
 | ============================================= | 
 |  | 
 | Compared to sequentially consistent atomic access, programming with | 
 | weaker consistency models can be considerably more complicated. | 
 | The only guarantees that you can rely upon in this case are: | 
 |  | 
 | - atomic accesses will not cause data races (and hence undefined behavior); | 
 |   ordinary accesses instead cause data races if they are concurrent with | 
 |   other accesses of which at least one is a write.  In order to ensure this, | 
 |   the compiler will not optimize accesses out of existence, create unsolicited | 
  accesses, or perform other similar optimizations.
 |  | 
 | - acquire operations will appear to happen, with respect to the other | 
 |   components of the system, before all the LOAD or STORE operations | 
 |   specified afterwards. | 
 |  | 
 | - release operations will appear to happen, with respect to the other | 
 |   components of the system, after all the LOAD or STORE operations | 
 |   specified before. | 
 |  | 
 | - release operations will *synchronize with* acquire operations; | 
 |   see :ref:`acqrel` for a detailed explanation. | 
 |  | 
 | When using this model, variables are accessed with: | 
 |  | 
 | - ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from | 
 |   optimizing accesses out of existence and creating unsolicited | 
 |   accesses, but do not otherwise impose any ordering on loads and | 
 |   stores: both the compiler and the processor are free to reorder | 
 |   them. | 
 |  | 
 | - ``qatomic_load_acquire()``, which guarantees the LOAD to appear to | 
 |   happen, with respect to the other components of the system, | 
 |   before all the LOAD or STORE operations specified afterwards. | 
 |   Operations coming before ``qatomic_load_acquire()`` can still be | 
 |   reordered after it. | 
 |  | 
 | - ``qatomic_store_release()``, which guarantees the STORE to appear to | 
 |   happen, with respect to the other components of the system, | 
 |   after all the LOAD or STORE operations specified before. | 
 |   Operations coming after ``qatomic_store_release()`` can still be | 
 |   reordered before it. | 
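
A typical use of these two operations is publishing data through a flag.
A minimal sketch, with illustrative names ``msg`` and ``ready``::

    /* writer */
    msg = 42;                             /* plain store */
    qatomic_store_release(&ready, true);  /* "msg" is visible before "ready" */

    /* reader */
    if (qatomic_load_acquire(&ready)) {   /* pairs with the store-release */
        assert(msg == 42);                /* guaranteed to see msg = 42 */
    }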
 |  | 
 | Restrictions to the ordering of accesses can also be specified | 
 | using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, | 
 | ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``. | 
 |  | 
 | Memory barriers control the order of references to shared memory. | 
 | They come in six kinds: | 
 |  | 
 | - ``smp_rmb()`` guarantees that all the LOAD operations specified before | 
 |   the barrier will appear to happen before all the LOAD operations | 
 |   specified after the barrier with respect to the other components of | 
 |   the system. | 
 |  | 
 |   In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not | 
 |   required to have any effect on stores. | 
 |  | 
 | - ``smp_wmb()`` guarantees that all the STORE operations specified before | 
 |   the barrier will appear to happen before all the STORE operations | 
 |   specified after the barrier with respect to the other components of | 
 |   the system. | 
 |  | 
 |   In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not | 
 |   required to have any effect on loads. | 
 |  | 
 | - ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before | 
 |   the barrier will appear to happen before all the LOAD or STORE operations | 
 |   specified after the barrier with respect to the other components of | 
 |   the system. | 
 |  | 
 | - ``smp_mb_release()`` guarantees that all the STORE operations specified *after* | 
 |   the barrier will appear to happen after all the LOAD or STORE operations | 
 |   specified *before* the barrier with respect to the other components of | 
 |   the system. | 
 |  | 
 | - ``smp_mb()`` guarantees that all the LOAD and STORE operations specified | 
 |   before the barrier will appear to happen before all the LOAD and | 
 |   STORE operations specified after the barrier with respect to the other | 
 |   components of the system. | 
 |  | 
 |   ``smp_mb()`` puts a partial ordering on both loads and stores.  It is | 
 |   stronger than both a read and a write memory barrier; it implies both | 
 |   ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs | 
 |   coming before the barrier from overtaking LOADs coming after the | 
 |   barrier and vice versa. | 
 |  | 
 | - ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On | 
 |   most processors, whenever two loads are performed such that the | 
 |   second depends on the result of the first (e.g., the first load | 
 |   retrieves the address to which the second load will be directed), | 
 |   the processor will guarantee that the first LOAD will appear to happen | 
 |   before the second with respect to the other components of the system. | 
 |   Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``, | 
 |   ``smp_read_barrier_depends()`` can be just a compiler barrier on | 
  weakly-ordered architectures such as Arm or PPC [#]_.
 |  | 
  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required; see the
  sketch after the footnote below.
 |  | 
 | .. [#] The DEC Alpha is an exception, because ``smp_read_barrier_depends()`` | 
 |    needs a processor barrier.  On strongly-ordered architectures such | 
 |    as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can | 
 |    also be compiler barriers only. | 
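
To illustrate the difference between the two kinds of dependency, here is
a sketch with illustrative variables ``ptr``, ``flag`` and ``data``::

    /* data dependency: the second load consumes the value returned by
     * the first one, so smp_read_barrier_depends() is enough */
    p = qatomic_read(&ptr);
    smp_read_barrier_depends();
    val = p->field;

    /* control dependency: the second load is only reached through a
     * conditional, so a full read barrier or better is required */
    if (qatomic_read(&flag)) {
        smp_rmb();
        val = data;
    }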
 |  | 
 | Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are | 
 | mostly used when a data structure has one thread that is always a writer | 
 | and one thread that is always a reader: | 
 |  | 
 |     +----------------------------------+----------------------------------+ | 
 |     | thread 1                         | thread 2                         | | 
 |     +==================================+==================================+ | 
 |     | ::                               | ::                               | | 
 |     |                                  |                                  | | 
 |     |   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  | | 
 |     |   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  | | 
 |     +----------------------------------+----------------------------------+ | 
 |  | 
 | In this case, correctness is easy to check for using the "pairing" | 
 | trick that is explained below. | 
 |  | 
 | Sometimes, a thread is accessing many variables that are otherwise | 
 | unrelated to each other (for example because, apart from the current | 
 | thread, exactly one other thread will read or write each of these | 
 | variables).  In this case, it is possible to "hoist" the barriers | 
 | outside a loop.  For example: | 
 |  | 
 |     +------------------------------------------+----------------------------------+ | 
 |     | before                                   | after                            | | 
 |     +==========================================+==================================+ | 
 |     | ::                                       | ::                               | | 
 |     |                                          |                                  | | 
 |     |   n = 0;                                 |   n = 0;                         | | 
 |     |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       | | 
 |     |     n += qatomic_load_acquire(&a[i]);    |     n += qatomic_read(&a[i]);    | | 
 |     |                                          |   smp_mb_acquire();              | | 
 |     +------------------------------------------+----------------------------------+ | 
 |     | ::                                       | ::                               | | 
 |     |                                          |                                  | | 
 |     |                                          |   smp_mb_release();              | | 
 |     |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       | | 
 |     |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   | | 
 |     +------------------------------------------+----------------------------------+ | 
 |  | 
 | Splitting a loop can also be useful to reduce the number of barriers: | 
 |  | 
 |     +------------------------------------------+----------------------------------+ | 
 |     | before                                   | after                            | | 
 |     +==========================================+==================================+ | 
 |     | ::                                       | ::                               | | 
 |     |                                          |                                  | | 
 |     |   n = 0;                                 |     smp_mb_release();            | | 
 |     |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     | | 
 |     |     qatomic_store_release(&a[i], false); |       qatomic_set(&a[i], false); | | 
    |     smp_mb();                            |     smp_mb();                    |
 |     |     n += qatomic_read(&b[i]);            |     n = 0;                       | | 
 |     |   }                                      |     for (i = 0; i < 10; i++)     | | 
 |     |                                          |       n += qatomic_read(&b[i]);  | | 
 |     +------------------------------------------+----------------------------------+ | 
 |  | 
In this case, an ``smp_mb_release()`` can also be replaced with a (possibly
cheaper, and clearer as well) ``smp_wmb()``:
 |  | 
 |     +------------------------------------------+----------------------------------+ | 
 |     | before                                   | after                            | | 
 |     +==========================================+==================================+ | 
 |     | ::                                       | ::                               | | 
 |     |                                          |                                  | | 
 |     |                                          |     smp_mb_release();            | | 
 |     |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     | | 
 |     |     qatomic_store_release(&a[i], false); |       qatomic_set(&a[i], false); | | 
    |     qatomic_store_release(&b[i], false); |     smp_wmb();                   |
 |     |   }                                      |     for (i = 0; i < 10; i++)     | | 
 |     |                                          |       qatomic_set(&b[i], false); | | 
 |     +------------------------------------------+----------------------------------+ | 
 |  | 
 |  | 
 | .. _acqrel: | 
 |  | 
 | Acquire/release pairing and the *synchronizes-with* relation | 
 | ------------------------------------------------------------ | 
 |  | 
 | Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have | 
 | either *acquire* or *release* semantics [#rmw]_.  This has two effects: | 
 |  | 
 | .. [#rmw] Read-modify-write operations can have both---acquire applies to the | 
 |           read part, and release to the write. | 
 |  | 
 | - within a thread, they are ordered either before subsequent operations | 
 |   (for acquire) or after previous operations (for release). | 
 |  | 
 | - if a release operation in one thread *synchronizes with* an acquire operation | 
 |   in another thread, the ordering constraints propagates from the first to the | 
 |   second thread.  That is, everything before the release operation in the | 
 |   first thread is guaranteed to *happen before* everything after the | 
 |   acquire operation in the second thread. | 
 |  | 
 | The concept of acquire and release semantics is not exclusive to atomic | 
 | operations; almost all higher-level synchronization primitives also have | 
 | acquire or release semantics.  For example: | 
 |  | 
 | - ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has | 
 |   release semantics and synchronizes with a ``pthread_mutex_lock`` for the | 
 |   same mutex. | 
 |  | 
 | - ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics; | 
 |   ``pthread_cond_wait`` has both release semantics (synchronizing with | 
 |   ``pthread_mutex_lock``) and acquire semantics (synchronizing with | 
 |   ``pthread_mutex_unlock`` and signaling of the condition variable). | 
 |  | 
 | - ``pthread_create`` has release semantics and synchronizes with the start | 
 |   of the new thread; ``pthread_join`` has acquire semantics and synchronizes | 
 |   with the exiting of the thread. | 
 |  | 
 | - ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has | 
 |   acquire semantics. | 
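
For example, a ``QemuEvent`` can be used to publish a result computed by
one thread to another.  A minimal sketch, assuming ``ev`` was initialized
with ``qemu_event_init()`` and with hypothetical ``compute()`` and
``use()``::

    /* thread 1 */
    result = compute();
    qemu_event_set(&ev);      /* release: orders the store to "result" */

    /* thread 2 */
    qemu_event_wait(&ev);     /* acquire: pairs with qemu_event_set */
    use(result);              /* guaranteed to see thread 1's store */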
 |  | 
Likewise, in the following example there are no atomic accesses, but thread 2
still relies on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):
 |  | 
 |       +----------------------+-------------------------------+ | 
 |       | thread 1             | thread 2                      | | 
 |       +======================+===============================+ | 
 |       | ::                   | ::                            | | 
 |       |                      |                               | | 
 |       |   *a = 1;            |                               | | 
 |       |   pthread_exit(a);   |   pthread_join(thread1, &a);  | | 
 |       |                      |   x = *a;                     | | 
 |       +----------------------+-------------------------------+ | 
 |  | 
 | Synchronization between threads basically descends from this pairing of | 
 | a release operation and an acquire operation.  Therefore, atomic operations | 
 | other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be | 
 | paired with another operation of the opposite kind: an acquire operation | 
 | will pair with a release operation and vice versa.  This rule of thumb is | 
 | extremely useful; in the case of QEMU, however, note that the other | 
 | operation may actually be in a driver that runs in the guest! | 
 |  | 
 | ``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``, | 
 | ``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count | 
 | as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``, | 
 | ``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release | 
 | operations.  ``smp_mb()`` counts as both acquire and release, therefore | 
 | it can pair with any other atomic operation.  Here is an example: | 
 |  | 
 |       +----------------------+------------------------------+ | 
 |       | thread 1             | thread 2                     | | 
 |       +======================+==============================+ | 
 |       | ::                   | ::                           | | 
 |       |                      |                              | | 
 |       |   qatomic_set(&a, 1);|                              | | 
 |       |   smp_wmb();         |                              | | 
 |       |   qatomic_set(&b, 2);|   x = qatomic_read(&b);      | | 
 |       |                      |   smp_rmb();                 | | 
 |       |                      |   y = qatomic_read(&a);      | | 
 |       +----------------------+------------------------------+ | 
 |  | 
 | Note that a load-store pair only counts if the two operations access the | 
 | same variable: that is, a store-release on a variable ``x`` *synchronizes | 
 | with* a load-acquire on a variable ``x``, while a release barrier | 
 | synchronizes with any acquire operation.  The following example shows | 
 | correct synchronization: | 
 |  | 
 |       +--------------------------------+--------------------------------+ | 
 |       | thread 1                       | thread 2                       | | 
 |       +================================+================================+ | 
 |       | ::                             | ::                             | | 
 |       |                                |                                | | 
 |       |   qatomic_set(&a, 1);          |                                | | 
 |       |   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);| | 
 |       |                                |   y = qatomic_read(&a);        | | 
 |       +--------------------------------+--------------------------------+ | 
 |  | 
 | Acquire and release semantics of higher-level primitives can also be | 
 | relied upon for the purpose of establishing the *synchronizes with* | 
 | relation. | 
 |  | 
 | Note that the "writing" thread is accessing the variables in the | 
 | opposite order as the "reading" thread.  This is expected: stores | 
 | before a release operation will normally match the loads after | 
 | the acquire operation, and vice versa.  In fact, this happened already | 
 | in the ``pthread_exit``/``pthread_join`` example above. | 
 |  | 
 | Finally, this more complex example has more than two accesses and data | 
 | dependency barriers.  It also does not use atomic accesses whenever there | 
 | cannot be a data race: | 
 |  | 
 |       +----------------------+------------------------------+ | 
 |       | thread 1             | thread 2                     | | 
 |       +======================+==============================+ | 
 |       | ::                   | ::                           | | 
 |       |                      |                              | | 
 |       |   b[2] = 1;          |                              | | 
 |       |   smp_wmb();         |                              | | 
 |       |   x->i = 2;          |                              | | 
 |       |   smp_wmb();         |                              | | 
 |       |   qatomic_set(&a, x);|  x = qatomic_read(&a);       | | 
 |       |                      |  smp_read_barrier_depends(); | | 
 |       |                      |  y = x->i;                   | | 
 |       |                      |  smp_read_barrier_depends(); | | 
 |       |                      |  z = b[y];                   | | 
 |       +----------------------+------------------------------+ | 
 |  | 
 | Comparison with Linux kernel primitives | 
 | ======================================= | 
 |  | 
 | Here is a list of differences between Linux kernel atomic operations | 
 | and memory barriers, and the equivalents in QEMU: | 
 |  | 
 | - atomic operations in Linux are always on a 32-bit int type and | 
 |   use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic | 
 |   and use normal C types. | 
 |  | 
 | - Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee | 
 |   at all. Linux 4.1 updated them to implement volatile | 
  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).
 |  | 
 |   QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed | 
 |   semantics if the compiler supports it, and volatile semantics otherwise. | 
 |   Both semantics prevent the compiler from doing certain transformations; | 
 |   the difference is that atomic accesses are guaranteed to be atomic, | 
 |   while volatile accesses aren't. Thus, in the volatile case we just cross | 
 |   our fingers hoping that the compiler will generate atomic accesses, | 
 |   since we assume the variables passed are machine-word sized and | 
 |   properly aligned. | 
 |  | 
 |   No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either | 
 |   Linux or QEMU. | 
 |  | 
 | - atomic read-modify-write operations in Linux are of three kinds: | 
 |  | 
 |          ===================== ========================================= | 
 |          ``atomic_OP``         returns void | 
 |          ``atomic_OP_return``  returns new value of the variable | 
 |          ``atomic_fetch_OP``   returns the old value of the variable | 
 |          ``atomic_cmpxchg``    returns the old value of the variable | 
 |          ===================== ========================================= | 
 |  | 
  In QEMU, the second kind is named ``qatomic_OP_fetch``.
 |  | 
 | - different atomic read-modify-write operations in Linux imply | 
 |   a different set of memory barriers. In QEMU, all of them enforce | 
 |   sequential consistency: there is a single order in which the | 
 |   program sees them happen. | 
 |  | 
 | - however, according to the C11 memory model that QEMU uses, this order | 
 |   does not propagate to other memory accesses on either side of the | 
  read-modify-write operation.  As far as those are concerned, the
  operation consists of just a load-acquire followed by a store-release.
 |   Stores that precede the RMW operation, and loads that follow it, can | 
 |   still be reordered and will happen *in the middle* of the read-modify-write | 
 |   operation! | 
 |  | 
 |   Therefore, the following example is correct in Linux but not in QEMU: | 
 |  | 
 |       +----------------------------------+--------------------------------+ | 
 |       | Linux (correct)                  | QEMU (incorrect)               | | 
 |       +==================================+================================+ | 
 |       | ::                               | ::                             | | 
 |       |                                  |                                | | 
 |       |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);| | 
      |   b = READ_ONCE(y);              |   b = qatomic_read(&y);        |
 |       +----------------------------------+--------------------------------+ | 
 |  | 
 |   because the read of ``y`` can be moved (by either the processor or the | 
 |   compiler) before the write of ``x``. | 
 |  | 
 |   Fixing this requires a full memory barrier between the write of ``x`` and | 
 |   the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and | 
 |   ``smp_mb__after_rmw()``; they act both as an optimization, | 
 |   avoiding the memory barrier on processors where it is unnecessary, | 
 |   and as a clarification of this corner case of the C11 memory model: | 
 |  | 
 |       +--------------------------------+ | 
 |       | QEMU (correct)                 | | 
 |       +================================+ | 
 |       | ::                             | | 
 |       |                                | | 
 |       |   a = qatomic_fetch_add(&x, 2);| | 
 |       |   smp_mb__after_rmw();         | | 
 |       |   b = qatomic_read(&y);        | | 
 |       +--------------------------------+ | 
 |  | 
 |   In the common case where only one thread writes ``x``, it is also possible | 
 |   to write it like this: | 
 |  | 
 |       +--------------------------------+ | 
 |       | QEMU (correct)                 | | 
 |       +================================+ | 
 |       | ::                             | | 
 |       |                                | | 
 |       |   a = qatomic_read(&x);        | | 
 |       |   qatomic_set_mb(&x, a + 2);   | | 
 |       |   b = qatomic_read(&y);        | | 
 |       +--------------------------------+ | 
 |  | 
 | Sources | 
 | ======= | 
 |  | 
 | - ``Documentation/memory-barriers.txt`` from the Linux kernel |