| Locked Counters (aka ``QemuLockCnt``) | 
 | ===================================== | 
 |  | 
 | QEMU often uses reference counts to track data structures that are being | 
accessed and should not be freed.  For example, a loop that invokes
 | callbacks like this is not safe:: | 
 |  | 
 |     QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { | 
 |         if (ioh->revents & G_IO_OUT) { | 
 |             ioh->fd_write(ioh->opaque); | 
 |         } | 
 |     } | 
 |  | 
 | ``QLIST_FOREACH_SAFE`` protects against deletion of the current node (``ioh``) | 
 | by stashing away its ``next`` pointer.  However, ``ioh->fd_write`` could | 
 | actually delete the next node from the list.  The simplest way to | 
 | avoid this is to mark the node as deleted, and remove it from the | 
 | list in the above loop:: | 
 |  | 
 |     QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { | 
 |         if (ioh->deleted) { | 
 |             QLIST_REMOVE(ioh, next); | 
 |             g_free(ioh); | 
 |         } else { | 
 |             if (ioh->revents & G_IO_OUT) { | 
 |                 ioh->fd_write(ioh->opaque); | 
 |             } | 
 |         } | 
 |     } | 
 |  | 
If, however, this loop must also be reentrant, i.e. it is possible that
``ioh->fd_write`` invokes the loop again, some kind of counting is needed::
 |  | 
 |     walking_handlers++; | 
 |     QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { | 
 |         if (ioh->deleted) { | 
 |             if (walking_handlers == 1) { | 
 |                 QLIST_REMOVE(ioh, next); | 
 |                 g_free(ioh); | 
 |             } | 
 |         } else { | 
 |             if (ioh->revents & G_IO_OUT) { | 
 |                 ioh->fd_write(ioh->opaque); | 
 |             } | 
 |         } | 
 |     } | 
 |     walking_handlers--; | 
 |  | 
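For concreteness, the snippets in this document assume a handler record
along these lines (a hypothetical sketch using the field names from the
examples; QEMU's actual definition differs)::

    typedef struct IOHandler {
        int revents;                     /* pending I/O events */
        void (*fd_write)(void *opaque);  /* callback to invoke */
        void *opaque;                    /* argument for the callback */
        bool deleted;                    /* marked for later removal */
        QLIST_ENTRY(IOHandler) next;     /* linkage in the io_handlers list */
    } IOHandler;
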
 | One may think of using the RCU primitives, ``rcu_read_lock()`` and | 
 | ``rcu_read_unlock()``; effectively, the RCU nesting count would take | 
the place of the ``walking_handlers`` global variable.  Indeed,
 | reference counting and RCU have similar purposes, but their usage in | 
 | general is complementary: | 
 |  | 
 | - reference counting is fine-grained and limited to a single data | 
 |   structure; RCU delays reclamation of *all* RCU-protected data | 
 |   structures; | 
 |  | 
 | - reference counting works even in the presence of code that keeps | 
 |   a reference for a long time; RCU critical sections in principle | 
 |   should be kept short; | 
 |  | 
 | - reference counting is often applied to code that is not thread-safe | 
 |   but is reentrant; in fact, usage of reference counting in QEMU predates | 
 |   the introduction of threads by many years.  RCU is generally used to | 
 |   protect readers from other threads freeing memory after concurrent | 
 |   modifications to a data structure. | 
 |  | 
 | - reclaiming data can be done by a separate thread in the case of RCU; | 
 |   this can improve performance, but also delay reclamation undesirably. | 
 |   With reference counting, reclamation is deterministic. | 
 |  | 
 | This file documents ``QemuLockCnt``, an abstraction for using reference | 
 | counting in code that has to be both thread-safe and reentrant. | 
 |  | 
 |  | 
 | ``QemuLockCnt`` concepts | 
 | ------------------------ | 
 |  | 
 | A ``QemuLockCnt`` comprises both a counter and a mutex; it has primitives | 
 | to increment and decrement the counter, and to take and release the | 
 | mutex.  The counter notes how many visits to the data structures are | 
 | taking place (the visits could be from different threads, or there could | 
 | be multiple reentrant visits from the same thread).  The basic rules | 
 | governing the counter/mutex pair then are the following: | 
 |  | 
- Data protected by the ``QemuLockCnt`` must not be freed unless the
 |   counter is zero and the mutex is taken. | 
 |  | 
 | - A new visit cannot be started while the counter is zero and the | 
 |   mutex is taken. | 
 |  | 
 | Most of the time, the mutex protects all writes to the data structure, | 
 | not just frees, though there could be cases where this is not necessary. | 
 |  | 
 | Reads, instead, can be done without taking the mutex, as long as the | 
 | readers and writers use the same macros that are used for RCU, for | 
 | example ``qatomic_rcu_read``, ``qatomic_rcu_set``, ``QLIST_FOREACH_RCU``, | 
etc.  This is because the reads are done outside a lock, and a set or
``QLIST_INSERT_HEAD`` can happen concurrently with the read.  The RCU API
ensures that the processor and the compiler see all required memory
barriers.
 |  | 
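For example, a writer that holds the mutex can publish a new handler with
the RCU-aware insertion macro while concurrent readers walk the list
(a sketch; ``io_handlers_mutex`` and ``new_ioh`` are hypothetical names)::

    qemu_mutex_lock(&io_handlers_mutex);     /* writers exclude each other */
    QLIST_INSERT_HEAD_RCU(&io_handlers, new_ioh, next);
    qemu_mutex_unlock(&io_handlers_mutex);
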
The counter/mutex pair itself could be implemented simply by protecting
the counter with the mutex, for example::
 |  | 
 |     // (1) | 
 |     qemu_mutex_lock(&walking_handlers_mutex); | 
 |     walking_handlers++; | 
 |     qemu_mutex_unlock(&walking_handlers_mutex); | 
 |  | 
 |     ... | 
 |  | 
 |     // (2) | 
 |     qemu_mutex_lock(&walking_handlers_mutex); | 
 |     if (--walking_handlers == 0) { | 
 |         QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { | 
 |             if (ioh->deleted) { | 
 |                 QLIST_REMOVE(ioh, next); | 
 |                 g_free(ioh); | 
 |             } | 
 |         } | 
 |     } | 
 |     qemu_mutex_unlock(&walking_handlers_mutex); | 
 |  | 
 | Here, no frees can happen in the code represented by the ellipsis. | 
 | If another thread is executing critical section (2), that part of | 
 | the code cannot be entered, because the thread will not be able | 
 | to increment the ``walking_handlers`` variable.  And of course | 
 | during the visit any other thread will see a nonzero value for | 
 | ``walking_handlers``, as in the single-threaded code. | 
 |  | 
 | Note that it is possible for multiple concurrent accesses to delay | 
 | the cleanup arbitrarily; in other words, for the ``walking_handlers`` | 
 | counter to never become zero.  For this reason, this technique is | 
 | more easily applicable if concurrent access to the structure is rare. | 
 |  | 
However, such critical sections are easy to forget, since one is needed
for every modification of the counter.  ``QemuLockCnt`` ensures that
 | all modifications of the counter take the lock appropriately, and it | 
 | can also be more efficient in two ways: | 
 |  | 
 | - it avoids taking the lock for many operations (for example | 
 |   incrementing the counter while it is non-zero); | 
 |  | 
- on some platforms, one can implement ``QemuLockCnt`` to hold the lock
  and the count in a single word, making the fast path no more expensive
  than simply managing a counter using atomic operations (see
  :doc:`atomics`).  This can be very helpful if concurrent access to
  the data structure is expected to be rare; a toy sketch of this idea
  is shown below.
 |  | 
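A minimal sketch of the single-word idea, using C11 atomics (an
illustration only, not QEMU's implementation, which also has to manage
threads waiting for the lock)::

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Bit 0 is the mutex; the remaining bits hold the counter.  */
    #define LOCK_BIT    1u
    #define COUNT_UNIT  2u

    typedef struct { _Atomic unsigned word; } ToyLockCnt;

    static bool toy_inc_fast(ToyLockCnt *lc)
    {
        unsigned old = atomic_load(&lc->word);
        /* If nobody holds the lock, increment the counter with a single
         * compare-and-swap; if that fails, the caller falls back to a
         * slow path that waits for the lock to be released (not shown).  */
        return !(old & LOCK_BIT) &&
               atomic_compare_exchange_strong(&lc->word, &old,
                                              old + COUNT_UNIT);
    }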
 |  | 
 | Using the same mutex for frees and writes can still incur some small | 
 | inefficiencies; for example, a visit can never start if the counter is | 
 | zero and the mutex is taken -- even if the mutex is taken by a write, | 
 | which in principle need not block a visit of the data structure. | 
 | However, these are usually not a problem if any of the following | 
 | assumptions are valid: | 
 |  | 
 | - concurrent access is possible but rare | 
 |  | 
 | - writes are rare | 
 |  | 
 | - writes are frequent, but this kind of write (e.g. appending to a | 
 |   list) has a very small critical section. | 
 |  | 
 | For example, QEMU uses ``QemuLockCnt`` to manage an ``AioContext``'s list of | 
 | bottom halves and file descriptor handlers.  Modifications to the list | 
 | of file descriptor handlers are rare.  Creation of a new bottom half is | 
 | frequent and can happen on a fast path; however: 1) it is almost never | 
 | concurrent with a visit to the list of bottom halves; 2) it only has | 
 | three instructions in the critical path, two assignments and a ``smp_wmb()``. | 
 |  | 
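The critical section in question looks roughly like this (a simplified
sketch modeled on bottom half creation; the field names are illustrative
and may not match current QEMU code)::

    qemu_lockcnt_lock(&ctx->list_lock);
    bh->next = ctx->first_bh;
    /* Make sure the bottom half is fully initialized before it becomes
     * visible to concurrent visits of the list.  */
    smp_wmb();
    ctx->first_bh = bh;
    qemu_lockcnt_unlock(&ctx->list_lock);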
 |  | 
 | ``QemuLockCnt`` API | 
 | ------------------- | 
 |  | 
 | .. kernel-doc:: include/qemu/lockcnt.h | 
 |  | 
 |  | 
 | ``QemuLockCnt`` usage | 
 | --------------------- | 
 |  | 
 | This section explains the typical usage patterns for ``QemuLockCnt`` functions. | 
 |  | 
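In the sketches below, ``xyz`` is a hypothetical pointer protected by
``xyz_lockcnt``; like any ``QemuLockCnt``, the lockcnt must be initialized
once before use::

    QemuLockCnt xyz_lockcnt;
    XYZ *xyz;

    /* Called once at setup time.  */
    qemu_lockcnt_init(&xyz_lockcnt);
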
 | Setting a variable to a non-NULL value can be done between | 
 | ``qemu_lockcnt_lock`` and ``qemu_lockcnt_unlock``:: | 
 |  | 
 |     qemu_lockcnt_lock(&xyz_lockcnt); | 
 |     if (!xyz) { | 
 |         new_xyz = g_new(XYZ, 1); | 
 |         ... | 
 |         qatomic_rcu_set(&xyz, new_xyz); | 
 |     } | 
 |     qemu_lockcnt_unlock(&xyz_lockcnt); | 
 |  | 
 | Accessing the value can be done between ``qemu_lockcnt_inc`` and | 
 | ``qemu_lockcnt_dec``:: | 
 |  | 
 |     qemu_lockcnt_inc(&xyz_lockcnt); | 
 |     if (xyz) { | 
 |         XYZ *p = qatomic_rcu_read(&xyz); | 
 |         ... | 
 |         /* Accesses can now be done through "p".  */ | 
 |     } | 
 |     qemu_lockcnt_dec(&xyz_lockcnt); | 
 |  | 
 | Freeing the object can similarly use ``qemu_lockcnt_lock`` and | 
 | ``qemu_lockcnt_unlock``, but you also need to ensure that the count | 
 | is zero (i.e. there is no concurrent visit).  Because ``qemu_lockcnt_inc`` | 
 | takes the ``QemuLockCnt``'s lock, the count cannot become non-zero while | 
 | the object is being freed.  Freeing an object looks like this:: | 
 |  | 
 |     qemu_lockcnt_lock(&xyz_lockcnt); | 
 |     if (!qemu_lockcnt_count(&xyz_lockcnt)) { | 
 |         g_free(xyz); | 
 |         xyz = NULL; | 
 |     } | 
 |     qemu_lockcnt_unlock(&xyz_lockcnt); | 
 |  | 
 | If an object has to be freed right after a visit, you can combine | 
 | the decrement, the locking and the check on count as follows:: | 
 |  | 
 |     qemu_lockcnt_inc(&xyz_lockcnt); | 
 |     if (xyz) { | 
 |         XYZ *p = qatomic_rcu_read(&xyz); | 
 |         ... | 
 |         /* Accesses can now be done through "p".  */ | 
 |     } | 
 |     if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) { | 
 |         g_free(xyz); | 
 |         xyz = NULL; | 
 |         qemu_lockcnt_unlock(&xyz_lockcnt); | 
 |     } | 
 |  | 
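If the decrement leaves the counter above zero, ``qemu_lockcnt_dec_and_lock``
returns false without taking the lock; some other visit is still in
progress, and the last visit to finish will perform the freeing.
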
 | ``QemuLockCnt`` can also be used to access a list as follows:: | 
 |  | 
 |     qemu_lockcnt_inc(&io_handlers_lockcnt); | 
    QLIST_FOREACH_RCU(ioh, &io_handlers, next) {
 |         if (ioh->revents & G_IO_OUT) { | 
 |             ioh->fd_write(ioh->opaque); | 
 |         } | 
 |     } | 
 |  | 
 |     if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) { | 
 |         QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) { | 
 |             if (ioh->deleted) { | 
 |                 QLIST_REMOVE(ioh, next); | 
 |                 g_free(ioh); | 
 |             } | 
 |         } | 
 |         qemu_lockcnt_unlock(&io_handlers_lockcnt); | 
 |     } | 
 |  | 
 | Again, the RCU primitives are used because new items can be added to the | 
 | list during the walk.  ``QLIST_FOREACH_RCU`` ensures that the processor and | 
 | the compiler see the appropriate memory barriers. | 
 |  | 
 | An alternative pattern uses ``qemu_lockcnt_dec_if_lock``:: | 
 |  | 
 |     qemu_lockcnt_inc(&io_handlers_lockcnt); | 
 |     QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) { | 
 |         if (ioh->deleted) { | 
 |             if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) { | 
 |                 QLIST_REMOVE(ioh, next); | 
 |                 g_free(ioh); | 
 |                 qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt); | 
 |             } | 
 |         } else { | 
 |             if (ioh->revents & G_IO_OUT) { | 
 |                 ioh->fd_write(ioh->opaque); | 
 |             } | 
 |         } | 
 |     } | 
 |     qemu_lockcnt_dec(&io_handlers_lockcnt); | 
 |  | 
Here, ``qemu_lockcnt_dec_if_lock`` succeeds -- decrementing the counter
and taking the lock -- only if this is the last visit in progress, which
is exactly the condition under which the node can be freed; otherwise the
counter is left unchanged and the deleted node is simply skipped.
``qemu_lockcnt_inc_and_unlock`` then re-establishes the visit and releases
the lock before the walk continues.  At the end of the walk, you can use
``qemu_lockcnt_dec`` instead of ``qemu_lockcnt_dec_and_lock``, because
there is no special task to do if the count goes from 1 to 0.