Copyright (c) 2014-2017 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later.  See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop.  The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors and invokes a callback
when one of them becomes readable.

The default event loop is called the main loop (see main-loop.c).  It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
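
For example, a virtio-blk device can be assigned to an IOThread on the
command line (the IDs and image path below are illustrative):

  $ qemu-system-x86_64 \
        -object iothread,id=iothread0 \
        -drive file=disk.img,format=raw,if=none,id=drive0 \
        -device virtio-blk-pci,drive=drive0,iothread=iothread0 ...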

Side note: The main loop and IOThread are both event loops but their code is
not shared completely.  Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work.  The main loop is a
scalability bottleneck on hosts with many CPUs.  Work can be spread across
several IOThreads instead of just one main loop.  When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself.  vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code.  This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h).  Code that only works in the main loop
implicitly uses the main loop's AioContext.  Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_bh_new_guarded() - create a BH with a device re-entrancy guard
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop, they cannot be used in code that
runs in an IOThread.  They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_bh_new_guarded() - create a BH with a device re-entrancy guard
 * aio_poll() - run an event loop iteration
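
The following minimal sketch shows how these APIs fit together; the function
and callback names (my_setup, my_bh_cb, my_timer_cb) are hypothetical:

  #include "qemu/osdep.h"
  #include "block/aio.h"
  #include "qemu/timer.h"

  static void my_bh_cb(void *opaque)
  {
      /* Runs in the thread that polls ctx */
  }

  static void my_timer_cb(void *opaque)
  {
      /* Also runs in the thread that polls ctx */
  }

  void my_setup(AioContext *ctx)
  {
      QEMUBH *bh = aio_bh_new(ctx, my_bh_cb, NULL);
      QEMUTimer *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME,
                                       SCALE_MS, my_timer_cb, NULL);

      /* Invoke my_bh_cb() during the next event loop iteration */
      qemu_bh_schedule(bh);

      /* Invoke my_timer_cb() roughly 100 milliseconds from now */
      timer_mod(timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 100);
  }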

The qemu_bh_new_guarded/aio_bh_new_guarded APIs accept a "MemReentrancyGuard"
argument, which is used to check for and prevent re-entrancy problems.  For
BHs associated with devices, the re-entrancy guard is contained in the
corresponding DeviceState and named "mem_reentrancy_guard".
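
For instance, a device might create its BH as follows ("s" and
my_device_bh_cb are hypothetical; s is assumed to embed a DeviceState):

  s->bh = qemu_bh_new_guarded(my_device_bh_cb, s,
                              &DEVICE(s)->mem_reentrancy_guard);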

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
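
A sketch of selecting the context, assuming "iothread" may be NULL when no
IOThread is configured:

  AioContext *ctx = iothread ? iothread_get_aio_context(iothread)
                             : qemu_get_aio_context();

  my_setup(ctx); /* my_setup() from the earlier sketch works either way */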

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can always be called safely.  They handle their
own locking internally.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.

Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
Do not use nesting anymore; it is incompatible with the BDRV_POLL_WHILE() macro
used in the block layer and can lead to hangs.

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously.  Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.

Side note: the best way to schedule a function call across threads is to call
aio_bh_schedule_oneshot().  No acquire/release or locking is needed.
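
For example, any thread can hand work to the thread that polls ctx like this
(process_request and req are hypothetical):

  static void process_request(void *opaque)
  {
      /* Runs in the thread that polls ctx; no explicit locking needed */
  }

  aio_bh_schedule_oneshot(ctx, process_request, req);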

AioContext and the block layer
------------------------------
The AioContext originates from the QEMU block layer, even though nowadays
AioContext is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for AioContext integrated.  Each BlockDriverState
is associated with an AioContext using bdrv_try_change_aio_context() and
bdrv_get_aio_context().  This allows block layer code to process I/O inside the
right AioContext.  Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop.  See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState,
it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
that callbacks in the IOThread do not run in parallel.
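
A sketch of this pattern:

  AioContext *ctx = bdrv_get_aio_context(bs);

  aio_context_acquire(ctx);
  ... /* interact with bs while IOThread callbacks are excluded */
  aio_context_release(ctx);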

Code running in the monitor typically needs to ensure that past
requests from the guest are completed.  When a block device is running
in an IOThread, the IOThread can also process requests from the guest
(via ioeventfd).  To achieve both goals, wrap the code between
bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
section".  The functions must be called between aio_context_acquire()
and aio_context_release().  You can freely release and re-acquire the
AioContext within a drained section.
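
Putting the pieces together, a drained section under the AioContext lock
might look like this:

  AioContext *ctx = bdrv_get_aio_context(bs);

  aio_context_acquire(ctx);
  bdrv_drained_begin(bs);   /* quiesce request processing for bs */
  ... /* past guest requests have completed; operate on bs safely */
  bdrv_drained_end(bs);
  aio_context_release(ctx);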

Long-running jobs (usually in the form of coroutines) are best scheduled in
the BlockDriverState's AioContext to avoid the need to acquire/release around
each bdrv_*() call.  The functions bdrv_add/remove_aio_context_notifier,
or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
can be used to get a notification whenever bdrv_try_change_aio_context() moves a
BlockDriverState to a different AioContext.
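
A sketch of registering for such notifications (the callback names and "job"
are hypothetical):

  static void my_attached_aio_context(AioContext *new_ctx, void *opaque)
  {
      /* Reschedule the long-running job in new_ctx */
  }

  static void my_detach_aio_context(void *opaque)
  {
      /* Stop using the AioContext that is about to go away */
  }

  bdrv_add_aio_context_notifier(bs, my_attached_aio_context,
                                my_detach_aio_context, job);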