Copyright (c) 2014-2017 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later.  See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop.  The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors and invokes a callback
when one of them becomes readable.

The default event loop is called the main loop (see main-loop.c).  It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
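
For example, a virtio-blk device can be assigned to an IOThread on the
command line (the IDs and image path below are illustrative):

  $ qemu-system-x86_64 \
        -object iothread,id=iothread0 \
        -drive file=disk.img,format=raw,if=none,id=drive0 \
        -device virtio-blk-pci,drive=drive0,iothread=iothread0 ...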

Side note: The main loop and IOThread are both event loops but their code is
not shared completely.  Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work.  The main loop is a
scalability bottleneck on hosts with many CPUs.  Work can be spread across
several IOThreads instead of just one main loop.  When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself.  vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code.  This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h).  Code that only works in the main loop
implicitly uses the main loop's AioContext.  Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_bh_new_guarded() - create a BH with a device re-entrancy guard
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop, they cannot be used in code that
runs in an IOThread.  They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_bh_new_guarded() - create a BH with a device re-entrancy guard
 * aio_poll() - run an event loop iteration
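
The following minimal sketch shows how these APIs fit together; the function
and callback names (my_setup, my_bh_cb, my_timer_cb) are hypothetical:

  #include "qemu/osdep.h"
  #include "block/aio.h"
  #include "qemu/timer.h"

  static void my_bh_cb(void *opaque)
  {
      /* Runs in the thread that polls ctx */
  }

  static void my_timer_cb(void *opaque)
  {
      /* Also runs in the thread that polls ctx */
  }

  void my_setup(AioContext *ctx)
  {
      QEMUBH *bh = aio_bh_new(ctx, my_bh_cb, NULL);
      QEMUTimer *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME,
                                       SCALE_MS, my_timer_cb, NULL);

      /* Invoke my_bh_cb() during the next event loop iteration */
      qemu_bh_schedule(bh);

      /* Invoke my_timer_cb() roughly 100 milliseconds from now */
      timer_mod(timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 100);
  }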

The qemu_bh_new_guarded/aio_bh_new_guarded APIs accept a "MemReentrancyGuard"
argument, which is used to check for and prevent re-entrancy problems.  For
BHs associated with devices, the re-entrancy guard is contained in the
corresponding DeviceState and named "mem_reentrancy_guard".
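
For instance, a device might create its BH as follows ("s" and
my_device_bh_cb are hypothetical; s is assumed to embed a DeviceState):

  s->bh = qemu_bh_new_guarded(my_device_bh_cb, s,
                              &DEVICE(s)->mem_reentrancy_guard);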

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
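
A sketch of selecting the context, assuming "iothread" may be NULL when no
IOThread is configured:

  AioContext *ctx = iothread ? iothread_get_aio_context(iothread)
                             : qemu_get_aio_context();

  my_setup(ctx); /* my_setup() from the earlier sketch works either way */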

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can always be called safely.  They handle their
own locking internally.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.

Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
Do not use nesting anymore; it is incompatible with the BDRV_POLL_WHILE() macro
used in the block layer and can lead to hangs.

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously.  Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.

Side note: the best way to schedule a function call across threads is to call
aio_bh_schedule_oneshot().  No acquire/release or locking is needed.
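
For example, any thread can hand work to the thread that polls ctx like this
(process_request and req are hypothetical):

  static void process_request(void *opaque)
  {
      /* Runs in the thread that polls ctx; no explicit locking needed */
  }

  aio_bh_schedule_oneshot(ctx, process_request, req);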

AioContext and the block layer
------------------------------
The AioContext originates from the QEMU block layer, even though nowadays
AioContext is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for AioContext integrated.  Each BlockDriverState
is associated with an AioContext using bdrv_try_change_aio_context() and
bdrv_get_aio_context().  This allows block layer code to process I/O inside the
right AioContext.  Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop.  See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState,
it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
that callbacks in the IOThread do not run in parallel.
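
A sketch of this pattern:

  AioContext *ctx = bdrv_get_aio_context(bs);

  aio_context_acquire(ctx);
  ... /* interact with bs while IOThread callbacks are excluded */
  aio_context_release(ctx);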

Code running in the monitor typically needs to ensure that past
requests from the guest are completed.  When a block device is running
in an IOThread, the IOThread can also process requests from the guest
(via ioeventfd).  To achieve both goals, wrap the code between
bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
section".  The functions must be called between aio_context_acquire()
and aio_context_release().  You can freely release and re-acquire the
AioContext within a drained section.
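
Putting the pieces together, a drained section under the AioContext lock
might look like this:

  AioContext *ctx = bdrv_get_aio_context(bs);

  aio_context_acquire(ctx);
  bdrv_drained_begin(bs);   /* quiesce request processing for bs */
  ... /* past guest requests have completed; operate on bs safely */
  bdrv_drained_end(bs);
  aio_context_release(ctx);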

Long-running jobs (usually in the form of coroutines) are best scheduled in
the BlockDriverState's AioContext to avoid the need to acquire/release around
each bdrv_*() call.  The functions bdrv_add/remove_aio_context_notifier,
or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
can be used to get a notification whenever bdrv_try_change_aio_context() moves a
BlockDriverState to a different AioContext.
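
A sketch of registering for such notifications (the callback names and "job"
are hypothetical):

  static void my_attached_aio_context(AioContext *new_ctx, void *opaque)
  {
      /* Reschedule the long-running job in new_ctx */
  }

  static void my_detach_aio_context(void *opaque)
  {
      /* Stop using the AioContext that is about to go away */
  }

  bdrv_add_aio_context_notifier(bs, my_attached_aio_context,
                                my_detach_aio_context, job);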