|  | Using Multiple ``IOThread``\ s | 
|  | ============================== | 
|  |  | 
|  | .. | 
|  |    Copyright (c) 2014-2017 Red Hat Inc. | 
|  |  | 
|  |    This work is licensed under the terms of the GNU GPL, version 2 or later.  See | 
|  |    the COPYING file in the top-level directory. | 
|  |  | 
|  |  | 
|  | This document explains the ``IOThread`` feature and how to write code that runs | 
|  | outside the BQL. | 
|  |  | 
|  | The main loop and ``IOThread``\ s | 
|  | --------------------------------- | 
|  | QEMU is an event-driven program that can do several things at once using an | 
|  | event loop.  The VNC server and the QMP monitor are both processed from the | 
|  | same event loop, which monitors their file descriptors until they become | 
|  | readable and then invokes a callback. | 
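
The monitor-then-dispatch pattern described above can be sketched with plain
POSIX ``poll()``.  This is a toy model for illustration only, not QEMU's main
loop code: every ``Toy*``/``toy_*`` name is hypothetical, and the fixed-size
table and read-only monitoring are simplifying assumptions.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define TOY_MAX_FDS 8

/* Callback invoked when a monitored file descriptor becomes readable. */
typedef void ToyHandler(int fd, void *opaque);

/* Hypothetical event loop: a table of descriptors and their callbacks. */
typedef struct ToyLoop {
    struct pollfd fds[TOY_MAX_FDS];
    ToyHandler *on_readable[TOY_MAX_FDS];
    void *opaque[TOY_MAX_FDS];
    nfds_t nfds;
} ToyLoop;

static void toy_loop_init(ToyLoop *loop)
{
    loop->nfds = 0;
}

/* Start monitoring fd; cb runs whenever fd is readable.  Returns 0 or -1. */
static int toy_loop_add_fd(ToyLoop *loop, int fd, ToyHandler *cb, void *opaque)
{
    assert(loop != NULL && cb != NULL);
    if (loop->nfds == TOY_MAX_FDS) {
        return -1;
    }
    loop->fds[loop->nfds].fd = fd;
    loop->fds[loop->nfds].events = POLLIN;
    loop->fds[loop->nfds].revents = 0;
    loop->on_readable[loop->nfds] = cb;
    loop->opaque[loop->nfds] = opaque;
    loop->nfds++;
    return 0;
}

/* Run one iteration.  Returns the number of callbacks invoked, or -1. */
static int toy_loop_run_once(ToyLoop *loop, int timeout_ms)
{
    int dispatched = 0;

    if (poll(loop->fds, loop->nfds, timeout_ms) < 0) {
        return -1;
    }
    for (nfds_t i = 0; i < loop->nfds; i++) {
        if (loop->fds[i].revents & POLLIN) {
            loop->on_readable[i](loop->fds[i].fd, loop->opaque[i]);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example callback: consume one byte and count the event. */
static void toy_count_event(int fd, void *opaque)
{
    char byte;

    if (read(fd, &byte, 1) == 1) {
        (*(int *)opaque)++;
    }
}
```

To try it, register the read end of a ``pipe()`` with ``toy_loop_add_fd()``,
write a byte to the other end, and call ``toy_loop_run_once()``.  A single
loop can serve several unrelated users this way, which is how one thread does
several things at once.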
|  |  | 
|  | The default event loop is called the main loop (see ``main-loop.c``).  It is | 
|  | possible to create additional event loop threads using | 
|  | ``-object iothread,id=my-iothread``. | 
|  |  | 
|  | Side note: The main loop and ``IOThread`` are both event loops but their code is | 
|  | not shared completely.  Although they are conceptually similar, they are | 
|  | currently not interchangeable. | 
|  |  | 
|  | Why ``IOThread``\ s are useful | 
|  | ------------------------------ | 
|  | ``IOThread``\ s allow the user to control the placement of work.  The main loop is a | 
|  | scalability bottleneck on hosts with many CPUs.  Work can be spread across | 
|  | several ``IOThread``\ s instead of just one main loop.  When set up correctly this | 
|  | can improve I/O latency and reduce jitter seen by the guest. | 
|  |  | 
|  | The main loop is also deeply associated with the BQL, which is a | 
|  | scalability bottleneck in itself.  vCPU threads and the main loop use the BQL | 
|  | to serialize execution of QEMU code.  This mutex is necessary because a lot of | 
|  | QEMU's code historically was not thread-safe. | 
|  |  | 
|  | Two facts explain why it is desirable to place work into ``IOThread``\ s: all | 
|  | I/O processing is done in a single main loop, and the BQL is contended by all | 
|  | vCPU threads and the main loop. | 
|  |  | 
|  | The experimental ``virtio-blk`` data-plane implementation has been benchmarked and | 
|  | shows these effects: | 
|  | ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf | 
|  |  | 
|  | .. _how-to-program: | 
|  |  | 
|  | How to program for ``IOThread``\ s | 
|  | ---------------------------------- | 
|  | The main difference between legacy code and new code that can run in an | 
|  | ``IOThread`` is dealing explicitly with the event loop object, ``AioContext`` | 
|  | (see ``include/block/aio.h``).  Code that only works in the main loop | 
|  | implicitly uses the main loop's ``AioContext``.  Code that supports running | 
|  | in ``IOThread``\ s must be aware of its ``AioContext``. | 
|  |  | 
|  | ``AioContext`` supports the following services: | 
|  |  | 
|  | * File descriptor monitoring (read/write/error on POSIX hosts) | 
|  | * Event notifiers (inter-thread signalling) | 
|  | * Timers | 
|  | * Bottom Halves (BHs): deferred callbacks | 
|  |  | 
|  | There are several old APIs that use the main loop ``AioContext``: | 
|  |  | 
|  | * LEGACY ``qemu_aio_set_fd_handler()`` - monitor a file descriptor | 
|  | * LEGACY ``qemu_aio_set_event_notifier()`` - monitor an event notifier | 
|  | * LEGACY ``timer_new_ms()`` - create a timer | 
|  | * LEGACY ``qemu_bh_new()`` - create a BH | 
|  | * LEGACY ``qemu_bh_new_guarded()`` - create a BH with a device re-entrancy guard | 
|  | * LEGACY ``qemu_aio_wait()`` - run an event loop iteration | 
|  |  | 
|  | Since they implicitly work on the main loop they cannot be used in code that | 
|  | runs in an ``IOThread``.  They might cause a crash or deadlock if called from an | 
|  | ``IOThread`` since the BQL is not held. | 
|  |  | 
|  | Instead, use the ``AioContext`` functions directly (see ``include/block/aio.h``): | 
|  |  | 
|  | * ``aio_set_fd_handler()`` - monitor a file descriptor | 
|  | * ``aio_set_event_notifier()`` - monitor an event notifier | 
|  | * ``aio_timer_new()`` - create a timer | 
|  | * ``aio_bh_new()`` - create a BH | 
|  | * ``aio_bh_new_guarded()`` - create a BH with a device re-entrancy guard | 
|  | * ``aio_poll()`` - run an event loop iteration | 
|  |  | 
|  | The ``qemu_bh_new_guarded()``/``aio_bh_new_guarded()`` APIs accept a | 
|  | ``MemReentrancyGuard`` argument, which is used to check for and prevent | 
|  | re-entrancy problems.  For BHs associated with devices, the re-entrancy guard | 
|  | is contained in the corresponding ``DeviceState`` and named | 
|  | ``mem_reentrancy_guard``. | 
|  |  | 
|  | The ``AioContext`` can be obtained from the ``IOThread`` using | 
|  | ``iothread_get_aio_context()`` or for the main loop using | 
|  | ``qemu_get_aio_context()``.  Code that takes an ``AioContext`` argument | 
|  | works in either an ``IOThread`` or the main loop, depending on which | 
|  | ``AioContext`` instance the caller passes in. | 
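
The benefit of passing the event loop explicitly can be shown with a small
single-threaded model of deferred callbacks.  This is a sketch under stated
assumptions and does not use the QEMU API: ``ToyContext``,
``toy_bh_schedule()``, ``toy_ctx_poll()`` and ``toy_device_kick()`` are
hypothetical names standing in for a context, BH scheduling, one loop
iteration, and a piece of device code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BH 4

typedef void ToyBHFunc(void *opaque);

/* Hypothetical event loop object holding deferred callbacks. */
typedef struct ToyContext {
    ToyBHFunc *cb[TOY_MAX_BH];
    void *opaque[TOY_MAX_BH];
    size_t count;
} ToyContext;

/* Defer cb(opaque) until the next iteration of ctx.  Returns 0 or -1. */
static int toy_bh_schedule(ToyContext *ctx, ToyBHFunc *cb, void *opaque)
{
    assert(ctx != NULL && cb != NULL);
    if (ctx->count == TOY_MAX_BH) {
        return -1;
    }
    ctx->cb[ctx->count] = cb;
    ctx->opaque[ctx->count] = opaque;
    ctx->count++;
    return 0;
}

/* Run the callbacks pending at entry; returns how many were run. */
static size_t toy_ctx_poll(ToyContext *ctx)
{
    ToyBHFunc *cb[TOY_MAX_BH];
    void *opaque[TOY_MAX_BH];
    size_t n = ctx->count;

    /* Snapshot first so that callbacks may schedule new work safely. */
    for (size_t i = 0; i < n; i++) {
        cb[i] = ctx->cb[i];
        opaque[i] = ctx->opaque[i];
    }
    ctx->count = 0;
    for (size_t i = 0; i < n; i++) {
        cb[i](opaque[i]);
    }
    return n;
}

static void toy_increment(void *opaque)
{
    (*(int *)opaque)++;
}

/*
 * Device code written against an explicit context: it never assumes
 * which event loop it runs in, the caller decides.
 */
static int toy_device_kick(ToyContext *ctx, int *counter)
{
    return toy_bh_schedule(ctx, toy_increment, counter);
}
```

With two ``ToyContext`` instances, one standing in for the main loop and one
for an ``IOThread``, ``toy_device_kick()`` queues its work only in the context
it was given, and the callback runs only when that context is polled.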
|  |  | 
|  | How to synchronize with an ``IOThread`` | 
|  | --------------------------------------- | 
|  | Variables that can be accessed by multiple threads require some form of | 
|  | synchronization such as ``qemu_mutex_lock()``, ``rcu_read_lock()``, etc. | 
|  |  | 
|  | ``AioContext`` functions like ``aio_set_fd_handler()``, | 
|  | ``aio_set_event_notifier()``, ``aio_bh_new()``, and ``aio_timer_new()`` | 
|  | are thread-safe. They can be used to trigger activity in an ``IOThread``. | 
|  |  | 
|  | Side note: the best way to schedule a function call across threads is to call | 
|  | ``aio_bh_schedule_oneshot()``. | 
|  |  | 
|  | The main loop thread can wait synchronously for a condition using | 
|  | ``AIO_WAIT_WHILE()``. | 
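
The two ideas above, scheduling a one-shot function call in another thread and
waiting for a condition it establishes, can be modelled with POSIX threads.
The following is only a sketch of the pattern and says nothing about how
``aio_bh_schedule_oneshot()`` or ``AIO_WAIT_WHILE()`` are implemented: all
``Toy*``/``toy_*`` names are hypothetical, the queue has a single slot, and the
waiter simply blocks on a condition variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef void ToyCallback(void *opaque);

/* Hypothetical stand-in for an event loop owned by one worker thread. */
typedef struct ToyWorker {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t wake;        /* signalled whenever the fields below change */
    ToyCallback *pending_cb;    /* single-slot queue keeps the sketch short */
    void *pending_opaque;
    bool stop;
} ToyWorker;

/* Worker thread body: run queued callbacks until asked to stop. */
static void *toy_worker_run(void *arg)
{
    ToyWorker *w = arg;

    pthread_mutex_lock(&w->lock);
    while (w->pending_cb || !w->stop) {
        if (!w->pending_cb) {
            pthread_cond_wait(&w->wake, &w->lock);
            continue;
        }
        ToyCallback *cb = w->pending_cb;
        void *opaque = w->pending_opaque;
        w->pending_cb = NULL;
        pthread_cond_broadcast(&w->wake);   /* the slot is free again */
        pthread_mutex_unlock(&w->lock);
        cb(opaque);                         /* runs in the worker thread */
        pthread_mutex_lock(&w->lock);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static int toy_worker_start(ToyWorker *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->pending_cb = NULL;
    w->pending_opaque = NULL;
    w->stop = false;
    return pthread_create(&w->thread, NULL, toy_worker_run, w);
}

/* Thread-safe: ask the worker to call cb(opaque) exactly once. */
static void toy_worker_schedule_oneshot(ToyWorker *w, ToyCallback *cb,
                                        void *opaque)
{
    assert(cb != NULL);
    pthread_mutex_lock(&w->lock);
    while (w->pending_cb) {                 /* wait for the slot to drain */
        pthread_cond_wait(&w->wake, &w->lock);
    }
    w->pending_cb = cb;
    w->pending_opaque = opaque;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Let the worker finish any pending callback, then join it. */
static void toy_worker_stop(ToyWorker *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
}

/* Condition shared between the two threads, protected by its own lock. */
typedef struct ToyCompletion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
    pthread_t ran_on;           /* valid once done is true */
} ToyCompletion;

static void toy_completion_init(ToyCompletion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Callback for the worker: record where it ran and mark completion. */
static void toy_completion_signal(void *opaque)
{
    ToyCompletion *c = opaque;

    pthread_mutex_lock(&c->lock);
    c->ran_on = pthread_self();
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Block the calling thread until the condition holds. */
static void toy_completion_wait(ToyCompletion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}
```

A caller starts the worker, passes ``toy_completion_signal`` to
``toy_worker_schedule_oneshot()``, and then calls ``toy_completion_wait()``.
The shared ``done`` flag is only ever touched under its mutex, which is the
kind of synchronization the first paragraph of this section asks for.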
|  |  | 
|  | ``AioContext`` and the block layer | 
|  | ---------------------------------- | 
|  | The ``AioContext`` originates from the QEMU block layer, even though nowadays | 
|  | ``AioContext`` is a generic event loop that can be used by any QEMU subsystem. | 
|  |  | 
|  | The block layer has support for ``AioContext`` integrated.  Each | 
|  | ``BlockDriverState`` is associated with an ``AioContext``, which is changed | 
|  | with ``bdrv_try_change_aio_context()`` and queried with | 
|  | ``bdrv_get_aio_context()``.  This allows block layer code to process I/O | 
|  | inside the right ``AioContext``.  Other subsystems may wish to follow a | 
|  | similar approach. | 
|  |  | 
|  | Block layer code must therefore expect to run in an ``IOThread`` and avoid using | 
|  | old APIs that implicitly use the main loop.  See | 
|  | `How to program for IOThreads`_ for information on how to do that. | 
|  |  | 
|  | Code running in the monitor typically needs to ensure that past | 
|  | requests from the guest are completed.  When a block device is running | 
|  | in an ``IOThread``, the ``IOThread`` can also process new requests from the | 
|  | guest (via ioeventfd) at the same time.  To both complete past requests and | 
|  | keep new ones from being processed, wrap the code between | 
|  | ``bdrv_drained_begin()`` and ``bdrv_drained_end()``, thus creating a "drained | 
|  | section". | 
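
The effect of a drained section can be illustrated with a single-threaded toy
model.  It is a sketch only and does not reflect the block layer's actual
implementation: ``ToyDisk`` and the ``toy_*`` functions are hypothetical, and
refusing new submissions outright is a simplifying assumption made here to
keep the example short.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical block device state. */
typedef struct ToyDisk {
    int in_flight;          /* requests submitted but not yet completed */
    int completed;          /* requests that have finished */
    int quiesce_counter;    /* greater than zero inside a drained section */
} ToyDisk;

/* Guest submits a request; refused while the device is drained. */
static bool toy_submit(ToyDisk *d)
{
    if (d->quiesce_counter > 0) {
        return false;
    }
    d->in_flight++;
    return true;
}

/* Finish one outstanding request, if there is one. */
static void toy_complete_one(ToyDisk *d)
{
    if (d->in_flight > 0) {
        d->in_flight--;
        d->completed++;
    }
}

/* Enter a drained section: stop new requests, then wait for old ones. */
static void toy_drained_begin(ToyDisk *d)
{
    d->quiesce_counter++;
    while (d->in_flight > 0) {
        toy_complete_one(d);
    }
}

/* Leave the drained section; sections may nest, hence the counter. */
static void toy_drained_end(ToyDisk *d)
{
    assert(d->quiesce_counter > 0);
    d->quiesce_counter--;
}
```

Between ``toy_drained_begin()`` and ``toy_drained_end()`` the caller can rely
on ``in_flight`` being zero and staying zero, which is the guarantee monitor
code is looking for.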
|  |  | 
|  | Long-running jobs (usually in the form of coroutines) are often scheduled in | 
|  | the ``BlockDriverState``'s ``AioContext``.  The functions | 
|  | ``bdrv_add_aio_context_notifier()`` and | 
|  | ``bdrv_remove_aio_context_notifier()``, or alternatively | 
|  | ``blk_add_aio_context_notifier()`` and ``blk_remove_aio_context_notifier()`` | 
|  | if you use ``BlockBackend``\ s, can be used to get a notification whenever | 
|  | ``bdrv_try_change_aio_context()`` moves a ``BlockDriverState`` to a different | 
|  | ``AioContext``. | 
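
The notifier pattern can be sketched as a pair of callbacks, one called before
the node leaves its old context and one called after it has joined the new
one.  This is a hedged toy model rather than the block layer API: the
``Toy*``/``toy_*`` names are hypothetical, and a node holds only a single
notifier here to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event loop identity. */
typedef struct ToyContext {
    const char *name;
} ToyContext;

typedef void ToyAttachedFn(ToyContext *new_ctx, void *opaque);
typedef void ToyDetachFn(void *opaque);

/* Hypothetical block node with one notifier slot. */
typedef struct ToyNode {
    ToyContext *ctx;
    ToyAttachedFn *attached;
    ToyDetachFn *detach;
    void *opaque;
} ToyNode;

static void toy_node_add_notifier(ToyNode *node, ToyAttachedFn *attached,
                                  ToyDetachFn *detach, void *opaque)
{
    assert(node->attached == NULL && node->detach == NULL);
    node->attached = attached;
    node->detach = detach;
    node->opaque = opaque;
}

static void toy_node_remove_notifier(ToyNode *node)
{
    node->attached = NULL;
    node->detach = NULL;
    node->opaque = NULL;
}

/* Move the node to new_ctx, telling the notifier before and after. */
static void toy_node_change_ctx(ToyNode *node, ToyContext *new_ctx)
{
    if (node->detach) {
        node->detach(node->opaque);     /* stop using the old context */
    }
    node->ctx = new_ctx;
    if (node->attached) {
        node->attached(new_ctx, node->opaque);
    }
}

/* Hypothetical long-running job that tracks where to schedule its work. */
typedef struct ToyJob {
    ToyContext *ctx;
    int moves;
} ToyJob;

static void toy_job_detach(void *opaque)
{
    ToyJob *job = opaque;

    job->ctx = NULL;                    /* nothing may be scheduled now */
}

static void toy_job_attached(ToyContext *new_ctx, void *opaque)
{
    ToyJob *job = opaque;

    job->ctx = new_ctx;                 /* resume in the new context */
    job->moves++;
}
```

After ``toy_node_add_notifier()`` registers the job's callbacks, every
``toy_node_change_ctx()`` call keeps ``job.ctx`` in step with the node.  Once
the notifier is removed, the job no longer follows the node.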