Copyright (c) 2014-2017 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the Big QEMU Lock (BQL).

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop. The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.
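
The pattern described above can be sketched with poll(2). This is a minimal
illustration, not QEMU code: ToyLoop, toy_loop_add_fd(), toy_loop_run_once()
and toy_count_byte() are hypothetical names invented for this example, and the
real main loop is considerably more involved.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define TOY_MAX_HANDLERS 8

typedef void ToyFdHandler(int fd, void *opaque);

/* Hypothetical event loop: a set of file descriptors plus their callbacks. */
typedef struct {
    struct pollfd fds[TOY_MAX_HANDLERS];
    ToyFdHandler *on_readable[TOY_MAX_HANDLERS];
    void *opaque[TOY_MAX_HANDLERS];
    nfds_t count;
} ToyLoop;

/* Register cb to be invoked whenever fd becomes readable. */
int toy_loop_add_fd(ToyLoop *loop, int fd, ToyFdHandler *cb, void *opaque)
{
    assert(loop != NULL && cb != NULL);
    if (loop->count == TOY_MAX_HANDLERS) {
        return -1;
    }
    loop->fds[loop->count].fd = fd;
    loop->fds[loop->count].events = POLLIN;
    loop->fds[loop->count].revents = 0;
    loop->on_readable[loop->count] = cb;
    loop->opaque[loop->count] = opaque;
    loop->count++;
    return 0;
}

/*
 * Run one iteration: wait up to timeout_ms for activity, then invoke the
 * callback of every readable descriptor. Returns the number of callbacks
 * invoked, or -1 if poll() failed.
 */
int toy_loop_run_once(ToyLoop *loop, int timeout_ms)
{
    int dispatched = 0;

    if (poll(loop->fds, loop->count, timeout_ms) < 0) {
        return -1;
    }
    for (nfds_t i = 0; i < loop->count; i++) {
        if (loop->fds[i].revents & POLLIN) {
            loop->on_readable[i](loop->fds[i].fd, loop->opaque[i]);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example callback: consume one byte and count it in *opaque. */
void toy_count_byte(int fd, void *opaque)
{
    char byte;

    if (read(fd, &byte, 1) == 1) {
        (*(int *)opaque)++;
    }
}
```

Several unrelated services can register descriptors with the same ToyLoop and
are then served in turn by repeated calls to toy_loop_run_once().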

The default event loop is called the main loop (see main-loop.c). It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.

Side note: The main loop and IOThread are both event loops but their code is
not shared completely. Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work. The main loop is a
scalability bottleneck on hosts with many CPUs. Work can be spread across
several IOThreads instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the BQL, which is a
scalability bottleneck in itself. vCPU threads and the main loop use the BQL
to serialize execution of QEMU code. This mutex is necessary because a lot of
QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
BQL is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h). Code that only works in the main loop
implicitly uses the main loop's AioContext. Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_bh_new_guarded() - create a BH with a device re-entrancy guard
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread. They might cause a crash or deadlock if called from an
IOThread since the BQL is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_bh_new_guarded() - create a BH with a device re-entrancy guard
 * aio_poll() - run an event loop iteration

The qemu_bh_new_guarded/aio_bh_new_guarded APIs accept a "MemReentrancyGuard"
argument, which is used to check for and prevent re-entrancy problems. For
BHs associated with devices, the reentrancy-guard is contained in the
corresponding DeviceState and named "mem_reentrancy_guard".

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
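
The difference between implicit and explicit use of an event loop can be
shown with a small self-contained model. It is only a sketch and does not use
the QEMU API: ToyContext, toy_main_context, toy_defer(), toy_defer_legacy(),
toy_context_dispatch() and toy_count_call() are hypothetical names standing in
for an AioContext and its deferred callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

typedef void ToyCallback(void *opaque);

/* Hypothetical stand-in for an event loop object such as AioContext. */
typedef struct {
    const char *name;
    ToyCallback *cb[TOY_MAX_PENDING];
    void *opaque[TOY_MAX_PENDING];
    size_t pending;
} ToyContext;

/* The single context that legacy-style code reaches implicitly. */
ToyContext toy_main_context = { .name = "main-loop" };

/* New style: the caller states which event loop should run the callback. */
bool toy_defer(ToyContext *ctx, ToyCallback *cb, void *opaque)
{
    assert(ctx != NULL && cb != NULL);
    if (ctx->pending == TOY_MAX_PENDING) {
        return false;
    }
    ctx->cb[ctx->pending] = cb;
    ctx->opaque[ctx->pending] = opaque;
    ctx->pending++;
    return true;
}

/* Legacy style: hard-wired to the main loop, so no other loop can be used. */
bool toy_defer_legacy(ToyCallback *cb, void *opaque)
{
    return toy_defer(&toy_main_context, cb, opaque);
}

/* Run every callback currently queued in ctx; returns how many ran. */
size_t toy_context_dispatch(ToyContext *ctx)
{
    ToyContext snapshot = *ctx; /* copy so callbacks may queue new work */

    ctx->pending = 0;
    for (size_t i = 0; i < snapshot.pending; i++) {
        snapshot.cb[i](snapshot.opaque[i]);
    }
    return snapshot.pending;
}

/* Example callback: count invocations in *opaque. */
void toy_count_call(void *opaque)
{
    (*(int *)opaque)++;
}
```

Because toy_defer() takes the context as an argument, the same function serves
the main loop and any additional loop, whereas toy_defer_legacy() can only
ever queue work for the main loop.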

How to synchronize with an IOThread
-----------------------------------
Variables that can be accessed by multiple threads require some form of
synchronization such as qemu_mutex_lock(), rcu_read_lock(), etc.

AioContext functions like aio_set_fd_handler(), aio_set_event_notifier(),
aio_bh_new(), and aio_timer_new() are thread-safe. They can be used to trigger
activity in an IOThread.

Side note: the best way to schedule a function call across threads is to call
aio_bh_schedule_oneshot().
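
The idea of handing a one-shot function call to another thread's event loop
can be modelled as a queue that any thread may append to and that only the
loop thread drains. The sketch below is an assumption-laden illustration, not
how QEMU implements BHs: OneshotQueue, oneshot_queue_init(),
oneshot_schedule(), oneshot_run_pending() and oneshot_set_flag() are
hypothetical names, a plain pthread mutex protects the list, and the wakeup of
a sleeping loop thread is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef void OneshotFunc(void *opaque);

typedef struct Oneshot {
    OneshotFunc *cb;
    void *opaque;
    struct Oneshot *next;
} Oneshot;

/* Hypothetical per-event-loop queue; any thread may append to it. */
typedef struct {
    pthread_mutex_t lock;
    Oneshot *head;
    Oneshot *tail;
} OneshotQueue;

void oneshot_queue_init(OneshotQueue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = q->tail = NULL;
}

/* Callable from any thread: queue cb(opaque) to run once in the loop thread. */
int oneshot_schedule(OneshotQueue *q, OneshotFunc *cb, void *opaque)
{
    Oneshot *item;

    assert(q != NULL && cb != NULL);
    item = malloc(sizeof(*item));
    if (!item) {
        return -1;
    }
    item->cb = cb;
    item->opaque = opaque;
    item->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail) {
        q->tail->next = item;
    } else {
        q->head = item;
    }
    q->tail = item;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/*
 * Called only by the thread that owns the event loop. The whole list is
 * detached under the lock and the callbacks then run without holding it, so
 * a callback may schedule further work. Returns the number of callbacks run.
 */
int oneshot_run_pending(OneshotQueue *q)
{
    Oneshot *item;
    int ran = 0;

    pthread_mutex_lock(&q->lock);
    item = q->head;
    q->head = q->tail = NULL;
    pthread_mutex_unlock(&q->lock);

    while (item) {
        Oneshot *next = item->next;

        item->cb(item->opaque);
        free(item); /* one-shot: the entry is released after it has run */
        item = next;
        ran++;
    }
    return ran;
}

/* Example callback: set the int pointed to by opaque to 1. */
void oneshot_set_flag(void *opaque)
{
    *(int *)opaque = 1;
}
```

The callback runs in whichever thread calls oneshot_run_pending(), so data
that only that thread touches needs no further locking inside the callback.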

The main loop thread can wait synchronously for a condition using
AIO_WAIT_WHILE().

AioContext and the block layer
------------------------------
The AioContext originates from the QEMU block layer, even though nowadays
AioContext is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for AioContext integrated. Each BlockDriverState
is associated with an AioContext using bdrv_try_change_aio_context() and
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
right AioContext. Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop. See the "How to program for
IOThreads" section above for information on how to do that.

Code running in the monitor typically needs to ensure that past
requests from the guest are completed. When a block device is running
in an IOThread, the IOThread can also process requests from the guest
(via ioeventfd). To achieve both objectives, wrap the code between
bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
section".
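
The effect of a drained section can be illustrated with a toy model. This is
a sketch under stated assumptions and not the block layer's implementation:
ToyBlockDev and the toy_*() functions are hypothetical, requests arriving
during the section are simply counted and held back, and nesting of sections
is modelled with a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model; not a QEMU structure. */
typedef struct {
    int quiesce_count; /* > 0 while inside at least one drained section */
    int in_flight;     /* requests submitted but not yet completed */
    int deferred;      /* guest requests held back while drained */
} ToyBlockDev;

/* Guest-side submission: accepted only when no drained section is active. */
bool toy_submit(ToyBlockDev *dev)
{
    if (dev->quiesce_count > 0) {
        dev->deferred++;
        return false;
    }
    dev->in_flight++;
    return true;
}

/* Stand-in for one event loop iteration that completes one request. */
void toy_complete_one(ToyBlockDev *dev)
{
    assert(dev->in_flight > 0);
    dev->in_flight--;
}

/* Enter a drained section: stop new requests, then wait for the old ones. */
void toy_drained_begin(ToyBlockDev *dev)
{
    dev->quiesce_count++;
    while (dev->in_flight > 0) {
        toy_complete_one(dev);
    }
}

/* Leave a drained section; held-back requests start when the last one ends. */
void toy_drained_end(ToyBlockDev *dev)
{
    assert(dev->quiesce_count > 0);
    if (--dev->quiesce_count == 0) {
        dev->in_flight += dev->deferred;
        dev->deferred = 0;
    }
}
```

Between toy_drained_begin() and toy_drained_end() the caller can rely on
in_flight being zero, which is the property monitor code needs before it
modifies the device.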

Long-running jobs (usually in the form of coroutines) are often scheduled in
the BlockDriverState's AioContext. The functions
bdrv_add/remove_aio_context_notifier, or alternatively
blk_add/remove_aio_context_notifier if you use BlockBackends, can be used to
get a notification whenever bdrv_try_change_aio_context() moves a
BlockDriverState to a different AioContext.
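
The notifier mechanism can be pictured with the following self-contained
model. It is a hedged sketch rather than the block layer API: ToyLoopCtx,
ToyNode, ToyJob and the toy_*() functions are hypothetical, and the ordering
shown (detach callbacks before the move, attach callbacks after it) is an
assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NOTIFIERS 4

/* Hypothetical stand-in for an event loop context. */
typedef struct {
    const char *name;
} ToyLoopCtx;

typedef struct {
    void (*attached)(ToyLoopCtx *new_ctx, void *opaque);
    void (*detach)(void *opaque);
    void *opaque;
} ToyCtxNotifier;

/* Hypothetical stand-in for an object that lives in one context at a time. */
typedef struct {
    ToyLoopCtx *ctx;
    ToyCtxNotifier notifiers[TOY_MAX_NOTIFIERS];
    size_t n_notifiers;
} ToyNode;

/* A long-running job that must follow the node between contexts. */
typedef struct {
    ToyLoopCtx *ctx; /* where the job currently schedules its work */
    int moves;       /* how many times the job has been re-homed */
} ToyJob;

int toy_add_ctx_notifier(ToyNode *node,
                         void (*attached)(ToyLoopCtx *new_ctx, void *opaque),
                         void (*detach)(void *opaque), void *opaque)
{
    assert(node != NULL && attached != NULL && detach != NULL);
    if (node->n_notifiers == TOY_MAX_NOTIFIERS) {
        return -1;
    }
    node->notifiers[node->n_notifiers].attached = attached;
    node->notifiers[node->n_notifiers].detach = detach;
    node->notifiers[node->n_notifiers].opaque = opaque;
    node->n_notifiers++;
    return 0;
}

/* Move the node: tell users to let go of the old context, then attach. */
void toy_change_ctx(ToyNode *node, ToyLoopCtx *new_ctx)
{
    for (size_t i = 0; i < node->n_notifiers; i++) {
        node->notifiers[i].detach(node->notifiers[i].opaque);
    }
    node->ctx = new_ctx;
    for (size_t i = 0; i < node->n_notifiers; i++) {
        node->notifiers[i].attached(new_ctx, node->notifiers[i].opaque);
    }
}

/* Job callbacks: stop using the old context, then adopt the new one. */
void toy_job_detach(void *opaque)
{
    ((ToyJob *)opaque)->ctx = NULL;
}

void toy_job_attached(ToyLoopCtx *new_ctx, void *opaque)
{
    ToyJob *job = opaque;

    job->ctx = new_ctx;
    job->moves++;
}
```

A job registered this way never keeps scheduling work in a context its node
has left, because it is told about every move before and after it happens.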