Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 1 | Copyright (c) 2010-2015 Institute for System Programming |
| 2 | of the Russian Academy of Sciences. |
| 3 | |
| 4 | This work is licensed under the terms of the GNU GPL, version 2 or later. |
| 5 | See the COPYING file in the top-level directory. |
| 6 | |
| 7 | Record/replay |
| 8 | ------------- |
| 9 | |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 10 | Record/replay functions are used for the deterministic replay of qemu execution. |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 11 | Execution recording writes a non-deterministic events log, which can be later |
| 12 | used for replaying the execution anywhere and for unlimited number of times. |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 13 | It also supports checkpointing for faster rewind to the specific replay moment. |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 14 | Execution replaying reads the log and replays all non-deterministic events |
| 15 | including external input, hardware clocks, and interrupts. |
| 16 | |
| 17 | Deterministic replay has the following features: |
| 18 | * Deterministically replays whole system execution and all contents of |
| 19 | the memory, state of the hardware devices, clocks, and screen of the VM. |
| 20 | * Writes execution log into the file for later replaying for multiple times |
| 21 | on different machines. |
| 22 | * Supports i386, x86_64, and ARM hardware platforms. |
| 23 | * Performs deterministic replay of all operations with keyboard and mouse |
| 24 | input devices. |
| 25 | |
| 26 | Usage of the record/replay: |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 27 | * First, record the execution with the following command line: |
| 28 | qemu-system-i386 \ |
| 29 | -icount shift=7,rr=record,rrfile=replay.bin \ |
| 30 | -drive file=disk.qcow2,if=none,id=img-direct \ |
| 31 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ |
| 32 | -device ide-hd,drive=img-blkreplay \ |
| 33 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ |
| 34 | -object filter-replay,id=replay,netdev=net1 |
| 35 | * After recording, you can replay it by using another command line: |
| 36 | qemu-system-i386 \ |
| 37 | -icount shift=7,rr=replay,rrfile=replay.bin \ |
| 38 | -drive file=disk.qcow2,if=none,id=img-direct \ |
| 39 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ |
| 40 | -device ide-hd,drive=img-blkreplay \ |
| 41 | -netdev user,id=net1 -device rtl8139,netdev=net1 \ |
| 42 | -object filter-replay,id=replay,netdev=net1 |
| 43 | The only difference with recording is changing the rr option |
| 44 | from record to replay. |
| 45 | * Block device images are not actually changed in the recording mode, |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 46 | because all of the changes are written to the temporary overlay file. |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 47 | This behavior is enabled by using blkreplay driver. It should be used |
| 48 | for every enabled block device, as described in 'Block devices' section. |
| 49 | * '-net none' option should be specified when network is not used, |
| 50 | because QEMU adds network card by default. When network is needed, |
| 51 | it should be configured explicitly with replay filter, as described |
| 52 | in 'Network devices' section. |
| 53 | * Interaction with audio devices and serial ports are recorded and replayed |
| 54 | automatically when such devices are enabled. |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 55 | |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 56 | Academic papers with description of deterministic replay implementation: |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 57 | http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html |
| 58 | http://dl.acm.org/citation.cfm?id=2786805.2803179 |
| 59 | |
| 60 | Modifications of qemu include: |
| 61 | * wrappers for clock and time functions to save their return values in the log |
| 62 | * saving different asynchronous events (e.g. system shutdown) into the log |
| 63 | * synchronization of the bottom halves execution |
| 64 | * synchronization of the threads from thread pool |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 65 | * recording/replaying user input (mouse, keyboard, and microphone) |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 66 | * adding internal checkpoints for cpu and io synchronization |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 67 | * network filter for recording and replaying the packets |
| 68 | * block driver for making block layer deterministic |
| 69 | * serial port input record and replay |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 70 | |
Alex Bennée | d759c95 | 2018-02-27 12:52:48 +0300 | [diff] [blame] | 71 | Locking and thread synchronisation |
| 72 | ---------------------------------- |
| 73 | |
| 74 | Previously the synchronisation of the main thread and the vCPU thread |
| 75 | was ensured by the holding of the BQL. However the trend has been to |
| 76 | reduce the time the BQL was held across the system including under TCG |
| 77 | system emulation. As it is important that batches of events are kept |
| 78 | in sequence (e.g. expiring timers and checkpoints in the main thread |
| 79 | while instruction checkpoints are written by the vCPU thread) we need |
| 80 | another lock to keep things in lock-step. This role is now handled by |
| 81 | the replay_mutex_lock. It used to be held only for each event being |
| 82 | written but now it is held for a whole execution period. This results |
| 83 | in a deterministic ping-pong between the two main threads. |
| 84 | |
| 85 | As the BQL is now a finer grained lock than the replay_lock it is almost |
| 86 | certainly a bug, and a source of deadlocks, to take the |
| 87 | replay_mutex_lock while the BQL is held. This is enforced by an assert. |
| 88 | While the unlocks are usually in the reverse order, this is not |
| 89 | necessary; you can drop the replay_lock while holding the BQL, without |
| 90 | doing a more complicated unlock_iothread/replay_unlock/lock_iothread |
| 91 | sequence. |
| 92 | |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 93 | Non-deterministic events |
| 94 | ------------------------ |
| 95 | |
| 96 | Our record/replay system is based on saving and replaying non-deterministic |
| 97 | events (e.g. keyboard input) and simulating deterministic ones (e.g. reading |
| 98 | from HDD or memory of the VM). Saving only non-deterministic events makes |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 99 | log file smaller and simulation faster. |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 100 | |
| 101 | The following non-deterministic data from peripheral devices is saved into |
| 102 | the log: mouse and keyboard input, network packets, audio controller input, |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 103 | serial port input, and hardware clocks (they are non-deterministic |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 104 | too, because their values are taken from the host machine). Inputs from |
| 105 | simulated hardware, memory of VM, software interrupts, and execution of |
| 106 | instructions are not saved into the log, because they are deterministic and |
| 107 | can be replayed by simulating the behavior of virtual machine starting from |
| 108 | initial state. |
| 109 | |
| 110 | We had to solve three tasks to implement deterministic replay: recording |
| 111 | non-deterministic events, replaying non-deterministic events, and checking |
| 112 | that there is no divergence between record and replay modes. |
| 113 | |
| 114 | We changed several parts of QEMU to make event log recording and replaying. |
| 115 | Devices' models that have non-deterministic input from external devices were |
| 116 | changed to write every external event into the execution log immediately. |
| 117 | E.g. network packets are written into the log when they arrive into the virtual |
| 118 | network adapter. |
| 119 | |
| 120 | All non-deterministic events are coming from these devices. But to |
| 121 | replay them we need to know at which moments they occur. We specify |
| 122 | these moments by counting the number of instructions executed between |
| 123 | every pair of consecutive events. |
| 124 | |
| 125 | Instruction counting |
| 126 | -------------------- |
| 127 | |
| 128 | QEMU should work in icount mode to use record/replay feature. icount was |
| 129 | designed to allow deterministic execution in absence of external inputs |
| 130 | of the virtual machine. We also use icount to control the occurrence of the |
| 131 | non-deterministic events. The number of instructions elapsed from the last event |
| 132 | is written to the log while recording the execution. In replay mode we |
| 133 | can predict when to inject that event using the instruction counter. |
| 134 | |
| 135 | Timers |
| 136 | ------ |
| 137 | |
| 138 | Timers are used to execute callbacks from different subsystems of QEMU |
| 139 | at the specified moments of time. There are several kinds of timers: |
| 140 | * Real time clock. Based on host time and used only for callbacks that |
| 141 | do not change the virtual machine state. For this reason real time |
| 142 | clock and timers does not affect deterministic replay at all. |
| 143 | * Virtual clock. These timers run only during the emulation. In icount |
| 144 | mode virtual clock value is calculated using executed instructions counter. |
| 145 | That is why it is completely deterministic and does not have to be recorded. |
| 146 | * Host clock. This clock is used by device models that simulate real time |
| 147 | sources (e.g. real time clock chip). Host clock is the one of the sources |
| 148 | of non-determinism. Host clock read operations should be logged to |
| 149 | make the execution deterministic. |
Pavel Dovgalyuk | e76d179 | 2016-03-10 14:56:09 +0300 | [diff] [blame] | 150 | * Virtual real time clock. This clock is similar to real time clock but |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 151 | it is used only for increasing virtual clock while virtual machine is |
| 152 | sleeping. Due to its nature it is also non-deterministic as the host clock |
| 153 | and has to be logged too. |
| 154 | |
| 155 | Checkpoints |
| 156 | ----------- |
| 157 | |
| 158 | Replaying of the execution of virtual machine is bound by sources of |
| 159 | non-determinism. These are inputs from clock and peripheral devices, |
| 160 | and QEMU thread scheduling. Thread scheduling affect on processing events |
| 161 | from timers, asynchronous input-output, and bottom halves. |
| 162 | |
| 163 | Invocations of timers are coupled with clock reads and changing the state |
| 164 | of the virtual machine. Reads produce non-deterministic data taken from |
| 165 | host clock. And VM state changes should preserve their order. Their relative |
| 166 | order in replay mode must replicate the order of callbacks in record mode. |
| 167 | To preserve this order we use checkpoints. When a specific clock is processed |
| 168 | in record mode we save to the log special "checkpoint" event. |
| 169 | Checkpoints here do not refer to virtual machine snapshots. They are just |
| 170 | record/replay events used for synchronization. |
| 171 | |
| 172 | QEMU in replay mode will try to invoke timers processing in random moment |
| 173 | of time. That's why we do not process a group of timers until the checkpoint |
| 174 | event will be read from the log. Such an event allows synchronizing CPU |
| 175 | execution and timer events. |
| 176 | |
Pavel Dovgalyuk | e76d179 | 2016-03-10 14:56:09 +0300 | [diff] [blame] | 177 | Two other checkpoints govern the "warping" of the virtual clock. |
| 178 | While the virtual machine is idle, the virtual clock increments at |
| 179 | 1 ns per *real time* nanosecond. This is done by setting up a timer |
| 180 | (called the warp timer) on the virtual real time clock, so that the |
| 181 | timer fires at the next deadline of the virtual clock; the virtual clock |
| 182 | is then incremented (which is called "warping" the virtual clock) as |
| 183 | soon as the timer fires or the CPUs need to go out of the idle state. |
| 184 | Two functions are used for this purpose; because these actions change |
| 185 | virtual machine state and must be deterministic, each of them creates a |
| 186 | checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so |
| 187 | starts accounting real time to virtual clock. qemu_account_warp_timer |
| 188 | is called when the CPUs get an interrupt or when the warp timer fires, |
| 189 | and it warps the virtual clock by the amount of real time that has passed |
| 190 | since qemu_start_warp_timer. |
Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 191 | |
| 192 | Bottom halves |
| 193 | ------------- |
| 194 | |
| 195 | Disk I/O events are completely deterministic in our model, because |
| 196 | in both record and replay modes we start virtual machine from the same |
| 197 | disk state. But callbacks that virtual disk controller uses for reading and |
| 198 | writing the disk may occur at different moments of time in record and replay |
| 199 | modes. |
| 200 | |
| 201 | Reading and writing requests are created by CPU thread of QEMU. Later these |
| 202 | requests proceed to block layer which creates "bottom halves". Bottom |
| 203 | halves consist of callback and its parameters. They are processed when |
| 204 | main loop locks the global mutex. These locks are not synchronized with |
| 205 | replaying process because main loop also processes the events that do not |
| 206 | affect the virtual machine state (like user interaction with monitor). |
| 207 | |
| 208 | That is why we had to implement saving and replaying bottom halves callbacks |
| 209 | synchronously to the CPU execution. When the callback is about to execute |
| 210 | it is added to the queue in the replay module. This queue is written to the |
| 211 | log when its callbacks are executed. In replay mode callbacks are not processed |
| 212 | until the corresponding event is read from the events log file. |
| 213 | |
| 214 | Sometimes the block layer uses asynchronous callbacks for its internal purposes |
| 215 | (like reading or writing VM snapshots or disk image cluster tables). In this |
| 216 | case bottom halves are not marked as "replayable" and do not saved |
| 217 | into the log. |
Pavel Dovgalyuk | 6378567 | 2016-03-14 10:45:10 +0300 | [diff] [blame] | 218 | |
| 219 | Block devices |
| 220 | ------------- |
| 221 | |
| 222 | Block devices record/replay module intercepts calls of |
| 223 | bdrv coroutine functions at the top of block drivers stack. |
| 224 | To record and replay block operations the drive must be configured |
| 225 | as following: |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 226 | -drive file=disk.qcow2,if=none,id=img-direct |
Pavel Dovgalyuk | 6378567 | 2016-03-14 10:45:10 +0300 | [diff] [blame] | 227 | -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay |
| 228 | -device ide-hd,drive=img-blkreplay |
| 229 | |
| 230 | blkreplay driver should be inserted between disk image and virtual driver |
| 231 | controller. Therefore all disk requests may be recorded and replayed. |
| 232 | |
| 233 | All block completion operations are added to the queue in the coroutines. |
| 234 | Queue is flushed at checkpoints and information about processed requests |
| 235 | is recorded to the log. In replay phase the queue is matched with |
| 236 | events read from the log. Therefore block devices requests are processed |
| 237 | deterministically. |
Pavel Dovgalyuk | 646c547 | 2016-09-26 11:08:21 +0300 | [diff] [blame] | 238 | |
Pavel Dovgalyuk | 9c2037d | 2017-01-24 10:17:47 +0300 | [diff] [blame] | 239 | Snapshotting |
| 240 | ------------ |
| 241 | |
| 242 | New VM snapshots may be created in replay mode. They can be used later |
| 243 | to recover the desired VM state. All VM states created in replay mode |
| 244 | are associated with the moment of time in the replay scenario. |
| 245 | After recovering the VM state replay will start from that position. |
| 246 | |
| 247 | Default starting snapshot name may be specified with icount field |
| 248 | rrsnapshot as follows: |
| 249 | -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name |
| 250 | |
| 251 | This snapshot is created at start of recording and restored at start |
| 252 | of replaying. It also can be loaded while replaying to roll back |
| 253 | the execution. |
| 254 | |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 255 | Use QEMU monitor to create additional snapshots. 'savevm <name>' command |
| 256 | created the snapshot and 'loadvm <name>' restores it. To prevent corruption |
| 257 | of the original disk image, use overlay files linked to the original images. |
| 258 | Therefore all new snapshots (including the starting one) will be saved in |
| 259 | overlays and the original image remains unchanged. |
| 260 | |
Pavel Dovgalyuk | 646c547 | 2016-09-26 11:08:21 +0300 | [diff] [blame] | 261 | Network devices |
| 262 | --------------- |
| 263 | |
| 264 | Record and replay for network interactions is performed with the network filter. |
| 265 | Each backend must have its own instance of the replay filter as follows: |
| 266 | -netdev user,id=net1 -device rtl8139,netdev=net1 |
| 267 | -object filter-replay,id=replay,netdev=net1 |
| 268 | |
| 269 | Replay network filter is used to record and replay network packets. While |
| 270 | recording the virtual machine this filter puts all packets coming from |
| 271 | the outer world into the log. In replay mode packets from the log are |
| 272 | injected into the network device. All interactions with network backend |
| 273 | in replay mode are disabled. |
Pavel Dovgalyuk | 3d4d16f | 2017-02-02 08:50:54 +0300 | [diff] [blame] | 274 | |
| 275 | Audio devices |
| 276 | ------------- |
| 277 | |
| 278 | Audio data is recorded and replay automatically. The command line for recording |
| 279 | and replaying must contain identical specifications of audio hardware, e.g.: |
| 280 | -soundhw ac97 |
Pavel Dovgalyuk | bb040e0 | 2018-02-27 12:52:20 +0300 | [diff] [blame] | 281 | |
Pavel Dovgalyuk | 7273db9 | 2018-02-27 12:53:33 +0300 | [diff] [blame] | 282 | Serial ports |
| 283 | ------------ |
| 284 | |
| 285 | Serial ports input is recorded and replay automatically. The command lines |
| 286 | for recording and replaying must contain identical number of ports in record |
| 287 | and replay modes, but their backends may differ. |
| 288 | E.g., '-serial stdio' in record mode, and '-serial null' in replay mode. |
| 289 | |
Pavel Dovgalyuk | bb040e0 | 2018-02-27 12:52:20 +0300 | [diff] [blame] | 290 | Replay log format |
| 291 | ----------------- |
| 292 | |
| 293 | Record/replay log consits of the header and the sequence of execution |
| 294 | events. The header includes 4-byte replay version id and 8-byte reserved |
| 295 | field. Version is updated every time replay log format changes to prevent |
| 296 | using replay log created by another build of qemu. |
| 297 | |
| 298 | The sequence of the events describes virtual machine state changes. |
| 299 | It includes all non-deterministic inputs of VM, synchronization marks and |
| 300 | instruction counts used to correctly inject inputs at replay. |
| 301 | |
| 302 | Synchronization marks (checkpoints) are used for synchronizing qemu threads |
| 303 | that perform operations with virtual hardware. These operations may change |
| 304 | system's state (e.g., change some register or generate interrupt) and |
| 305 | therefore should execute synchronously with CPU thread. |
| 306 | |
| 307 | Every event in the log includes 1-byte event id and optional arguments. |
| 308 | When argument is an array, it is stored as 4-byte array length |
| 309 | and corresponding number of bytes with data. |
| 310 | Here is the list of events that are written into the log: |
| 311 | |
| 312 | - EVENT_INSTRUCTION. Instructions executed since last event. |
| 313 | Argument: 4-byte number of executed instructions. |
| 314 | - EVENT_INTERRUPT. Used to synchronize interrupt processing. |
| 315 | - EVENT_EXCEPTION. Used to synchronize exception handling. |
| 316 | - EVENT_ASYNC. This is a group of events. They are always processed |
| 317 | together with checkpoints. When such an event is generated, it is |
| 318 | stored in the queue and processed only when checkpoint occurs. |
| 319 | Every such event is followed by 1-byte checkpoint id and 1-byte |
| 320 | async event id from the following list: |
| 321 | - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes |
| 322 | callbacks that affect virtual machine state, but normally called |
Stefan Weil | 963e64a | 2018-07-13 14:17:27 +0200 | [diff] [blame] | 323 | asynchronously. |
Pavel Dovgalyuk | bb040e0 | 2018-02-27 12:52:20 +0300 | [diff] [blame] | 324 | Argument: 8-byte operation id. |
| 325 | - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains |
| 326 | parameters of keyboard and mouse input operations |
| 327 | (key press/release, mouse pointer movement). |
| 328 | Arguments: 9-16 bytes depending of input event. |
| 329 | - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. |
| 330 | - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input |
| 331 | initiated by the sender. |
| 332 | Arguments: 1-byte character device id. |
| 333 | Array with bytes were read. |
| 334 | - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize |
| 335 | operations with disk and flash drives with CPU. |
| 336 | Argument: 8-byte operation id. |
| 337 | - REPLAY_ASYNC_EVENT_NET. Incoming network packet. |
| 338 | Arguments: 1-byte network adapter id. |
| 339 | 4-byte packet flags. |
| 340 | Array with packet bytes. |
| 341 | - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, |
| 342 | e.g., by closing the window. |
| 343 | - EVENT_CHAR_WRITE. Used to synchronize character output operations. |
| 344 | Arguments: 4-byte output function return value. |
| 345 | 4-byte offset in the output array. |
| 346 | - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, |
| 347 | initiated by qemu. |
| 348 | Argument: Array with bytes that were read. |
| 349 | - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, |
| 350 | initiated by qemu. |
| 351 | Argument: 4-byte error code. |
| 352 | - EVENT_CLOCK + clock_id. Group of events for host clock read operations. |
| 353 | Argument: 8-byte clock value. |
| 354 | - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of |
| 355 | CPU, internal threads, and asynchronous input events. May be followed |
| 356 | by one or more EVENT_ASYNC events. |
| 357 | - EVENT_END. Last event in the log. |