| .. |
| Copyright (c) 2022, ISP RAS |
| Written by Pavel Dovgalyuk and Alex Bennée |
| |
| ======================= |
| Execution Record/Replay |
| ======================= |
| |
| Core concepts |
| ============= |
| |
| Record/replay functions are used for the deterministic replay of qemu |
| execution. Execution recording writes a non-deterministic events log, which |
| can be later used for replaying the execution anywhere and for unlimited |
| number of times. Execution replaying reads the log and replays all |
| non-deterministic events including external input, hardware clocks, |
| and interrupts. |
| |
| Several parts of QEMU include function calls to make event log recording |
| and replaying. |
| Devices' models that have non-deterministic input from external devices were |
| changed to write every external event into the execution log immediately. |
| E.g. network packets are written into the log when they arrive into the virtual |
| network adapter. |
| |
| All non-deterministic events are coming from these devices. But to |
| replay them we need to know at which moments they occur. We specify |
| these moments by counting the number of instructions executed between |
| every pair of consecutive events. |
| |
| Academic papers with description of deterministic replay implementation: |
| |
| * `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_ |
| * `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_ |
| |
| Modifications of qemu include: |
| |
| * wrappers for clock and time functions to save their return values in the log |
| * saving different asynchronous events (e.g. system shutdown) into the log |
| * synchronization of the bottom halves execution |
| * synchronization of the threads from thread pool |
| * recording/replaying user input (mouse, keyboard, and microphone) |
| * adding internal checkpoints for cpu and io synchronization |
| * network filter for recording and replaying the packets |
| * block driver for making block layer deterministic |
| * serial port input record and replay |
| * recording of random numbers obtained from the external sources |
| |
| Instruction counting |
| -------------------- |
| |
| QEMU should work in icount mode to use record/replay feature. icount was |
| designed to allow deterministic execution in absence of external inputs |
| of the virtual machine. We also use icount to control the occurrence of the |
| non-deterministic events. The number of instructions elapsed from the last event |
| is written to the log while recording the execution. In replay mode we |
| can predict when to inject that event using the instruction counter. |
| |
| Locking and thread synchronisation |
| ---------------------------------- |
| |
| Previously the synchronisation of the main thread and the vCPU thread |
| was ensured by the holding of the BQL. However the trend has been to |
| reduce the time the BQL was held across the system including under TCG |
| system emulation. As it is important that batches of events are kept |
| in sequence (e.g. expiring timers and checkpoints in the main thread |
| while instruction checkpoints are written by the vCPU thread) we need |
| another lock to keep things in lock-step. This role is now handled by |
| the replay_mutex_lock. It used to be held only for each event being |
| written but now it is held for a whole execution period. This results |
| in a deterministic ping-pong between the two main threads. |
| |
| As the BQL is now a finer grained lock than the replay_lock it is almost |
| certainly a bug, and a source of deadlocks, to take the |
| replay_mutex_lock while the BQL is held. This is enforced by an assert. |
| While the unlocks are usually in the reverse order, this is not |
| necessary; you can drop the replay_lock while holding the BQL, without |
| doing a more complicated unlock_iothread/replay_unlock/lock_iothread |
| sequence. |
| |
| Checkpoints |
| ----------- |
| |
| Replaying the execution of virtual machine is bound by sources of |
| non-determinism. These are inputs from clock and peripheral devices, |
| and QEMU thread scheduling. Thread scheduling affect on processing events |
| from timers, asynchronous input-output, and bottom halves. |
| |
| Invocations of timers are coupled with clock reads and changing the state |
| of the virtual machine. Reads produce non-deterministic data taken from |
| host clock. And VM state changes should preserve their order. Their relative |
| order in replay mode must replicate the order of callbacks in record mode. |
| To preserve this order we use checkpoints. When a specific clock is processed |
| in record mode we save to the log special "checkpoint" event. |
| Checkpoints here do not refer to virtual machine snapshots. They are just |
| record/replay events used for synchronization. |
| |
| QEMU in replay mode will try to invoke timers processing in random moment |
| of time. That's why we do not process a group of timers until the checkpoint |
| event will be read from the log. Such an event allows synchronizing CPU |
| execution and timer events. |
| |
| Two other checkpoints govern the "warping" of the virtual clock. |
| While the virtual machine is idle, the virtual clock increments at |
| 1 ns per *real time* nanosecond. This is done by setting up a timer |
| (called the warp timer) on the virtual real time clock, so that the |
| timer fires at the next deadline of the virtual clock; the virtual clock |
| is then incremented (which is called "warping" the virtual clock) as |
| soon as the timer fires or the CPUs need to go out of the idle state. |
| Two functions are used for this purpose; because these actions change |
| virtual machine state and must be deterministic, each of them creates a |
| checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so |
| starts accounting real time to virtual clock. ``icount_account_warp_timer`` |
| is called when the CPUs get an interrupt or when the warp timer fires, |
| and it warps the virtual clock by the amount of real time that has passed |
| since ``icount_start_warp_timer``. |
| |
| Virtual devices |
| =============== |
| |
| Record/replay mechanism, that could be enabled through icount mode, expects |
| the virtual devices to satisfy the following requirement: |
| everything that affects |
| the guest state during execution in icount mode should be deterministic. |
| |
| Timers |
| ------ |
| |
| Timers are used to execute callbacks from different subsystems of QEMU |
| at the specified moments of time. There are several kinds of timers: |
| |
| * Real time clock. Based on host time and used only for callbacks that |
| do not change the virtual machine state. For this reason real time |
| clock and timers does not affect deterministic replay at all. |
| * Virtual clock. These timers run only during the emulation. In icount |
| mode virtual clock value is calculated using executed instructions counter. |
| That is why it is completely deterministic and does not have to be recorded. |
| * Host clock. This clock is used by device models that simulate real time |
| sources (e.g. real time clock chip). Host clock is the one of the sources |
| of non-determinism. Host clock read operations should be logged to |
| make the execution deterministic. |
| * Virtual real time clock. This clock is similar to real time clock but |
| it is used only for increasing virtual clock while virtual machine is |
| sleeping. Due to its nature it is also non-deterministic as the host clock |
| and has to be logged too. |
| |
| All virtual devices should use virtual clock for timers that change the guest |
| state. Virtual clock is deterministic, therefore such timers are deterministic |
| too. |
| |
| Virtual devices can also use realtime clock for the events that do not change |
| the guest state directly. When the clock ticking should depend on VM execution |
| speed, use virtual clock with EXTERNAL attribute. It is not deterministic, |
| but its speed depends on the guest execution. This clock is used by |
| the virtual devices (e.g., slirp routing device) that lie outside the |
| replayed guest. |
| |
| Block devices |
| ------------- |
| |
| Block devices record/replay module (``blkreplay``) intercepts calls of |
| bdrv coroutine functions at the top of block drivers stack. |
| |
| All block completion operations are added to the queue in the coroutines. |
| When the queue is flushed the information about processed requests |
| is recorded to the log. In replay phase the queue is matched with |
| events read from the log. Therefore block devices requests are processed |
| deterministically. |
| |
| Bottom halves |
| ------------- |
| |
| Bottom half callbacks, that affect the guest state, should be invoked through |
| ``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions. |
| Their invocations are saved in record mode and synchronized with the existing |
| log in replay mode. |
| |
| Disk I/O events are completely deterministic in our model, because |
| in both record and replay modes we start virtual machine from the same |
| disk state. But callbacks that virtual disk controller uses for reading and |
| writing the disk may occur at different moments of time in record and replay |
| modes. |
| |
| Reading and writing requests are created by CPU thread of QEMU. Later these |
| requests proceed to block layer which creates "bottom halves". Bottom |
| halves consist of callback and its parameters. They are processed when |
| main loop locks the BQL. These locks are not synchronized with |
| replaying process because main loop also processes the events that do not |
| affect the virtual machine state (like user interaction with monitor). |
| |
| That is why we had to implement saving and replaying bottom halves callbacks |
| synchronously to the CPU execution. When the callback is about to execute |
| it is added to the queue in the replay module. This queue is written to the |
| log when its callbacks are executed. In replay mode callbacks are not processed |
| until the corresponding event is read from the events log file. |
| |
| Sometimes the block layer uses asynchronous callbacks for its internal purposes |
| (like reading or writing VM snapshots or disk image cluster tables). In this |
| case bottom halves are not marked as "replayable" and do not saved |
| into the log. |
| |
| Saving/restoring the VM state |
| ----------------------------- |
| |
| Record/replay relies on VM state save and restore being complete and |
| deterministic. |
| |
| All fields in the device state structure (including virtual timers) |
| should be restored by loadvm to the same values they had before savevm. |
| |
| Avoid accessing other devices' state, because the order of saving/restoring |
| is not defined. It means that you should not call functions like |
| ``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid |
| the dependencies that may make restoring the VM state non-deterministic. |
| |
| Stopping the VM |
| --------------- |
| |
| Stopping the guest should not interfere with its state (with the exception |
| of the network connections, that could be broken by the remote timeouts). |
| VM can be stopped at any moment of replay by the user. Restarting the VM |
| after that stop should not break the replay by the unneeded guest state change. |
| |
| Replay log format |
| ================= |
| |
| Record/replay log consists of the header and the sequence of execution |
| events. The header includes 4-byte replay version id and 8-byte reserved |
| field. Version is updated every time replay log format changes to prevent |
| using replay log created by another build of qemu. |
| |
| The sequence of the events describes virtual machine state changes. |
| It includes all non-deterministic inputs of VM, synchronization marks and |
| instruction counts used to correctly inject inputs at replay. |
| |
| Synchronization marks (checkpoints) are used for synchronizing qemu threads |
| that perform operations with virtual hardware. These operations may change |
| system's state (e.g., change some register or generate interrupt) and |
| therefore should execute synchronously with CPU thread. |
| |
| Every event in the log includes 1-byte event id and optional arguments. |
| When argument is an array, it is stored as 4-byte array length |
| and corresponding number of bytes with data. |
| Here is the list of events that are written into the log: |
| |
| - EVENT_INSTRUCTION. Instructions executed since last event. Followed by: |
| |
| - 4-byte number of executed instructions. |
| |
| - EVENT_INTERRUPT. Used to synchronize interrupt processing. |
| - EVENT_EXCEPTION. Used to synchronize exception handling. |
| - EVENT_ASYNC. This is a group of events. When such an event is generated, |
| it is stored in the queue and processed in icount_account_warp_timer(). |
| Every such event has it's own id from the following list: |
| |
| - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes |
| callbacks that affect virtual machine state, but normally called |
| asynchronously. Followed by: |
| |
| - 8-byte operation id. |
| |
| - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains |
| parameters of keyboard and mouse input operations |
| (key press/release, mouse pointer movement). Followed by: |
| |
| - 9-16 bytes depending of input event. |
| |
| - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. |
| - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input |
| initiated by the sender. Followed by: |
| |
| - 1-byte character device id. |
| - Array with bytes were read. |
| |
| - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize |
| operations with disk and flash drives with CPU. Followed by: |
| |
| - 8-byte operation id. |
| |
| - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by: |
| |
| - 1-byte network adapter id. |
| - 4-byte packet flags. |
| - Array with packet bytes. |
| |
| - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, |
| e.g., by closing the window. |
| - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by: |
| |
| - 4-byte output function return value. |
| - 4-byte offset in the output array. |
| |
| - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, |
| initiated by qemu. Followed by: |
| |
| - Array with bytes that were read. |
| |
| - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, |
| initiated by qemu. Followed by: |
| |
| - 4-byte error code. |
| |
| - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by: |
| |
| - 8-byte clock value. |
| |
| - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of |
| CPU, internal threads, and asynchronous input events. |
| - EVENT_END. Last event in the log. |