| Copyright (c) 2010-2015 Institute for System Programming |
| of the Russian Academy of Sciences. |
| |
| This work is licensed under the terms of the GNU GPL, version 2 or later. |
| See the COPYING file in the top-level directory. |
| |
| Record/replay |
| ------------- |
| |
| Record/replay functions are used for the reverse execution and deterministic |
| replay of qemu execution. This implementation of deterministic replay can |
| be used for deterministic debugging of guest code through a gdb remote |
| interface. |
| |
| Execution recording writes a non-deterministic events log, which can be later |
| used for replaying the execution anywhere and for unlimited number of times. |
| It also supports checkpointing for faster rewinding during reverse debugging. |
| Execution replaying reads the log and replays all non-deterministic events |
| including external input, hardware clocks, and interrupts. |
| |
| Deterministic replay has the following features: |
| * Deterministically replays whole system execution and all contents of |
| the memory, state of the hardware devices, clocks, and screen of the VM. |
| * Writes execution log into the file for later replaying for multiple times |
| on different machines. |
| * Supports i386, x86_64, and ARM hardware platforms. |
| * Performs deterministic replay of all operations with keyboard and mouse |
| input devices. |
| |
| Usage of the record/replay: |
| * First, record the execution, by adding the following arguments to the command line: |
| '-icount shift=7,rr=record,rrfile=replay.bin -net none'. |
| Block devices' images are not actually changed in the recording mode, |
| because all of the changes are written to the temporary overlay file. |
| * Then you can replay it by using another command |
| line option: '-icount shift=7,rr=replay,rrfile=replay.bin -net none' |
| * '-net none' option should also be specified if network replay patches |
| are not applied. |
| |
| Papers with description of deterministic replay implementation: |
| http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html |
| http://dl.acm.org/citation.cfm?id=2786805.2803179 |
| |
| Modifications of qemu include: |
| * wrappers for clock and time functions to save their return values in the log |
| * saving different asynchronous events (e.g. system shutdown) into the log |
| * synchronization of the bottom halves execution |
| * synchronization of the threads from thread pool |
| * recording/replaying user input (mouse and keyboard) |
| * adding internal checkpoints for cpu and io synchronization |
| |
| Non-deterministic events |
| ------------------------ |
| |
| Our record/replay system is based on saving and replaying non-deterministic |
| events (e.g. keyboard input) and simulating deterministic ones (e.g. reading |
| from HDD or memory of the VM). Saving only non-deterministic events makes |
| log file smaller, simulation faster, and allows using reverse debugging even |
| for realtime applications. |
| |
| The following non-deterministic data from peripheral devices is saved into |
| the log: mouse and keyboard input, network packets, audio controller input, |
| USB packets, serial port input, and hardware clocks (they are non-deterministic |
| too, because their values are taken from the host machine). Inputs from |
| simulated hardware, memory of VM, software interrupts, and execution of |
| instructions are not saved into the log, because they are deterministic and |
| can be replayed by simulating the behavior of virtual machine starting from |
| initial state. |
| |
| We had to solve three tasks to implement deterministic replay: recording |
| non-deterministic events, replaying non-deterministic events, and checking |
| that there is no divergence between record and replay modes. |
| |
| We changed several parts of QEMU to make event log recording and replaying. |
| Devices' models that have non-deterministic input from external devices were |
| changed to write every external event into the execution log immediately. |
| E.g. network packets are written into the log when they arrive into the virtual |
| network adapter. |
| |
| All non-deterministic events are coming from these devices. But to |
| replay them we need to know at which moments they occur. We specify |
| these moments by counting the number of instructions executed between |
| every pair of consecutive events. |
| |
| Instruction counting |
| -------------------- |
| |
| QEMU should work in icount mode to use record/replay feature. icount was |
| designed to allow deterministic execution in absence of external inputs |
| of the virtual machine. We also use icount to control the occurrence of the |
| non-deterministic events. The number of instructions elapsed from the last event |
| is written to the log while recording the execution. In replay mode we |
| can predict when to inject that event using the instruction counter. |
| |
| Timers |
| ------ |
| |
| Timers are used to execute callbacks from different subsystems of QEMU |
| at the specified moments of time. There are several kinds of timers: |
| * Real time clock. Based on host time and used only for callbacks that |
| do not change the virtual machine state. For this reason real time |
| clock and timers does not affect deterministic replay at all. |
| * Virtual clock. These timers run only during the emulation. In icount |
| mode virtual clock value is calculated using executed instructions counter. |
| That is why it is completely deterministic and does not have to be recorded. |
| * Host clock. This clock is used by device models that simulate real time |
| sources (e.g. real time clock chip). Host clock is the one of the sources |
| of non-determinism. Host clock read operations should be logged to |
| make the execution deterministic. |
| * Virtual real time clock. This clock is similar to real time clock but |
| it is used only for increasing virtual clock while virtual machine is |
| sleeping. Due to its nature it is also non-deterministic as the host clock |
| and has to be logged too. |
| |
| Checkpoints |
| ----------- |
| |
| Replaying of the execution of virtual machine is bound by sources of |
| non-determinism. These are inputs from clock and peripheral devices, |
| and QEMU thread scheduling. Thread scheduling affect on processing events |
| from timers, asynchronous input-output, and bottom halves. |
| |
| Invocations of timers are coupled with clock reads and changing the state |
| of the virtual machine. Reads produce non-deterministic data taken from |
| host clock. And VM state changes should preserve their order. Their relative |
| order in replay mode must replicate the order of callbacks in record mode. |
| To preserve this order we use checkpoints. When a specific clock is processed |
| in record mode we save to the log special "checkpoint" event. |
| Checkpoints here do not refer to virtual machine snapshots. They are just |
| record/replay events used for synchronization. |
| |
| QEMU in replay mode will try to invoke timers processing in random moment |
| of time. That's why we do not process a group of timers until the checkpoint |
| event will be read from the log. Such an event allows synchronizing CPU |
| execution and timer events. |
| |
| Two other checkpoints govern the "warping" of the virtual clock. |
| While the virtual machine is idle, the virtual clock increments at |
| 1 ns per *real time* nanosecond. This is done by setting up a timer |
| (called the warp timer) on the virtual real time clock, so that the |
| timer fires at the next deadline of the virtual clock; the virtual clock |
| is then incremented (which is called "warping" the virtual clock) as |
| soon as the timer fires or the CPUs need to go out of the idle state. |
| Two functions are used for this purpose; because these actions change |
| virtual machine state and must be deterministic, each of them creates a |
| checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so |
| starts accounting real time to virtual clock. qemu_account_warp_timer |
| is called when the CPUs get an interrupt or when the warp timer fires, |
| and it warps the virtual clock by the amount of real time that has passed |
| since qemu_start_warp_timer. |
| |
| Bottom halves |
| ------------- |
| |
| Disk I/O events are completely deterministic in our model, because |
| in both record and replay modes we start virtual machine from the same |
| disk state. But callbacks that virtual disk controller uses for reading and |
| writing the disk may occur at different moments of time in record and replay |
| modes. |
| |
| Reading and writing requests are created by CPU thread of QEMU. Later these |
| requests proceed to block layer which creates "bottom halves". Bottom |
| halves consist of callback and its parameters. They are processed when |
| main loop locks the global mutex. These locks are not synchronized with |
| replaying process because main loop also processes the events that do not |
| affect the virtual machine state (like user interaction with monitor). |
| |
| That is why we had to implement saving and replaying bottom halves callbacks |
| synchronously to the CPU execution. When the callback is about to execute |
| it is added to the queue in the replay module. This queue is written to the |
| log when its callbacks are executed. In replay mode callbacks are not processed |
| until the corresponding event is read from the events log file. |
| |
| Sometimes the block layer uses asynchronous callbacks for its internal purposes |
| (like reading or writing VM snapshots or disk image cluster tables). In this |
| case bottom halves are not marked as "replayable" and do not saved |
| into the log. |
| |
| Block devices |
| ------------- |
| |
| Block devices record/replay module intercepts calls of |
| bdrv coroutine functions at the top of block drivers stack. |
| To record and replay block operations the drive must be configured |
| as following: |
| -drive file=disk.qcow,if=none,id=img-direct |
| -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay |
| -device ide-hd,drive=img-blkreplay |
| |
| blkreplay driver should be inserted between disk image and virtual driver |
| controller. Therefore all disk requests may be recorded and replayed. |
| |
| All block completion operations are added to the queue in the coroutines. |
| Queue is flushed at checkpoints and information about processed requests |
| is recorded to the log. In replay phase the queue is matched with |
| events read from the log. Therefore block devices requests are processed |
| deterministically. |
| |
| Snapshotting |
| ------------ |
| |
| New VM snapshots may be created in replay mode. They can be used later |
| to recover the desired VM state. All VM states created in replay mode |
| are associated with the moment of time in the replay scenario. |
| After recovering the VM state replay will start from that position. |
| |
| Default starting snapshot name may be specified with icount field |
| rrsnapshot as follows: |
| -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name |
| |
| This snapshot is created at start of recording and restored at start |
| of replaying. It also can be loaded while replaying to roll back |
| the execution. |
| |
| Network devices |
| --------------- |
| |
| Record and replay for network interactions is performed with the network filter. |
| Each backend must have its own instance of the replay filter as follows: |
| -netdev user,id=net1 -device rtl8139,netdev=net1 |
| -object filter-replay,id=replay,netdev=net1 |
| |
| Replay network filter is used to record and replay network packets. While |
| recording the virtual machine this filter puts all packets coming from |
| the outer world into the log. In replay mode packets from the log are |
| injected into the network device. All interactions with network backend |
| in replay mode are disabled. |