Pavel Dovgalyuk | d73abd6 | 2015-09-17 19:23:37 +0300 | [diff] [blame] | 1 | Copyright (c) 2010-2015 Institute for System Programming |
| 2 | of the Russian Academy of Sciences. |
| 3 | |
| 4 | This work is licensed under the terms of the GNU GPL, version 2 or later. |
| 5 | See the COPYING file in the top-level directory. |
| 6 | |
| 7 | Record/replay |
| 8 | ------------- |
| 9 | |
| 10 | Record/replay functions are used for the reverse execution and deterministic |
| 11 | replay of qemu execution. This implementation of deterministic replay can |
| 12 | be used for deterministic debugging of guest code through a gdb remote |
| 13 | interface. |
| 14 | |
| 15 | Execution recording writes a non-deterministic events log, which can be later |
| 16 | used for replaying the execution anywhere and for unlimited number of times. |
| 17 | It also supports checkpointing for faster rewinding during reverse debugging. |
| 18 | Execution replaying reads the log and replays all non-deterministic events |
| 19 | including external input, hardware clocks, and interrupts. |
| 20 | |
| 21 | Deterministic replay has the following features: |
| 22 | * Deterministically replays whole system execution and all contents of |
| 23 | the memory, state of the hardware devices, clocks, and screen of the VM. |
| 24 | * Writes execution log into the file for later replaying for multiple times |
| 25 | on different machines. |
| 26 | * Supports i386, x86_64, and ARM hardware platforms. |
| 27 | * Performs deterministic replay of all operations with keyboard and mouse |
| 28 | input devices. |
| 29 | |
| 30 | Usage of the record/replay: |
| 31 | * First, record the execution, by adding the following arguments to the command line: |
| 32 | '-icount shift=7,rr=record,rrfile=replay.bin -net none'. |
| 33 | Block devices' images are not actually changed in the recording mode, |
| 34 | because all of the changes are written to the temporary overlay file. |
| 35 | * Then you can replay it by using another command |
| 36 | line option: '-icount shift=7,rr=replay,rrfile=replay.bin -net none' |
| 37 | * '-net none' option should also be specified if network replay patches |
| 38 | are not applied. |
| 39 | |
| 40 | Papers with description of deterministic replay implementation: |
| 41 | http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html |
| 42 | http://dl.acm.org/citation.cfm?id=2786805.2803179 |
| 43 | |
| 44 | Modifications of qemu include: |
| 45 | * wrappers for clock and time functions to save their return values in the log |
| 46 | * saving different asynchronous events (e.g. system shutdown) into the log |
| 47 | * synchronization of the bottom halves execution |
| 48 | * synchronization of the threads from thread pool |
| 49 | * recording/replaying user input (mouse and keyboard) |
| 50 | * adding internal checkpoints for cpu and io synchronization |
| 51 | |
| 52 | Non-deterministic events |
| 53 | ------------------------ |
| 54 | |
| 55 | Our record/replay system is based on saving and replaying non-deterministic |
| 56 | events (e.g. keyboard input) and simulating deterministic ones (e.g. reading |
| 57 | from HDD or memory of the VM). Saving only non-deterministic events makes |
| 58 | log file smaller, simulation faster, and allows using reverse debugging even |
| 59 | for realtime applications. |
| 60 | |
| 61 | The following non-deterministic data from peripheral devices is saved into |
| 62 | the log: mouse and keyboard input, network packets, audio controller input, |
| 63 | USB packets, serial port input, and hardware clocks (they are non-deterministic |
| 64 | too, because their values are taken from the host machine). Inputs from |
| 65 | simulated hardware, memory of VM, software interrupts, and execution of |
| 66 | instructions are not saved into the log, because they are deterministic and |
| 67 | can be replayed by simulating the behavior of virtual machine starting from |
| 68 | initial state. |
| 69 | |
| 70 | We had to solve three tasks to implement deterministic replay: recording |
| 71 | non-deterministic events, replaying non-deterministic events, and checking |
| 72 | that there is no divergence between record and replay modes. |
| 73 | |
| 74 | We changed several parts of QEMU to make event log recording and replaying. |
| 75 | Devices' models that have non-deterministic input from external devices were |
| 76 | changed to write every external event into the execution log immediately. |
| 77 | E.g. network packets are written into the log when they arrive into the virtual |
| 78 | network adapter. |
| 79 | |
| 80 | All non-deterministic events are coming from these devices. But to |
| 81 | replay them we need to know at which moments they occur. We specify |
| 82 | these moments by counting the number of instructions executed between |
| 83 | every pair of consecutive events. |
| 84 | |
| 85 | Instruction counting |
| 86 | -------------------- |
| 87 | |
| 88 | QEMU should work in icount mode to use record/replay feature. icount was |
| 89 | designed to allow deterministic execution in absence of external inputs |
| 90 | of the virtual machine. We also use icount to control the occurrence of the |
| 91 | non-deterministic events. The number of instructions elapsed from the last event |
| 92 | is written to the log while recording the execution. In replay mode we |
| 93 | can predict when to inject that event using the instruction counter. |
| 94 | |
| 95 | Timers |
| 96 | ------ |
| 97 | |
| 98 | Timers are used to execute callbacks from different subsystems of QEMU |
| 99 | at the specified moments of time. There are several kinds of timers: |
| 100 | * Real time clock. Based on host time and used only for callbacks that |
| 101 | do not change the virtual machine state. For this reason real time |
| 102 | clock and timers does not affect deterministic replay at all. |
| 103 | * Virtual clock. These timers run only during the emulation. In icount |
| 104 | mode virtual clock value is calculated using executed instructions counter. |
| 105 | That is why it is completely deterministic and does not have to be recorded. |
| 106 | * Host clock. This clock is used by device models that simulate real time |
| 107 | sources (e.g. real time clock chip). Host clock is the one of the sources |
| 108 | of non-determinism. Host clock read operations should be logged to |
| 109 | make the execution deterministic. |
| 110 | * Real time clock for icount. This clock is similar to real time clock but |
| 111 | it is used only for increasing virtual clock while virtual machine is |
| 112 | sleeping. Due to its nature it is also non-deterministic as the host clock |
| 113 | and has to be logged too. |
| 114 | |
| 115 | Checkpoints |
| 116 | ----------- |
| 117 | |
| 118 | Replaying of the execution of virtual machine is bound by sources of |
| 119 | non-determinism. These are inputs from clock and peripheral devices, |
| 120 | and QEMU thread scheduling. Thread scheduling affect on processing events |
| 121 | from timers, asynchronous input-output, and bottom halves. |
| 122 | |
| 123 | Invocations of timers are coupled with clock reads and changing the state |
| 124 | of the virtual machine. Reads produce non-deterministic data taken from |
| 125 | host clock. And VM state changes should preserve their order. Their relative |
| 126 | order in replay mode must replicate the order of callbacks in record mode. |
| 127 | To preserve this order we use checkpoints. When a specific clock is processed |
| 128 | in record mode we save to the log special "checkpoint" event. |
| 129 | Checkpoints here do not refer to virtual machine snapshots. They are just |
| 130 | record/replay events used for synchronization. |
| 131 | |
| 132 | QEMU in replay mode will try to invoke timers processing in random moment |
| 133 | of time. That's why we do not process a group of timers until the checkpoint |
| 134 | event will be read from the log. Such an event allows synchronizing CPU |
| 135 | execution and timer events. |
| 136 | |
| 137 | Another checkpoints application in record/replay is instruction counting |
| 138 | while the virtual machine is idle. This function (qemu_clock_warp) is called |
| 139 | from the wait loop. It changes virtual machine state and must be deterministic |
| 140 | then. That is why we added checkpoint to this function to prevent its |
| 141 | operation in replay mode when it does not correspond to record mode. |
| 142 | |
| 143 | Bottom halves |
| 144 | ------------- |
| 145 | |
| 146 | Disk I/O events are completely deterministic in our model, because |
| 147 | in both record and replay modes we start virtual machine from the same |
| 148 | disk state. But callbacks that virtual disk controller uses for reading and |
| 149 | writing the disk may occur at different moments of time in record and replay |
| 150 | modes. |
| 151 | |
| 152 | Reading and writing requests are created by CPU thread of QEMU. Later these |
| 153 | requests proceed to block layer which creates "bottom halves". Bottom |
| 154 | halves consist of callback and its parameters. They are processed when |
| 155 | main loop locks the global mutex. These locks are not synchronized with |
| 156 | replaying process because main loop also processes the events that do not |
| 157 | affect the virtual machine state (like user interaction with monitor). |
| 158 | |
| 159 | That is why we had to implement saving and replaying bottom halves callbacks |
| 160 | synchronously to the CPU execution. When the callback is about to execute |
| 161 | it is added to the queue in the replay module. This queue is written to the |
| 162 | log when its callbacks are executed. In replay mode callbacks are not processed |
| 163 | until the corresponding event is read from the events log file. |
| 164 | |
| 165 | Sometimes the block layer uses asynchronous callbacks for its internal purposes |
| 166 | (like reading or writing VM snapshots or disk image cluster tables). In this |
| 167 | case bottom halves are not marked as "replayable" and do not saved |
| 168 | into the log. |