Copyright (c) 2010-2015 Institute for System Programming
                        of the Russian Academy of Sciences.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Record/replay
-------------

Record/replay functions are used for the deterministic replay of QEMU execution.
Execution recording writes a log of non-deterministic events, which can later
be used to replay the execution anywhere and an unlimited number of times.
It also supports checkpointing for faster rewind to a specific moment of the
replay. Execution replaying reads the log and replays all non-deterministic
events, including external input, hardware clocks, and interrupts.

Deterministic replay has the following features:
 * Deterministically replays whole system execution and all contents of
   the memory, state of the hardware devices, clocks, and screen of the VM.
 * Writes the execution log into a file that can be replayed multiple times
   on different machines.
 * Supports i386, x86_64, and Arm hardware platforms.
 * Performs deterministic replay of all operations with keyboard and mouse
   input devices.

Usage of the record/replay:
 * First, record the execution with the following command line:
    qemu-system-i386 \
     -icount shift=7,rr=record,rrfile=replay.bin \
     -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
     -device ide-hd,drive=img-blkreplay \
     -netdev user,id=net1 -device rtl8139,netdev=net1 \
     -object filter-replay,id=replay,netdev=net1
 * After recording, you can replay it by using another command line:
    qemu-system-i386 \
     -icount shift=7,rr=replay,rrfile=replay.bin \
     -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
     -device ide-hd,drive=img-blkreplay \
     -netdev user,id=net1 -device rtl8139,netdev=net1 \
     -object filter-replay,id=replay,netdev=net1
   The only difference from recording is that the rr option changes
   from record to replay.
 * Block device images are not actually changed in the recording mode,
   because all of the changes are written to the temporary overlay file.
   This behavior is enabled by using the blkreplay driver. It should be used
   for every enabled block device, as described in the 'Block devices' section.
 * The '-net none' option should be specified when the network is not used,
   because QEMU adds a network card by default. When the network is needed,
   it should be configured explicitly with the replay filter, as described
   in the 'Network devices' section.
 * Interactions with audio devices and serial ports are recorded and replayed
   automatically when such devices are enabled.

Academic papers describing the deterministic replay implementation:
http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
http://dl.acm.org/citation.cfm?id=2786805.2803179

Modifications of QEMU include:
 * wrappers for clock and time functions to save their return values in the log
 * saving different asynchronous events (e.g. system shutdown) into the log
 * synchronization of the bottom halves execution
 * synchronization of the threads from thread pool
 * recording/replaying user input (mouse, keyboard, and microphone)
 * adding internal checkpoints for cpu and io synchronization
 * network filter for recording and replaying the packets
 * block driver for making block layer deterministic
 * serial port input record and replay
 * recording of random numbers obtained from the external sources

Locking and thread synchronisation
----------------------------------

Previously the synchronisation of the main thread and the vCPU thread
was ensured by the holding of the BQL. However the trend has been to
reduce the time the BQL was held across the system including under TCG
system emulation. As it is important that batches of events are kept
in sequence (e.g. expiring timers and checkpoints in the main thread
while instruction checkpoints are written by the vCPU thread) we need
another lock to keep things in lock-step. This role is now handled by
the replay_mutex_lock. It used to be held only for each event being
written but now it is held for a whole execution period. This results
in a deterministic ping-pong between the two main threads.

As the BQL is now a finer grained lock than the replay_lock it is almost
certainly a bug, and a source of deadlocks, to take the
replay_mutex_lock while the BQL is held. This is enforced by an assert.
While the unlocks are usually in the reverse order, this is not
necessary; you can drop the replay_lock while holding the BQL, without
doing a more complicated unlock_iothread/replay_unlock/lock_iothread
sequence.
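
The lock ordering rule above can be illustrated with a small Python model.
This is only a sketch, not QEMU code: the helper names (bql_lock, bql_unlock,
replay_mutex_unlock) and the per-thread bookkeeping are assumptions made for
the example. It shows the assert on the acquisition order and that the
release order is free.

```python
import threading

bql = threading.Lock()            # stands in for the BQL
replay_mutex = threading.Lock()   # stands in for the replay mutex
_held = threading.local()         # per-thread record of which locks are held


def _holds_bql():
    return getattr(_held, "bql", False)


def replay_mutex_lock():
    # Taking the replay mutex while holding the BQL is treated as a bug.
    assert not _holds_bql(), "replay mutex must be taken before the BQL"
    replay_mutex.acquire()


def replay_mutex_unlock():
    # Allowed even while the BQL is still held.
    replay_mutex.release()


def bql_lock():
    bql.acquire()
    _held.bql = True


def bql_unlock():
    _held.bql = False
    bql.release()
```

A thread may call replay_mutex_lock() and then bql_lock(), and release the
two in either order; calling replay_mutex_lock() after bql_lock() trips the
assert.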

Non-deterministic events
------------------------

Our record/replay system is based on saving and replaying non-deterministic
events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
from HDD or memory of the VM). Saving only non-deterministic events makes
the log file smaller and the simulation faster.

The following non-deterministic data from peripheral devices is saved into
the log: mouse and keyboard input, network packets, audio controller input,
serial port input, and hardware clocks (they are non-deterministic
too, because their values are taken from the host machine). Inputs from
simulated hardware, memory of the VM, software interrupts, and execution of
instructions are not saved into the log, because they are deterministic and
can be replayed by simulating the behavior of the virtual machine starting
from the initial state.

We had to solve three tasks to implement deterministic replay: recording
non-deterministic events, replaying non-deterministic events, and checking
that there is no divergence between record and replay modes.

We changed several parts of QEMU to support recording and replaying the event
log. Device models that have non-deterministic input from external devices were
changed to write every external event into the execution log immediately.
E.g. network packets are written into the log when they arrive at the virtual
network adapter.

All non-deterministic events come from these devices. But to
replay them we need to know at which moments they occur. We specify
these moments by counting the number of instructions executed between
every pair of consecutive events.

Instruction counting
--------------------

QEMU must run in icount mode to use the record/replay feature. icount was
designed to allow deterministic execution in the absence of external inputs
to the virtual machine. We also use icount to control the occurrence of
non-deterministic events. The number of instructions executed since the last
event is written to the log while recording the execution. In replay mode we
can determine when to inject that event using the instruction counter.
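
As an illustration of the idea, here is a minimal Python sketch of an event
log that stores the instruction distance between consecutive events. The
class and function names are hypothetical and the payloads are plain strings;
the real log is binary and is written by QEMU itself.

```python
from dataclasses import dataclass


@dataclass
class LoggedEvent:
    instructions_before: int  # instructions executed since the previous event
    payload: str              # stand-in for the recorded input data


class Recorder:
    """Record mode: count instructions and log each external event."""

    def __init__(self):
        self.log = []
        self._since_last = 0

    def execute(self, count):
        self._since_last += count

    def external_event(self, payload):
        self.log.append(LoggedEvent(self._since_last, payload))
        self._since_last = 0


def injection_points(log):
    """Replay mode: absolute instruction count at which to inject each event."""
    points = []
    icount = 0
    for event in log:
        icount += event.instructions_before
        points.append((icount, event.payload))
    return points
```

For example, recording 100 instructions, a key press, 250 more instructions,
and a network packet yields injection points at instruction counts 100 and
350 during replay.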

Timers
------

Timers are used to execute callbacks from different subsystems of QEMU
at the specified moments of time. There are several kinds of timers:
 * Real time clock. Based on host time and used only for callbacks that
   do not change the virtual machine state. For this reason the real time
   clock and its timers do not affect deterministic replay at all.
 * Virtual clock. These timers run only during the emulation. In icount
   mode the virtual clock value is calculated from the executed instructions
   counter. That is why it is completely deterministic and does not have to
   be recorded.
 * Host clock. This clock is used by device models that simulate real time
   sources (e.g. real time clock chip). The host clock is one of the sources
   of non-determinism. Host clock read operations should be logged to
   make the execution deterministic.
 * Virtual real time clock. This clock is similar to the real time clock but
   it is used only for increasing the virtual clock while the virtual machine
   is sleeping. Like the host clock, it is non-deterministic by nature
   and has to be logged too.
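
Logging of host clock reads can be pictured as a wrapper around the clock
source. The following Python sketch is an assumption-laden model (the
ClockWrapper class and its parameters are invented for the example): in
record mode it reads the host clock and saves the value, in replay mode it
returns the saved values in the same order and never touches the host clock.

```python
import time
from collections import deque


class ClockWrapper:
    """Hypothetical wrapper around a non-deterministic clock source."""

    def __init__(self, mode, log=None, read_host=time.time_ns):
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.mode = mode
        self.log = deque(log or [])
        self._read_host = read_host

    def read(self):
        if self.mode == "record":
            value = self._read_host()   # non-deterministic host value
            self.log.append(value)      # saved so replay can return it
            return value
        # Replay: an empty log here would mean the replay has diverged
        # from the recording (popleft raises IndexError).
        return self.log.popleft()
```

A virtual clock derived from the instruction counter needs no such wrapper,
because its value is reproduced by the execution itself.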

Checkpoints
-----------

Replaying the execution of a virtual machine is bound by sources of
non-determinism. These are inputs from clocks and peripheral devices,
and QEMU thread scheduling. Thread scheduling affects the processing of events
from timers, asynchronous input-output, and bottom halves.

Invocations of timers are coupled with clock reads and with changes to the
state of the virtual machine. Reads produce non-deterministic data taken from
the host clock, and VM state changes must preserve their order: their relative
order in replay mode must replicate the order of callbacks in record mode.
To preserve this order we use checkpoints. When a specific clock is processed
in record mode we save a special "checkpoint" event to the log.
Checkpoints here do not refer to virtual machine snapshots. They are just
record/replay events used for synchronization.

QEMU in replay mode may try to invoke timer processing at arbitrary moments
of time. That is why we do not process a group of timers until the checkpoint
event is read from the log. Such an event allows synchronizing CPU
execution and timer events.
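
The gating of timer processing on checkpoint events can be sketched as
follows. This Python model is illustrative only: the event tuples, the
checkpoint id string, and the function names are assumptions, not the names
used in the QEMU sources.

```python
from collections import deque


class ReplayLog:
    """Toy log: a queue of ("instr", count) and ("checkpoint", id) events."""

    def __init__(self, events):
        self.events = deque(events)

    def cpu_step(self):
        """Consume the next event if it is an instruction event."""
        if self.events and self.events[0][0] == "instr":
            return self.events.popleft()[1]
        return 0

    def try_checkpoint(self, checkpoint_id):
        """Consume the next event only if it is the expected checkpoint."""
        if self.events and self.events[0] == ("checkpoint", checkpoint_id):
            self.events.popleft()
            return True
        return False


def process_timers(log, checkpoint_id, expired, fired):
    """Run the expired timers of one clock once the log allows it."""
    if not log.try_checkpoint(checkpoint_id):
        return False        # too early: the recorded run was not here yet
    fired.extend(expired)   # stand-in for invoking the timer callbacks
    return True
```

If the main loop asks to process timers before the CPU has consumed the
preceding instruction event, process_timers returns False and nothing runs;
the same call succeeds once the checkpoint is at the head of the log.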

Two other checkpoints govern the "warping" of the virtual clock.
While the virtual machine is idle, the virtual clock increments at
1 ns per *real time* nanosecond. This is done by setting up a timer
(called the warp timer) on the virtual real time clock, so that the
timer fires at the next deadline of the virtual clock; the virtual clock
is then incremented (which is called "warping" the virtual clock) as
soon as the timer fires or the CPUs need to go out of the idle state.
Two functions are used for this purpose; because these actions change
virtual machine state and must be deterministic, each of them creates a
checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so
starts accounting real time to virtual clock. qemu_account_warp_timer
is called when the CPUs get an interrupt or when the warp timer fires,
and it warps the virtual clock by the amount of real time that has passed
since qemu_start_warp_timer.

Bottom halves
-------------

Disk I/O events are completely deterministic in our model, because
in both record and replay modes we start the virtual machine from the same
disk state. But callbacks that the virtual disk controller uses for reading and
writing the disk may occur at different moments of time in record and replay
modes.

Read and write requests are created by the CPU thread of QEMU. Later these
requests proceed to the block layer, which creates "bottom halves". A bottom
half consists of a callback and its parameters. Bottom halves are processed
when the main loop locks the global mutex. These locks are not synchronized with
the replaying process because the main loop also processes events that do not
affect the virtual machine state (like user interaction with the monitor).

That is why we had to implement saving and replaying bottom-half callbacks
synchronously with the CPU execution. When a callback is about to execute
it is added to the queue in the replay module. This queue is written to the
log when its callbacks are executed. In replay mode callbacks are not processed
until the corresponding event is read from the events log file.

Sometimes the block layer uses asynchronous callbacks for its internal purposes
(like reading or writing VM snapshots or disk image cluster tables). In this
case bottom halves are not marked as "replayable" and are not saved
into the log.
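
The queueing scheme for replayable bottom halves can be modelled in a few
lines of Python. This is a simplified sketch under stated assumptions: the
BottomHalfQueue class, the integer operation ids, and the flush method are
invented for the example, and non-replayable callbacks are simply run at once.

```python
from collections import deque


class BottomHalfQueue:
    """Toy model of the replay module's queue of bottom-half callbacks."""

    def __init__(self, mode, log=None):
        self.mode = mode              # "record" or "replay"
        self.log = deque(log or [])   # operation ids in execution order
        self.pending = {}             # operation id -> callback

    def schedule(self, op_id, callback, replayable=True):
        if not replayable:
            callback()                # internal callback: run, do not log
            return
        self.pending[op_id] = callback

    def flush(self):
        """Run queued callbacks at a synchronization point."""
        if self.mode == "record":
            for op_id, callback in list(self.pending.items()):
                self.log.append(op_id)
                callback()
            self.pending.clear()
        else:
            # Replay: run a callback only when the log says it ran next.
            while self.log and self.log[0] in self.pending:
                op_id = self.log.popleft()
                self.pending.pop(op_id)()
```

In replay mode a callback that becomes ready earlier than it did during
recording stays in the queue until the callbacks logged before it have run,
so the execution order matches the recorded one.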

Block devices
-------------

The block devices record/replay module intercepts calls of
bdrv coroutine functions at the top of the block drivers stack.
To record and replay block operations the drive must be configured
as follows:
 -drive file=disk.qcow2,if=none,snapshot,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

The blkreplay driver should be inserted between the disk image and the virtual
drive controller. This way all disk requests can be recorded and replayed.

All block completion operations are added to the queue in the coroutines.
The queue is flushed at checkpoints and information about processed requests
is recorded to the log. In the replay phase the queue is matched with
events read from the log. Therefore block device requests are processed
deterministically.

Snapshotting
------------

New VM snapshots may be created in replay mode. They can be used later
to recover the desired VM state. All VM states created in replay mode
are associated with the moment of time in the replay scenario.
After recovering the VM state replay will start from that position.

The default starting snapshot name may be specified with the icount field
rrsnapshot as follows:
 -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name

This snapshot is created at the start of recording and restored at the start
of replaying. It can also be loaded while replaying to roll back
the execution.

The 'snapshot' flag of the disk image must be removed to save the snapshots
in the overlay (or original image) instead of using the temporary overlay.
 -drive file=disk.ovl,if=none,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

Use the QEMU monitor to create additional snapshots. The 'savevm <name>'
command creates a snapshot and 'loadvm <name>' restores it. To prevent
corruption of the original disk image, use overlay files linked to the
original images. All new snapshots (including the starting one) will then be
saved in the overlays and the original image remains unchanged.

Network devices
---------------

Record and replay of network interactions is performed with the network filter.
Each backend must have its own instance of the replay filter as follows:
 -netdev user,id=net1 -device rtl8139,netdev=net1
 -object filter-replay,id=replay,netdev=net1

The replay network filter is used to record and replay network packets. While
recording the virtual machine this filter puts all packets coming from
the outer world into the log. In replay mode packets from the log are
injected into the network device. All interactions with the network backend
are disabled in replay mode.

Audio devices
-------------

Audio data is recorded and replayed automatically. The command lines for
recording and replaying must contain identical specifications of audio
hardware, e.g.:
 -soundhw ac97

Serial ports
------------

Serial port input is recorded and replayed automatically. The command lines
for recording and replaying must specify an identical number of ports,
but their backends may differ.
E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.

Replay log format
-----------------

The record/replay log consists of a header and a sequence of execution
events. The header includes a 4-byte replay version id and an 8-byte reserved
field. The version is updated every time the replay log format changes, to
prevent using a replay log created by another build of QEMU.

The sequence of events describes virtual machine state changes.
It includes all non-deterministic inputs of the VM, synchronization marks, and
instruction counts used to correctly inject inputs at replay.

Synchronization marks (checkpoints) are used for synchronizing QEMU threads
that perform operations with virtual hardware. These operations may change
the system's state (e.g., change some register or generate an interrupt) and
therefore should execute synchronously with the CPU thread.

Every event in the log includes a 1-byte event id and optional arguments.
When an argument is an array, it is stored as a 4-byte array length
followed by the corresponding number of data bytes.
Here is the list of events that are written into the log:

 - EVENT_INSTRUCTION. Instructions executed since last event.
   Argument: 4-byte number of executed instructions.
 - EVENT_INTERRUPT. Used to synchronize interrupt processing.
 - EVENT_EXCEPTION. Used to synchronize exception handling.
 - EVENT_ASYNC. This is a group of events. They are always processed
   together with checkpoints. When such an event is generated, it is
   stored in the queue and processed only when a checkpoint occurs.
   Every such event is followed by a 1-byte checkpoint id and a 1-byte
   async event id from the following list:
     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
       callbacks that affect virtual machine state, but are normally called
       asynchronously.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
       parameters of keyboard and mouse input operations
       (key press/release, mouse pointer movement).
       Arguments: 9-16 bytes depending on the input event.
     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
       initiated by the sender.
       Arguments: 1-byte character device id.
                  Array with bytes that were read.
     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
       operations with disk and flash drives with CPU.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_NET. Incoming network packet.
       Arguments: 1-byte network adapter id.
                  4-byte packet flags.
                  Array with packet bytes.
 - EVENT_SHUTDOWN. Occurs when the user sends a shutdown event to QEMU,
   e.g., by closing the window.
 - EVENT_CHAR_WRITE. Used to synchronize character output operations.
   Arguments: 4-byte output function return value.
              4-byte offset in the output array.
 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
   initiated by QEMU.
   Argument: Array with bytes that were read.
 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
   initiated by QEMU.
   Argument: 4-byte error code.
 - EVENT_CLOCK + clock_id. Group of events for host clock read operations.
   Argument: 8-byte clock value.
 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
   CPU, internal threads, and asynchronous input events. May be followed
   by one or more EVENT_ASYNC events.
 - EVENT_END. Last event in the log.
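
The field sizes described in this section can be exercised with a short
Python sketch that packs and unpacks a header, an event id, and an array
argument. Two things are assumptions here and should be checked against the
QEMU sources before relying on them: the big-endian byte order, and the
numeric event ids, which this document does not list. The helper names are
hypothetical.

```python
import struct

# Assumption: multi-byte integers are stored big-endian. The text above
# gives field sizes only, so verify the byte order against your build.
HEADER = struct.Struct(">IQ")     # 4-byte version id, 8-byte reserved field
ARRAY_LEN = struct.Struct(">I")   # 4-byte array length prefix


def write_header(stream, version):
    stream.write(HEADER.pack(version, 0))


def read_header(stream):
    version, _reserved = HEADER.unpack(stream.read(HEADER.size))
    return version


def write_event(stream, event_id, payload=b""):
    """Write a 1-byte event id followed by already-encoded arguments."""
    stream.write(bytes([event_id]))
    stream.write(payload)


def write_array(stream, data):
    stream.write(ARRAY_LEN.pack(len(data)))
    stream.write(data)


def read_array(stream):
    (length,) = ARRAY_LEN.unpack(stream.read(ARRAY_LEN.size))
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated array argument")
    return data
```

With an io.BytesIO stream, writing a header, one event with a 4-byte
argument, and a 3-byte array produces 12 + 1 + 4 + 4 + 3 bytes, which can be
read back in the same order with the read helpers.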