Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault Tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The states of the Primary and Secondary VMs are
identical right after a VM checkpoint, but diverge as the VMs execute
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network transfer
effort during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before Primary write requests are written to the Secondary disk, the
       original sector content is read from the Secondary disk and
       buffered in the Disk buffer, but it does not overwrite any existing
       sector content (which could be from either "Secondary Write Requests"
       or a previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite any existing sector content in the buffer.
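
The four steps above can be illustrated with a small in-memory model. This
is only a sketch, not QEMU code: the class and method names are hypothetical,
sectors are modelled as dictionary keys, and the read path (buffer first,
then Secondary disk) is an assumption inferred from the backing chain
described in the Architecture section.

```python
class SecondaryModel:
    """Toy model of the Secondary side; names are hypothetical, no real I/O."""

    def __init__(self, secondary_disk):
        self.secondary_disk = dict(secondary_disk)  # sector -> content
        self.disk_buffer = {}                       # sector -> content

    def primary_write(self, sector, data):
        # Step 2: save the original content, unless the buffer already
        # holds content for this sector (from a Secondary write or from
        # an earlier copy-on-write).
        if sector not in self.disk_buffer:
            self.disk_buffer[sector] = self.secondary_disk.get(sector)
        # Step 3: speculative write-through to the Secondary disk.
        self.secondary_disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes always overwrite the buffered content.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees buffered content first.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.secondary_disk.get(sector)

    def checkpoint(self):
        # The buffered content is dropped at checkpoint time.
        self.disk_buffer.clear()


model = SecondaryModel({0: "old0", 1: "old1"})
model.primary_write(0, "new0")       # Secondary VM still reads "old0"
model.secondary_write(1, "guest1")   # visible only to the Secondary VM
model.checkpoint()                   # Secondary VM now reads "new0"
```

Between checkpoints the Secondary VM keeps seeing its own view of the disk,
while the Secondary disk itself already holds the Primary's writes.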

== Architecture ==
Block replication is implemented from basic building blocks that are
already in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||                                                           virtio-blk
        /        \        ||                                                               ^
   Primary    2 filter                                                                     |
     disk         ^                                                                   7 Quorum
                  |                                                                    /
                3 NBD  ------->  3 NBD                                                /
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||              blockdev-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named replication) controls the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and its driver should support bdrv_make_empty() and backing files.

6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication
starts, the primary disk and secondary disk should contain the same data.

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.


== Failure Handling ==
There are 7 internal errors that can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we just report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation to decide which target to write to.
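
The mapping from error case to handler described above can be written as a
small lookup. This is a sketch only: the enum members and the returned
labels are hypothetical names chosen for illustration and do not correspond
to identifiers in QEMU.

```python
from enum import Enum


class ReplicationError(Enum):
    """The seven internal error cases, numbered as in the list above."""
    PRIMARY_DISK_IO = 1
    FORWARD_WRITE_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6
    FAILOVER_FAILED = 7


DISK_LAYER = "disk layer"
FT_HA_MANAGER = "FT/HA manager"
FAILOVER_FAILED_STATE = "failover failed state"


def handler_for(error):
    """Return which party deals with the given error case."""
    if error in (ReplicationError.PRIMARY_DISK_IO,
                 ReplicationError.ACTIVE_DISK_IO):
        return DISK_LAYER                 # cases 1 and 5
    if error is ReplicationError.FAILOVER_FAILED:
        return FAILOVER_FAILED_STATE      # case 7
    return FT_HA_MANAGER                  # cases 2, 3, 4 and 6
```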

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error happened in
   replication. The caller must hold the I/O mutex lock if it is in
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. The VM should be stopped
   before calling it if you use this API to shut down the guest, or for
   anything other than failover. The caller must hold the I/O mutex lock
   if it is in migration/checkpoint thread.
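
The call order implied by the four interfaces can be sketched as follows.
This is not the QEMU implementation: RecordingReplication is a hypothetical
stand-in that only records calls, locking is omitted, and the policy of
failing over on the first reported error is an assumption (in practice the
FT/HA manager decides).

```python
class RecordingReplication:
    """Hypothetical stand-in that records the order of interface calls."""

    def __init__(self, fail_after=None):
        self.calls = []
        self.checkpoints = 0
        self.fail_after = fail_after  # report an error after this many checkpoints

    def start_all(self):
        self.calls.append("start")

    def do_checkpoint_all(self):
        self.checkpoints += 1
        self.calls.append("checkpoint")

    def get_error_all(self):
        self.calls.append("get_error")
        return self.fail_after is not None and self.checkpoints >= self.fail_after

    def stop_all(self):
        self.calls.append("stop")


def run_checkpoints(replication, transfer_vm_state, rounds):
    """Run up to `rounds` checkpoints; stop replication on the first error."""
    replication.start_all()
    for _ in range(rounds):
        transfer_vm_state()               # all VM state reaches the Secondary first
        replication.do_checkpoint_all()   # then the Disk buffer is dropped
        if replication.get_error_all():
            replication.stop_all()        # assumed policy: fail over immediately
            return False
    return True


replication = RecordingReplication(fail_after=2)
completed = run_checkpoints(replication, lambda: None, rounds=5)
```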

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following qmp commands in primary qemu:
    { "execute": "human-monitor-command",
      "arguments": {
        "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
        "parent": "colo1",
        "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD Client for each primary disk.
  2. host is the secondary physical machine's hostname or IP.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive and you should ignore the
     leading whitespace.
  5. The qmp commands must be run after running the qmp commands in
     secondary qemu.
  6. After primary failover we need to remove children.1 (replication driver).
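
As an illustration, the two primary-side QMP commands can be generated from
the host, port and export name. This is a sketch under assumptions: the
function name is hypothetical, the host and port values are placeholders,
and the commands are only built and serialized, not sent to a QMP socket.

```python
import json


def primary_replication_commands(host, port, export="colo1",
                                 parent="colo1", node_name="nbd_client1"):
    """Build the two QMP commands shown above as Python dictionaries."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        # First add the replication node backed by the NBD client.
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Then attach it as a new child of the quorum node.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


# Placeholder address; substitute the secondary machine's hostname or IP.
for command in primary_replication_commands("192.0.2.10", 8889):
    print(json.dumps(command))
```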

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following qmp commands in secondary qemu:
    { "execute": "nbd-server-start",
      "arguments": {
        "addr": {
          "type": "inet",
          "data": {
            "host": "xxx",
            "port": "xxx"
          }
        }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
        "device": "colo1",
        "writable": true
      }
    }

  Note:
  1. The export name in secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same.
  3. The qmp commands nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on primary QEMU.
  4. The active disk, hidden disk and nbd target should have the same
     length.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.
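
Note 6 can be made concrete with a small helper that joins a continued
-drive value into the single argument QEMU expects. This is a sketch: the
helper name is hypothetical, and it assumes that every continued line of the
option value ends with ",\" as in the primary example.

```python
def join_drive_argument(text):
    """Join backslash-continued lines, dropping the leading whitespace."""
    parts = []
    for line in text.splitlines():
        piece = line.strip()
        if piece.endswith("\\"):
            piece = piece[:-1].rstrip()   # drop the continuation marker
        parts.append(piece)
    return "".join(parts)


# Value of the replication -drive option, written across several lines.
SECONDARY_REPLICATION_DRIVE = r"""
if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
       file.file.filename=active_disk.qcow2,\
       file.driver=qcow2,\
       file.backing.file.filename=hidden_disk.qcow2,\
       file.backing.driver=qcow2,\
       file.backing.backing=colo1
"""

single_argument = join_drive_argument(SECONDARY_REPLICATION_DRIVE)
print("-drive", single_argument)
```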

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp commands
  to remove the nbd child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
      "parent": "colo1",
      "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
      "command-line": "drive_del xxxx"
    }
  }
  Note: there is currently no qmp command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following qmp command:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk