Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault Tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VM
executes until the next checkpoint. To support checkpointing of disk
contents, the modified disk contents in the Secondary VM must be buffered,
and are only dropped at the next checkpoint. To reduce the network transfer
effort during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to Secondary disk, the
       original sector content will be read from Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (it could be from either "Secondary Write Requests" or
       previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests will be written to Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer and it
       will overwrite the existing sector content in the buffer.
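
The four steps above can be modelled with a small Python sketch. It is only
an illustration and not QEMU code: the class SecondaryDisk and its method
names are hypothetical, sectors are modelled as dictionary keys, and the
read path (buffer first, then disk) is an assumption about how the
Secondary VM sees its data.

```python
class SecondaryDisk:
    """Toy model of the Secondary disk plus its Disk buffer (hypothetical)."""

    def __init__(self, sectors):
        self.disk = dict(sectors)   # contents of the Secondary disk
        self.buffer = {}            # Disk buffer: sector -> buffered content

    def primary_write(self, sector, data):
        # Step 2: save the original content, unless the buffer already holds
        # something for this sector (a Secondary write or an earlier COW).
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector)
        # Step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes always overwrite the buffered content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumed read path: buffered content wins over the disk content.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector)

    def checkpoint(self):
        # At a checkpoint the buffered modifications are dropped.
        self.buffer.clear()
```

In this model a Primary write never changes what the Secondary VM reads
until the next checkpoint, which is the property the Disk buffer exists to
provide.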

== Architecture ==
We are going to implement block replication from many basic
blocks that are already in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||                                                           virtio-blk
        /        \        ||                                                               ^
   Primary    2 filter                                                                     |
     disk         ^                                                                   7 Quorum
                  |                                                                    /
                3 NBD  ------->  3 NBD                                                /
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||              blockdev-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named replication) will control the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and the format
should support bdrv_make_empty() and backing file.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and its driver should support bdrv_make_empty() and backing file.

6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. Therefore, before block
replication starts, the primary disk and secondary disk should contain the
same data.

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.
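
The backing chain on the secondary (items 3 to 6) can be pictured with a
short Python sketch. This is a toy model under stated assumptions, not the
QEMU implementation: each disk is a plain dictionary keyed by sector
number, nbd_write() is a hypothetical helper standing in for a write that
arrives through the NBD server, and the backup job is reduced to a single
copy-before-write step.

```python
from collections import ChainMap

secondary_disk = {0: b"A", 1: b"B", 2: b"C"}   # target of NBD writes (item 3)
hidden_disk = {}                                # item 5, starts empty
active_disk = {}                                # item 4, starts empty

# Reads from the Secondary VM resolve top-down through the backing chain:
# active-disk, then hidden-disk, then the secondary disk.
guest_view = ChainMap(active_disk, hidden_disk, secondary_disk)


def nbd_write(sector, data):
    """Hypothetical stand-in for a Primary write arriving over NBD."""
    # Item 6: the backup job (sync=none) saves the old content into
    # hidden-disk before the write-through changes the secondary disk.
    hidden_disk.setdefault(sector, secondary_disk[sector])
    secondary_disk[sector] = data


nbd_write(0, b"A-primary")          # Primary VM write
active_disk[1] = b"B-secondary"     # Secondary VM write
```

ChainMap is used here only because its first-match lookup mirrors how a
read falls through to the backing file when the upper layer has no data
for a sector.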


== Failure Handling ==
There are 7 internal errors that can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we just report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation (which decides which target to
write to).
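
The mapping from error case to handling can be summarised as a small lookup.
The sketch below is illustrative only: the enum ReplicationError, its member
names and the function report_target() are hypothetical and do not exist in
QEMU; the numbering follows the list above.

```python
from enum import Enum


class ReplicationError(Enum):
    # Hypothetical names; the values match the numbered cases above.
    PRIMARY_DISK_IO = 1
    FORWARD_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6
    FAILOVER_FAILED = 7


def report_target(err):
    """Return where the given error case is handled, per the text above."""
    if err in (ReplicationError.PRIMARY_DISK_IO,
               ReplicationError.ACTIVE_DISK_IO):
        return "disk layer"
    if err is ReplicationError.FAILOVER_FAILED:
        return "failover failed state"
    # Cases 2, 3, 4 and 6 are reported to the FT/HA manager.
    return "FT/HA manager"
```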

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error happened in
   replication. The caller must hold the I/O mutex lock if it is in
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. If you use this API for
   anything other than failover, such as shutting down the guest, the VM
   should be stopped before calling it. The caller must hold the I/O mutex
   lock if it is in migration/checkpoint thread.

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following qmp commands in primary qemu:
    { "execute": "human-monitor-command",
      "arguments": {
        "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
        "parent": "colo1",
        "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD Client for each primary disk.
  2. host is the secondary physical machine's hostname or IP.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive and you should ignore the
     leading whitespace.
  5. The qmp commands must be run after the qmp commands in secondary
     qemu have been run.
  6. After primary failover we need to remove children.1 (replication driver).
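
When the host and port are only known at run time, the two primary-side
commands can be generated instead of typed by hand. The following Python
sketch only builds the JSON shown above; the function name
primary_qmp_commands() is hypothetical, and the address 192.0.2.10 and
port 8889 are placeholder values, not values taken from this document.
Sending the commands over a QMP socket is left out.

```python
import json


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Build the two primary-side QMP commands as Python dictionaries."""
    # drive_add is an HMP command, so it is wrapped in human-monitor-command.
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Attach the new replication node as a child of the quorum node.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


# Placeholder address and port, for illustration only.
for cmd in primary_qmp_commands("192.0.2.10", 8889):
    print(json.dumps(cmd))
```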

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following qmp commands in secondary qemu:
    { "execute": "nbd-server-start",
      "arguments": {
        "addr": {
          "type": "inet",
          "data": {
            "host": "xxx",
            "port": "xxx"
          }
        }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
        "device": "colo1",
        "writable": true
      }
    }

  Note:
  1. The export name in secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same.
  3. The qmp commands nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on primary QEMU.
  4. Active disk, hidden disk and nbd target's length should be the
     same.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.
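
The secondary-side commands can be generated in the same way. This is a
minimal Python sketch that only constructs the JSON shown above: the
function name secondary_qmp_commands() is hypothetical, and the listen
address 0.0.0.0 and port 8889 are placeholder values. The port is passed as
a string, matching the quoted "port" field in the example.

```python
import json


def secondary_qmp_commands(host, port, device="colo1"):
    """Build the two secondary-side QMP commands as Python dictionaries."""
    return [
        # Start the embedded NBD server on the given address.
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": host, "port": str(port)}}}},
        # Export the secondary disk; it must be writable so that Primary
        # write requests can be applied to it.
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


# Placeholder address and port, for illustration only.
for cmd in secondary_qmp_commands("0.0.0.0", 8889):
    print(json.dumps(cmd))
```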

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp commands
  to remove the nbd child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
      "parent": "colo1",
      "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
      "command-line": "drive_del xxxx"
    }
  }
  Note: there is no qmp command to remove the blockdev now

Secondary:
  The primary host is down, so we should run the following qmp command
  to stop the NBD server:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk