| Block replication |
| ---------------------------------------- |
| Copyright Fujitsu, Corp. 2016 |
| Copyright (c) 2016 Intel Corporation |
| Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. |
| |
| This work is licensed under the terms of the GNU GPL, version 2 or later. |
| See the COPYING file in the top-level directory. |
| |
| Block replication is used for continuous checkpoints. It is designed |
| for COLO (COarse-grain LOck-stepping) where the Secondary VM is running. |
| It can also be applied to FT/HA (Fault Tolerance/High Availability) |
| scenarios, where the Secondary VM is not running. |
| |
| This document gives an overview of block replication's design. |
| |
| == Background == |
| High availability solutions such as micro checkpointing and COLO perform |
| consecutive checkpoints. The VM state of the Primary and Secondary VM is |
| identical right after a VM checkpoint, but diverges as the VM executes |
| until the next checkpoint. To support checkpointing of disk contents, |
| the modified disk contents in the Secondary VM must be buffered, and are |
| only dropped at the next checkpoint. To reduce the network transfer |
| overhead during a vmstate checkpoint, the disk modification operations of |
| the Primary disk are asynchronously forwarded to the Secondary node. |
| |
| == Workflow == |
| The following diagram shows the block replication workflow: |
| |
|         +----------------------+            +------------------------+ |
|         |Primary Write Requests|            |Secondary Write Requests| |
|         +----------------------+            +------------------------+ |
|                   |                                       | |
|                   |                                      (4) |
|                   |                                       V |
|                   |                              /-------------\ |
|                   |      Copy and Forward        |             | |
|                   |---------(1)----------+       | Disk Buffer | |
|                   |                      |       |             | |
|                   |                     (3)      \-------------/ |
|                   |                 speculative      ^ |
|                   |                write through    (2) |
|                   |                      |           | |
|                   V                      V           | |
|            +--------------+           +----------------+ |
|            | Primary Disk |           | Secondary Disk | |
|            +--------------+           +----------------+ |
| |
| 1) Primary write requests will be copied and forwarded to Secondary |
| QEMU. |
| 2) Before Primary write requests are written to Secondary disk, the |
| original sector content will be read from Secondary disk and |
| buffered in the Disk buffer, but it will not overwrite the existing |
| sector content (it could be from either "Secondary Write Requests" or |
| previous COW of "Primary Write Requests") in the Disk buffer. |
| 3) Primary write requests will be written to Secondary disk. |
| 4) Secondary write requests will be buffered in the Disk buffer and |
| will overwrite any existing content for the same sector in the buffer. |
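The four steps above can be modelled with a small in-memory sketch. This is an illustration only, not QEMU code: the class SecondaryDiskModel and its method names are hypothetical, sectors are plain dictionary keys, and the read path (buffer first, then disk) and the checkpoint() method are assumptions drawn from the Background and Architecture sections of this document.

```python
class SecondaryDiskModel:
    """Toy model of the workflow: a secondary disk plus a disk buffer."""

    def __init__(self, sectors):
        self.disk = dict(sectors)  # secondary disk: sector -> content
        self.buffer = {}           # disk buffer: sector -> content

    def primary_write(self, sector, data):
        # Step 2: save the original content first, unless the buffer
        # already holds this sector (from a secondary write or from an
        # earlier copy-on-write of a primary write).
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector)
        # Step 3: speculative write-through to the secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: secondary writes always overwrite buffered content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the secondary VM sees buffered content first and
        # falls back to the secondary disk otherwise.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector)

    def checkpoint(self):
        # Assumption: at a checkpoint the buffer is dropped, so the
        # secondary VM sees what the primary wrote.
        self.buffer.clear()


model = SecondaryDiskModel({0: "A", 1: "B"})
model.primary_write(0, "P1")    # original "A" is saved in the buffer
model.secondary_write(1, "S1")  # buffered, secondary disk untouched
model.primary_write(1, "P2")    # buffer keeps "S1", disk gets "P2"
```

In this model the secondary VM keeps reading its own view ("A" and "S1") until checkpoint() is called, after which it reads the primary's data from the secondary disk.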
| |
| == Architecture == |
| We are going to implement block replication from basic building |
| blocks that already exist in QEMU. |
| |
|          virtio-blk       || |
|              ^            ||                            .---------- |
|              |            ||                            | Secondary |
|         1 Quorum          ||                            '---------- |
|          /      \         ||                                                          virtio-blk |
|         /        \        ||                                                              ^ |
|    Primary    2 filter                                                                    | |
|      disk         ^                                                                    7 Quorum |
|                   |                                                                    / |
|                 3 NBD  ------->  3 NBD                                                / |
|                 client    ||     server                                          2 filter |
|                           ||        ^                                                ^ |
| --------.                 ||        |                                                | |
| Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4 |
| --------'                 ||        |         backing         ^        backing |
|                           ||        |                         | |
|                           ||        |                         | |
|                           ||        '-------------------------' |
|                           ||           blockdev-backup sync=none 6 |
| |
| 1) The disk on the primary is represented by a block device with two |
| children, providing replication between a primary disk and the host that |
| runs the secondary VM. The read pattern (fifo) for quorum can be extended |
| to make the primary always read from the local disk instead of going through |
| NBD. |
| |
| 2) A new block filter, named replication, controls block |
| replication. |
| |
| 3) The secondary disk receives writes from the primary VM through QEMU's |
| embedded NBD server (speculative write-through). |
| |
| 4) The disk on the secondary is represented by a custom block device |
| (called active-disk). It should start as an empty disk, and its format |
| should support bdrv_make_empty() and backing files. |
| |
| 5) The hidden-disk is created automatically. It buffers the original content |
| that is modified by the primary VM. It should also start as an empty disk, |
| and its driver should support bdrv_make_empty() and backing files. |
| |
| 6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer |
| any state that would otherwise be lost by the speculative write-through |
| of the NBD server into the secondary disk. For this to work, the primary |
| disk and secondary disk must contain the same data before block |
| replication starts. |
| |
| 7) The secondary also has a quorum node, so after secondary failover it |
| can become the new primary and continue replication. |
| |
| |
| == Failure Handling == |
| Seven internal errors can occur while block replication is running: |
| 1. I/O error on primary disk |
| 2. Forwarding primary write requests failed |
| 3. Backup failed |
| 4. I/O error on secondary disk |
| 5. I/O error on active disk |
| 6. Making active disk or hidden disk empty failed |
| 7. Doing failover failed |
| In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3, |
| 4 and 6, we just report block replication's error to the FT/HA manager |
| (which decides when to do a new checkpoint and when to do failover). |
| In case 7, if active commit failed, we use the replication failover failed |
| state in the Secondary's write operation (which decides which target to |
| write to). |
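The routing of the seven error cases can be summarised as a lookup. The sketch below is only a restatement of the paragraph above; the enum ReplicationError, its member names, and the function error_destination() are hypothetical and do not exist in QEMU.

```python
from enum import Enum


class ReplicationError(Enum):
    # Member names are hypothetical; values follow the numbered list.
    PRIMARY_DISK_IO = 1
    FORWARD_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6
    FAILOVER_FAILED = 7


def error_destination(err):
    """Return where an error is reported, per the text above."""
    if err in (ReplicationError.PRIMARY_DISK_IO,
               ReplicationError.ACTIVE_DISK_IO):
        return "disk layer"
    if err is ReplicationError.FAILOVER_FAILED:
        # Recorded as a state that the Secondary's write path consults.
        return "failover failed state"
    return "FT/HA manager"


for err in ReplicationError:
    print(err.value, err.name, "->", error_destination(err))
```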
| |
| == New block driver interface == |
| We add four block driver interfaces to control block replication: |
| a. replication_start_all() |
| Start block replication, called in migration/checkpoint thread. |
| We must call replication_start_all() in secondary QEMU before |
| calling replication_start_all() in primary QEMU. The caller |
| must hold the I/O mutex lock if it is in migration/checkpoint |
| thread. |
| b. replication_do_checkpoint_all() |
| This interface is called after all VM state is transferred to |
| Secondary QEMU. The Disk buffer will be dropped in this interface. |
| The caller must hold the I/O mutex lock if it is in migration/checkpoint |
| thread. |
| c. replication_get_error_all() |
| This interface is called to check whether an error has occurred in |
| replication. |
| The caller must hold the I/O mutex lock if it is in migration/checkpoint |
| thread. |
| d. replication_stop_all() |
| It is called on failover. We will flush the Disk buffer into |
| Secondary Disk and stop block replication. If this API is used for |
| anything other than failover, such as shutting down the guest, the VM |
| should be stopped before calling it. The caller must hold the I/O mutex |
| lock if it is in migration/checkpoint thread. |
| |
| == Usage == |
| Primary: |
| -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\ |
| children.0.file.filename=1.raw,\ |
| children.0.driver=raw |
| |
| Run the following QMP commands in the primary QEMU: |
| { "execute": "human-monitor-command", |
| "arguments": { |
| "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1" |
| } |
| } |
| { "execute": "x-blockdev-change", |
| "arguments": { |
| "parent": "colo1", |
| "node": "nbd_client1" |
| } |
| } |
| Note: |
| 1. There should be only one NBD Client for each primary disk. |
| 2. host is the secondary physical machine's hostname or IP address. |
| 3. Each disk must have its own export name. |
| 4. It is all a single argument to -drive and you should ignore the |
| leading whitespace. |
| 5. The QMP commands must be run after the QMP commands have been run in |
| the secondary QEMU. |
| 6. After primary failover we need to remove children.1 (replication driver). |
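The two primary-side QMP commands above can be generated from a host and port. The following is a minimal sketch that only builds the JSON and does not talk to QEMU; the function name primary_qmp_commands() is hypothetical, and the address 192.0.2.10 and port 8889 are placeholder values, not taken from this document.

```python
import json


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Build the drive_add and x-blockdev-change commands shown above."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Attach the new replication node as a child of the quorum.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


# Placeholder host and port; substitute the secondary's real address.
commands = primary_qmp_commands("192.0.2.10", 8889)
for cmd in commands:
    print(json.dumps(cmd))
```

Keeping the export name and node name as parameters makes it harder to violate notes 1 and 3 when several disks are replicated.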
| |
| Secondary: |
| -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \ |
| -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=childs1,\ |
| file.file.filename=active_disk.qcow2,\ |
| file.driver=qcow2,\ |
| file.backing.file.filename=hidden_disk.qcow2,\ |
| file.backing.driver=qcow2,\ |
| file.backing.backing=colo1 \ |
| -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\ |
| vote-threshold=1,children.0=childs1 |
| |
| Then run the following QMP commands in the secondary QEMU: |
| { "execute": "nbd-server-start", |
| "arguments": { |
| "addr": { |
| "type": "inet", |
| "data": { |
| "host": "xxx", |
| "port": "xxx" |
| } |
| } |
| } |
| } |
| { "execute": "nbd-server-add", |
| "arguments": { |
| "device": "colo1", |
| "writable": true |
| } |
| } |
| |
| Note: |
| 1. The export name in secondary QEMU command line is the secondary |
| disk's id. |
| 2. The export name for the same disk must be the same on the primary |
| and the secondary. |
| 3. The QMP commands nbd-server-start and nbd-server-add must be run |
| before running the QMP command migrate on primary QEMU. |
| 4. The active disk, hidden disk and NBD target should all have the |
| same length. |
| 5. It is better to put the active disk and hidden disk on a ramdisk. |
| 6. It is all a single argument to -drive, and you should ignore |
| the leading whitespace. |
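The secondary-side QMP commands can be built the same way. This is a hedged sketch that only constructs the JSON shown above; the function name secondary_qmp_commands() is hypothetical, and the listen address 0.0.0.0 and port 8889 are placeholder values. The port is passed as a string, as in the example above.

```python
import json


def secondary_qmp_commands(host, port, device="colo1"):
    """Build nbd-server-start and nbd-server-add for one secondary disk."""
    return [
        {"execute": "nbd-server-start",
         "arguments": {
             "addr": {
                 "type": "inet",
                 "data": {"host": host, "port": str(port)},
             }
         }},
        # Per note 1, the export is the secondary disk's id.
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


# Placeholder listen address and port.
secondary_commands = secondary_qmp_commands("0.0.0.0", 8889)
for cmd in secondary_commands:
    print(json.dumps(cmd))
```

These commands have to be issued before the primary-side commands and before migrate is run on the primary, as notes 3 and 5 of the two lists require.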
| |
| After Failover: |
| Primary: |
| The secondary host is down, so we should run the following QMP commands |
| to remove the NBD child from the quorum: |
| { "execute": "x-blockdev-change", |
| "arguments": { |
| "parent": "colo1", |
| "child": "children.1" |
| } |
| } |
| { "execute": "human-monitor-command", |
| "arguments": { |
| "command-line": "drive_del xxxx" |
| } |
| } |
| Note: there is currently no QMP command to remove the blockdev. |
| |
| Secondary: |
| The primary host is down, so we should run the following QMP command: |
| { "execute": "nbd-server-stop" } |
| |
| Promote Secondary to Primary: |
| see COLO-FT.txt |
| |
| TODO: |
| 1. Shared disk |