Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to the FT/HA (Fault Tolerance/High Availability)
scenario, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro-checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents of the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network traffic
during a vmstate checkpoint, the disk modification operations of the
Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to the Secondary disk, the
       original sector content will be read from the Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (which could be from either "Secondary Write Requests"
       or a previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests will be written to the Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer and will
       overwrite the existing sector content in the buffer. (A sketch of
       this buffering logic follows the list.)
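
The copy-on-write rules in steps 2) and 4) can be summarised with a small
sketch. The following C fragment is illustrative only: the real logic lives
in QEMU's replication filter and backup job, and the data structures and
helper names below are invented for this example.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_SECTORS 8

    /* Toy model of the Secondary's state: one byte per sector. */
    static uint8_t secondary_disk[NUM_SECTORS];
    static uint8_t disk_buffer[NUM_SECTORS];
    static bool    buffered[NUM_SECTORS];   /* Disk buffer entry present? */

    /* Steps 2) and 3): a forwarded Primary write.  The old sector content
     * is copied into the Disk buffer first, but only if the buffer does
     * not already hold an entry for that sector; then the new data is
     * written through to the Secondary disk. */
    static void primary_write(int sector, uint8_t data)
    {
        if (!buffered[sector]) {
            disk_buffer[sector] = secondary_disk[sector];
            buffered[sector] = true;
        }
        secondary_disk[sector] = data;
    }

    /* Step 4): a Secondary write always goes to the Disk buffer and
     * overwrites whatever entry is already there. */
    static void secondary_write(int sector, uint8_t data)
    {
        disk_buffer[sector] = data;
        buffered[sector] = true;
    }

    /* The Secondary VM reads from the Disk buffer when it has an entry,
     * otherwise from the Secondary disk. */
    static uint8_t secondary_read(int sector)
    {
        return buffered[sector] ? disk_buffer[sector] : secondary_disk[sector];
    }

    /* At checkpoint time the Disk buffer is simply dropped, so the
     * Secondary's view reverts to the replicated Primary contents. */
    static void checkpoint(void)
    {
        memset(buffered, 0, sizeof(buffered));
    }

    int main(void)
    {
        primary_write(0, 1);    /* forwarded Primary write              */
        secondary_write(0, 2);  /* Secondary VM's own speculative write */
        checkpoint();           /* buffer dropped at the checkpoint     */
        return secondary_read(0) == 1 ? 0 : 1;  /* view matches Primary */
    }

In QEMU the Disk buffer is realised by the active and hidden disks described
in the Architecture section below, and "dropping" it means emptying those
images.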

== Architecture ==
We implement block replication from basic building blocks that already
exist in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||                                                       virtio-blk
        /        \        ||                                                           ^
   Primary    2 filter                                                                 |
     disk         ^                                                                7 Quorum
                  |                                                                 /
                3 NBD  ------->  3 NBD                                             /
                client    ||     server                                       2 filter
                          ||        ^                                             ^
--------.                 ||        |                                             |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||         blockdev-backup sync=none 6

1) The disk on the primary is represented by a quorum block device with
two children, providing replication between the primary disk and the host
that runs the secondary VM. The read pattern (fifo) for quorum can be
extended to make the primary always read from the local disk instead of
going through NBD.

2) A new block filter (named "replication") controls the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original
content that is modified by the primary VM. It should also start as an
empty disk, and its driver should support bdrv_make_empty() and backing
files.
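
If these images are not created for you by the management software, they can
be created manually with qemu-img. This is only a sketch: the 10G size is a
placeholder and must match the length of the NBD target (see the notes under
Usage below).

    qemu-img create -f qcow2 active_disk.qcow2 10G
    qemu-img create -f qcow2 hidden_disk.qcow2 10G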

6) The blockdev-backup job (sync=none) is run to allow the hidden-disk to
buffer any state that would otherwise be lost by the speculative
write-through of the NBD server into the secondary disk. For this to work,
the primary disk and secondary disk must contain the same data before
block replication starts.
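
This job is set up internally by the replication driver, so users do not run
it themselves. For illustration only, it corresponds roughly to the following
QMP command, where the node names are hypothetical:

    { "execute": "blockdev-backup",
      "arguments": {
          "device": "secondary-disk0",
          "target": "hidden-disk0",
          "sync": "none"
      }
    }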

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.


== Failure Handling ==
Seven internal errors can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we report the block replication error to the FT/HA manager (which
decides when to take a new checkpoint and when to do failover).
In case 7, if the active commit failed, we use the replication failover
failed state in the Secondary's write operation (which decides which target
to write to).

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in the migration/checkpoint thread.
   We must call replication_start_all() in the secondary QEMU before
   calling replication_start_all() in the primary QEMU. The caller
   must hold the I/O mutex lock if it is in the migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to the
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
c. replication_get_error_all()
   This interface is called to check whether an error has happened in
   replication. The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into the
   Secondary disk and stop block replication. If you use this API for
   anything other than failover (for example, to shut down the guest),
   the VM should be stopped before calling it. The caller must hold the
   I/O mutex lock if it is in the migration/checkpoint thread.
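
For illustration, a checkpoint/failover control flow built on these
interfaces could look like the sketch below. The stub signatures are
simplified assumptions made for this example (the real interfaces report
errors through QEMU's usual Error ** mechanism); only the calling order is
taken from the descriptions above.

    #include <stdio.h>
    #include <stdbool.h>

    /* Stand-ins for QEMU's replication_*_all() interfaces. */
    typedef enum {
        REPLICATION_MODE_PRIMARY,
        REPLICATION_MODE_SECONDARY,
    } ReplicationMode;

    static void replication_start_all(ReplicationMode mode)
    {
        printf("start replication as %s\n",
               mode == REPLICATION_MODE_PRIMARY ? "primary" : "secondary");
    }
    static void replication_do_checkpoint_all(void) { printf("drop Disk buffer\n"); }
    static int  replication_get_error_all(void)     { return 0; /* 0 = no error */ }
    static void replication_stop_all(bool failover) { printf("stop, failover=%d\n", failover); }

    int main(void)
    {
        /* replication_start_all() must run in the secondary QEMU before it
         * runs in the primary QEMU; both calls appear in one process here
         * purely for illustration. */
        replication_start_all(REPLICATION_MODE_SECONDARY);
        replication_start_all(REPLICATION_MODE_PRIMARY);

        /* After each vmstate transfer, drop the Disk buffer. */
        replication_do_checkpoint_all();

        /* The FT/HA manager checks for replication errors and decides
         * whether to take a new checkpoint or to fail over. */
        if (replication_get_error_all() != 0) {
            /* On failover, flush the Disk buffer and stop replication. */
            replication_stop_all(true);
        }
        return 0;
    }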

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following QMP commands in the primary QEMU:
    { "execute": "human-monitor-command",
      "arguments": {
          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
          "parent": "colo1",
          "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the secondary physical machine's hostname or IP address.
  3. Each disk must have its own export name.
  4. The -drive option above is a single argument; the leading whitespace
     on the continuation lines is only for readability.
  5. The QMP commands above must be run after the corresponding QMP
     commands have been run in the secondary QEMU.
  6. After primary failover we need to remove children.1 (the replication
     driver); see "After Failover" below.

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following QMP commands in the secondary QEMU:
    { "execute": "nbd-server-start",
      "arguments": {
          "addr": {
              "type": "inet",
              "data": {
                  "host": "xxx",
                  "port": "xxx"
              }
          }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
          "device": "colo1",
          "writable": true
      }
    }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on the primary
     and the secondary.
  3. The QMP commands nbd-server-start and nbd-server-add must be run
     before the migrate QMP command is run on the primary QEMU.
  4. The active disk, hidden disk and NBD target must all have the same
     length.
  5. It is better to put the active disk and the hidden disk on a ramdisk.
  6. Each -drive option above is a single argument; the leading whitespace
     on the continuation lines is only for readability.

After Failover:
Primary:
  The secondary host is down, so we should run the following QMP commands
  to remove the NBD child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
        "parent": "colo1",
        "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
        "command-line": "drive_del xxxx"
    }
  }
  Note: there is currently no QMP command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following QMP command:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk