| COarse-grained LOck-stepping Virtual Machines for Non-stop Service |
| ---------------------------------------- |
| Copyright (c) 2016 Intel Corporation |
| Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD. |
| Copyright (c) 2016 Fujitsu, Corp. |
| |
| This work is licensed under the terms of the GNU GPL, version 2 or later. |
| See the COPYING file in the top-level directory. |
| |
| This document gives an overview of COLO's design and how to use it. |
| |
| == Background == |
| Virtual machine (VM) replication is a well known technique for providing |
| application-agnostic software-implemented hardware fault tolerance, |
| also known as "non-stop service". |
| |
| COLO (COarse-grained LOck-stepping) is a high availability solution. |
| Both primary VM (PVM) and secondary VM (SVM) run in parallel. They receive the |
| same request from client, and generate response in parallel too. |
| If the response packets from PVM and SVM are identical, they are released |
| immediately. Otherwise, a VM checkpoint (on demand) is conducted. |
| |
| == Architecture == |
| |
| The architecture of COLO is shown in the diagram below. |
| It consists of a pair of networked physical nodes: |
| The primary node running the PVM, and the secondary node running the SVM |
| to maintain a valid replica of the PVM. |
| PVM and SVM execute in parallel and generate output of response packets for |
| client requests according to the application semantics. |
| |
| The incoming packets from the client or external network are received by the |
| primary node, and then forwarded to the secondary node, so that both the PVM |
| and the SVM are stimulated with the same requests. |
| |
| COLO receives the outbound packets from both the PVM and SVM and compares them |
| before allowing the output to be sent to clients. |
| |
| The SVM is qualified as a valid replica of the PVM, as long as it generates |
| identical responses to all client requests. Once the differences in the outputs |
| are detected between the PVM and SVM, COLO withholds transmission of the |
| outbound packets until it has successfully synchronized the PVM state to the SVM. |
| |
| Primary Node Secondary Node |
| +------------+ +-----------------------+ +------------------------+ +------------+ |
| | | | HeartBeat +<----->+ HeartBeat | | | |
| | Primary VM | +-----------+-----------+ +-----------+------------+ |Secondary VM| |
| | | | | | | |
| | | +-----------|-----------+ +-----------|------------+ | | |
| | | |QEMU +---v----+ | |QEMU +----v---+ | | | |
| | | | |Failover| | | |Failover| | | | |
| | | | +--------+ | | +--------+ | | | |
| | | | +---------------+ | | +---------------+ | | | |
| | | | | VM Checkpoint +-------------->+ VM Checkpoint | | | | |
| | | | +---------------+ | | +---------------+ | | | |
| |Requests<--------------------------\ /-----------------\ /--------------------->Requests| |
| | | | ^ ^ | | | | | | | |
| |Responses+---------------------\ /-|-|------------\ /-------------------------+Responses| |
| | | | | | | | | | | | | | | | | |
| | | | +-----------+ | | | | | | | | | | +----------+ | | | |
| | | | | COLO disk | | | | | | | | | | | | COLO disk| | | | |
| | | | | Manager +---------------------------->| Manager | | | | |
| | | | ++----------+ v v | | | | | v v | +---------++ | | | |
| | | | |+-----------+-+-+-++| | ++-+--+-+---------+ | | | | |
| | | | || COLO Proxy || | | COLO Proxy | | | | | |
| | | | || (compare packet || | |(adjust sequence | | | | | |
| | | | ||and mirror packet)|| | | and ACK) | | | | | |
| | | | |+------------+---+-+| | +-----------------+ | | | | |
| +------------+ +-----------------------+ +------------------------+ +------------+ |
| +------------+ | | | | +------------+ |
| | VM Monitor | | | | | | VM Monitor | |
| +------------+ | | | | +------------+ |
| +---------------------------------------+ +----------------------------------------+ |
| | Kernel | | | | | Kernel | | |
| +---------------------------------------+ +----------------------------------------+ |
| | | | | |
| +--------------v+ +---------v---+--+ +------------------+ +v-------------+ |
| | Storage | |External Network| | External Network | | Storage | |
| +---------------+ +----------------+ +------------------+ +--------------+ |
| |
| |
| == Components introduction == |
| |
| You can see there are several components in COLO's diagram of architecture. |
| Their functions are described below. |
| |
| HeartBeat: |
| Runs on both the primary and secondary nodes, to periodically check platform |
| availability. When the primary node suffers a hardware fail-stop failure, |
| the heartbeat stops responding, the secondary node will trigger a failover |
| as soon as it determines the absence. |
| |
| COLO disk Manager: |
| When primary VM writes data into image, the colo disk manager captures this data |
| and sends it to secondary VM's which makes sure the context of secondary VM's |
| image is consistent with the context of primary VM 's image. |
| For more details, please refer to docs/block-replication.txt. |
| |
| Checkpoint/Failover Controller: |
| Modifications of save/restore flow to realize continuous migration, |
| to make sure the state of VM in Secondary side is always consistent with VM in |
| Primary side. |
| |
| COLO Proxy: |
| Delivers packets to Primary and Secondary, and then compare the responses from |
| both side. Then decide whether to start a checkpoint according to some rules. |
| Please refer to docs/colo-proxy.txt for more information. |
| |
| Note: |
| HeartBeat has not been implemented yet, so you need to trigger failover process |
| by using 'x-colo-lost-heartbeat' command. |
| |
| == COLO operation status == |
| |
| +-----------------+ |
| | | |
| | Start COLO | |
| | | |
| +--------+--------+ |
| | |
| | Main qmp command: |
| | migrate-set-capabilities with x-colo |
| | migrate |
| | |
| v |
| +--------+--------+ |
| | | |
| | COLO running | |
| | | |
| +--------+--------+ |
| | |
| | Main qmp command: |
| | x-colo-lost-heartbeat |
| | or |
| | some error happened |
| v |
| +--------+--------+ |
| | | send qmp event: |
| | COLO failover | COLO_EXIT |
| | | |
| +-----------------+ |
| |
| COLO use the qmp command to switch and report operation status. |
| The diagram just shows the main qmp command, you can get the detail |
| in test procedure. |
| |
| == Test procedure == |
| Note: Here we are running both instances on the same host for testing, |
| change the IP Addresses if you want to run it on two hosts. Initially |
| 127.0.0.1 is the Primary Host and 127.0.0.2 is the Secondary Host. |
| |
| == Startup qemu == |
| 1. Primary: |
| Note: Initially, $imagefolder/primary.qcow2 needs to be copied to all hosts. |
| You don't need to change any IP's here, because 0.0.0.0 listens on any |
| interface. The chardev's with 127.0.0.1 IP's loopback to the local qemu |
| instance. |
| |
| # imagefolder="/mnt/vms/colo-test-primary" |
| |
| # qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp stdio \ |
| -device piix3-usb-uhci -device usb-tablet -name primary \ |
| -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \ |
| -device rtl8139,id=e0,netdev=hn0 \ |
| -chardev socket,id=mirror0,host=0.0.0.0,port=9003,server,nowait \ |
| -chardev socket,id=compare1,host=0.0.0.0,port=9004,server,wait \ |
| -chardev socket,id=compare0,host=127.0.0.1,port=9001,server,nowait \ |
| -chardev socket,id=compare0-0,host=127.0.0.1,port=9001 \ |
| -chardev socket,id=compare_out,host=127.0.0.1,port=9005,server,nowait \ |
| -chardev socket,id=compare_out0,host=127.0.0.1,port=9005 \ |
| -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0 \ |
| -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out \ |
| -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0 \ |
| -object iothread,id=iothread1 \ |
| -object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,\ |
| outdev=compare_out0,iothread=iothread1 \ |
| -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\ |
| children.0.file.filename=$imagefolder/primary.qcow2,children.0.driver=qcow2 -S |
| |
| 2. Secondary: |
| Note: Active and hidden images need to be created only once and the |
| size should be the same as primary.qcow2. Again, you don't need to change |
| any IP's here, except for the $primary_ip variable. |
| |
| # imagefolder="/mnt/vms/colo-test-secondary" |
| # primary_ip=127.0.0.1 |
| |
| # qemu-img create -f qcow2 $imagefolder/secondary-active.qcow2 10G |
| |
| # qemu-img create -f qcow2 $imagefolder/secondary-hidden.qcow2 10G |
| |
| # qemu-system-x86_64 -enable-kvm -cpu qemu64,+kvmclock -m 512 -smp 1 -qmp stdio \ |
| -device piix3-usb-uhci -device usb-tablet -name secondary \ |
| -netdev tap,id=hn0,vhost=off,helper=/usr/lib/qemu/qemu-bridge-helper \ |
| -device rtl8139,id=e0,netdev=hn0 \ |
| -chardev socket,id=red0,host=$primary_ip,port=9003,reconnect=1 \ |
| -chardev socket,id=red1,host=$primary_ip,port=9004,reconnect=1 \ |
| -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0 \ |
| -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1 \ |
| -object filter-rewriter,id=rew0,netdev=hn0,queue=all \ |
| -drive if=none,id=parent0,file.filename=$imagefolder/primary.qcow2,driver=qcow2 \ |
| -drive if=none,id=childs0,driver=replication,mode=secondary,file.driver=qcow2,\ |
| top-id=colo-disk0,file.file.filename=$imagefolder/secondary-active.qcow2,\ |
| file.backing.driver=qcow2,file.backing.file.filename=$imagefolder/secondary-hidden.qcow2,\ |
| file.backing.backing=parent0 \ |
| -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\ |
| children.0=childs0 \ |
| -incoming tcp:0.0.0.0:9998 |
| |
| |
| 3. On Secondary VM's QEMU monitor, issue command |
| {'execute':'qmp_capabilities'} |
| {'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': {'host': '0.0.0.0', 'port': '9999'} } } } |
| {'execute': 'nbd-server-add', 'arguments': {'device': 'parent0', 'writable': true } } |
| |
| Note: |
| a. The qmp command nbd-server-start and nbd-server-add must be run |
| before running the qmp command migrate on primary QEMU |
| b. Active disk, hidden disk and nbd target's length should be the |
| same. |
| c. It is better to put active disk and hidden disk in ramdisk. They |
| will be merged into the parent disk on failover. |
| |
| 4. On Primary VM's QEMU monitor, issue command: |
| {'execute':'qmp_capabilities'} |
| {'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0'}} |
| {'execute': 'x-blockdev-change', 'arguments':{'parent': 'colo-disk0', 'node': 'replication0' } } |
| {'execute': 'migrate-set-capabilities', 'arguments': {'capabilities': [ {'capability': 'x-colo', 'state': true } ] } } |
| {'execute': 'migrate', 'arguments': {'uri': 'tcp:127.0.0.2:9998' } } |
| |
| Note: |
| a. There should be only one NBD Client for each primary disk. |
| b. The qmp command line must be run after running qmp command line in |
| secondary qemu. |
| |
| 5. After the above steps, you will see, whenever you make changes to PVM, SVM will be synced. |
| You can issue command '{ "execute": "migrate-set-parameters" , "arguments":{ "x-checkpoint-delay": 2000 } }' |
| to change the idle checkpoint period time |
| |
| 6. Failover test |
| You can kill one of the VMs and Failover on the surviving VM: |
| |
| If you killed the Secondary, then follow "Primary Failover". After that, |
| if you want to resume the replication, follow "Primary resume replication" |
| |
| If you killed the Primary, then follow "Secondary Failover". After that, |
| if you want to resume the replication, follow "Secondary resume replication" |
| |
| == Primary Failover == |
| The Secondary died, resume on the Primary |
| |
| {'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'child': 'children.1'} } |
| {'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_del replication0' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'comp0' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'iothread1' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'm0' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'redire0' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'redire1' } } |
| {'execute': 'x-colo-lost-heartbeat' } |
| |
| == Secondary Failover == |
| The Primary died, resume on the Secondary and prepare to become the new Primary |
| |
| {'execute': 'nbd-server-stop'} |
| {'execute': 'x-colo-lost-heartbeat'} |
| |
| {'execute': 'object-del', 'arguments':{ 'id': 'f2' } } |
| {'execute': 'object-del', 'arguments':{ 'id': 'f1' } } |
| {'execute': 'chardev-remove', 'arguments':{ 'id': 'red1' } } |
| {'execute': 'chardev-remove', 'arguments':{ 'id': 'red0' } } |
| |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'mirror0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '0.0.0.0', 'port': '9003' } }, 'server': true } } } } |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'compare1', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '0.0.0.0', 'port': '9004' } }, 'server': true } } } } |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'compare0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9001' } }, 'server': true } } } } |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'compare0-0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9001' } }, 'server': false } } } } |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'compare_out', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9005' } }, 'server': true } } } } |
| {'execute': 'chardev-add', 'arguments':{ 'id': 'compare_out0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9005' } }, 'server': false } } } } |
| |
| == Primary resume replication == |
| Resume replication after new Secondary is up. |
| |
| Start the new Secondary (Steps 2 and 3 above), then on the Primary: |
| {'execute': 'drive-mirror', 'arguments':{ 'device': 'colo-disk0', 'job-id': 'resync', 'target': 'nbd://127.0.0.2:9999/parent0', 'mode': 'existing', 'format': 'raw', 'sync': 'full'} } |
| |
| Wait until disk is synced, then: |
| {'execute': 'stop'} |
| {'execute': 'block-job-cancel', 'arguments':{ 'device': 'resync'} } |
| |
| {'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0'}} |
| {'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'node': 'replication0' } } |
| |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-mirror', 'id': 'm0', 'props': { 'netdev': 'hn0', 'queue': 'tx', 'outdev': 'mirror0' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire0', 'props': { 'netdev': 'hn0', 'queue': 'rx', 'indev': 'compare_out' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire1', 'props': { 'netdev': 'hn0', 'queue': 'rx', 'outdev': 'compare0' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'iothread', 'id': 'iothread1' } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'colo-compare', 'id': 'comp0', 'props': { 'primary_in': 'compare0-0', 'secondary_in': 'compare1', 'outdev': 'compare_out0', 'iothread': 'iothread1' } } } |
| |
| {'execute': 'migrate-set-capabilities', 'arguments':{ 'capabilities': [ {'capability': 'x-colo', 'state': true } ] } } |
| {'execute': 'migrate', 'arguments':{ 'uri': 'tcp:127.0.0.2:9998' } } |
| |
| Note: |
| If this Primary previously was a Secondary, then we need to insert the |
| filters before the filter-rewriter by using the |
| "'insert': 'before', 'position': 'id=rew0'" Options. See below. |
| |
| == Secondary resume replication == |
| Become Primary and resume replication after new Secondary is up. Note |
| that now 127.0.0.1 is the Secondary and 127.0.0.2 is the Primary. |
| |
| Start the new Secondary (Steps 2 and 3 above, but with primary_ip=127.0.0.2), |
| then on the old Secondary: |
| {'execute': 'drive-mirror', 'arguments':{ 'device': 'colo-disk0', 'job-id': 'resync', 'target': 'nbd://127.0.0.1:9999/parent0', 'mode': 'existing', 'format': 'raw', 'sync': 'full'} } |
| |
| Wait until disk is synced, then: |
| {'execute': 'stop'} |
| {'execute': 'block-job-cancel', 'arguments':{ 'device': 'resync' } } |
| |
| {'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.1,file.port=9999,file.export=parent0,node-name=replication0'}} |
| {'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'node': 'replication0' } } |
| |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-mirror', 'id': 'm0', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'tx', 'outdev': 'mirror0' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire0', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'rx', 'indev': 'compare_out' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire1', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'rx', 'outdev': 'compare0' } } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'iothread', 'id': 'iothread1' } } |
| {'execute': 'object-add', 'arguments':{ 'qom-type': 'colo-compare', 'id': 'comp0', 'props': { 'primary_in': 'compare0-0', 'secondary_in': 'compare1', 'outdev': 'compare_out0', 'iothread': 'iothread1' } } } |
| |
| {'execute': 'migrate-set-capabilities', 'arguments':{ 'capabilities': [ {'capability': 'x-colo', 'state': true } ] } } |
| {'execute': 'migrate', 'arguments':{ 'uri': 'tcp:127.0.0.1:9998' } } |
| |
| == TODO == |
| 1. Support shared storage. |
| 2. Develop the heartbeat part. |
| 3. Reduce checkpoint VM’s downtime while doing checkpoint. |