| ======== |
| Postcopy |
| ======== |
| |
| .. contents:: |
| |
'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and on the time it takes; the downside is that
during the postcopy phase, a failure of *either* side causes the guest to be lost.
| |
| In postcopy the destination CPUs are started before all the memory has been |
| transferred, and accesses to pages that are yet to be transferred cause |
| a fault that's translated by QEMU into a request to the source QEMU. |
| |
| Postcopy can be combined with precopy (i.e. normal migration) so that if precopy |
| doesn't finish in a given time the switch is made to postcopy. |
| |
| Enabling postcopy |
| ================= |
| |
| To enable postcopy, issue this command on the monitor (both source and |
| destination) prior to the start of migration: |
| |
| ``migrate_set_capability postcopy-ram on`` |
| |
| The normal commands are then used to start a migration, which is still |
| started in precopy mode. Issuing: |
| |
| ``migrate_start_postcopy`` |
| |
| will now cause the transition from precopy to postcopy. |
| It can be issued immediately after migration is started or any |
| time later on. Issuing it after the end of a migration is harmless. |
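
The same can be done through QMP. A sketch of the equivalent commands (the
destination address and port below are only examples)::

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "postcopy-ram", "state": true } ] } }

  { "execute": "migrate",
    "arguments": { "uri": "tcp:destination.example.com:4444" } }

  { "execute": "migrate-start-postcopy" }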
| |
Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
That metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side. To enable postcopy blocktime calculation, enter the following
command on the destination monitor:
| |
| ``migrate_set_capability postcopy-blocktime on`` |
| |
Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value shows the overlapped blocking time for
all vCPUs, while ``postcopy-vcpu-blocktime`` shows the list of blocking
times per vCPU.
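
For example (an illustrative fragment of the ``query-migrate`` reply; the
values and the exact set of fields depend on the QEMU version and the
migration state)::

  { "execute": "query-migrate" }
  { "return": {
      "status": "completed",
      "postcopy-blocktime": 3,
      "postcopy-vcpu-blocktime": [ 3, 1 ],
      ...
    } }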
| |
| .. note:: |
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).
| |
| Postcopy internals |
| ================== |
| |
| State machine |
| ------------- |
| |
| Postcopy moves through a series of states (see postcopy_state) from |
| ADVISE->DISCARD->LISTEN->RUNNING->END |
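
Internally the destination tracks this as a simple enumeration; a sketch of
its shape is below (the exact names live in the migration code, see
``postcopy_state``, and may differ between QEMU versions)::

  typedef enum {
      POSTCOPY_INCOMING_NONE = 0, /* postcopy not in use */
      POSTCOPY_INCOMING_ADVISE,   /* OS support checked, RAM prepared */
      POSTCOPY_INCOMING_DISCARD,  /* discard commands being processed */
      POSTCOPY_INCOMING_LISTEN,   /* listen thread receives incoming pages */
      POSTCOPY_INCOMING_RUNNING,  /* destination CPUs are running */
      POSTCOPY_INCOMING_END       /* cleanup done, migration complete */
  } PostcopyState;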
| |
| - Advise |
| |
| Set at the start of migration if postcopy is enabled, even |
| if it hasn't had the start command; here the destination |
| checks that its OS has the support needed for postcopy, and performs |
| setup to ensure the RAM mappings are suitable for later postcopy. |
| The destination will fail early in migration at this point if the |
| required OS support is not present. |
| (Triggered by reception of POSTCOPY_ADVISE command) |
| |
| - Discard |
| |
| Entered on receipt of the first 'discard' command; prior to |
| the first Discard being performed, hugepages are switched off |
| (using madvise) to ensure that no new huge pages are created |
| during the postcopy phase, and to cause any huge pages that |
| have discards on them to be broken. |
| |
| - Listen |
| |
| The first command in the package, POSTCOPY_LISTEN, switches |
| the destination state to Listen, and starts a new thread |
| (the 'listen thread') which takes over the job of receiving |
| pages off the migration stream, while the main thread carries |
| on processing the blob. With this thread able to process page |
| reception, the destination now 'sensitises' the RAM to detect |
| any access to missing pages (on Linux using the 'userfault' |
| system). |
| |
| - Running |
| |
| POSTCOPY_RUN causes the destination to synchronise all |
| state and start the CPUs and IO devices running. The main |
| thread now finishes processing the migration package and |
| now carries on as it would for normal precopy migration |
| (although it can't do the cleanup it would do as it |
| finishes a normal migration). |
| |
 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
| |
| Device transfer |
| --------------- |
| |
Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up. This is
achieved by 'packaging' the device data into a blob that's read in one go.
| |
| Source behaviour |
| ---------------- |
| |
| Until postcopy is entered the migration stream is identical to normal |
| precopy, except for the addition of a 'postcopy advise' command at |
| the beginning, to tell the destination that postcopy might happen. |
| When postcopy starts the source sends the page discard data and then |
| forms the 'package' containing: |
| |
| - Command: 'postcopy listen' |
| - The device state |
| |
  A series of sections, identical to the precopy stream's device state stream,
  containing everything except postcopiable devices (i.e. RAM)
| - Command: 'postcopy run' |
| |
| The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the |
| contents are formatted in the same way as the main migration stream. |
| |
| During postcopy the source scans the list of dirty pages and sends them |
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
| scanning restarts from the requested location. This causes requested pages |
| to be sent quickly, and also causes pages directly after the requested page |
| to be sent quickly in the hope that those pages are likely to be used |
| by the destination soon. |
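
A toy illustration of that scan policy (not QEMU's actual code; it just
operates on a small dirty bitmap)::

  #include <stdbool.h>
  #include <stddef.h>

  #define NR_PAGES 16
  static bool dirty[NR_PAGES];

  /*
   * Pick the next page to send.  'requested' is a page the destination has
   * explicitly faulted on (or -1 if none); the scan cursor jumps there, so
   * the requested page and the pages following it are sent first.
   */
  static size_t pick_next_page(size_t scan_pos, long requested)
  {
      if (requested >= 0) {
          scan_pos = (size_t)requested;
      }
      for (size_t i = 0; i < NR_PAGES; i++) {
          size_t p = (scan_pos + i) % NR_PAGES;
          if (dirty[p]) {
              dirty[p] = false;   /* dirty bit cleared as the page is sent */
              return p;
          }
      }
      return NR_PAGES;            /* nothing left to send */
  }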
| |
| Destination behaviour |
| --------------------- |
| |
| Initially the destination looks the same as precopy, with a single thread |
| reading the migration stream; the 'postcopy advise' and 'discard' commands |
| are processed to change the way RAM is managed, but don't affect the stream |
| processing. |
| |
| :: |
| |
| ------------------------------------------------------------------------------ |
| 1 2 3 4 5 6 7 |
| main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN ) |
| thread | | |
| | (page request) |
| | \___ |
| v \ |
| listen thread: --- page -- page -- page -- page -- page -- |
| |
| a b c |
| ------------------------------------------------------------------------------ |
| |
| - On receipt of ``CMD_PACKAGED`` (1) |
| |
| All the data associated with the package - the ( ... ) section in the diagram - |
| is read into memory, and the main thread recurses into qemu_loadvm_state_main |
| to process the contents of the package (2) which contains commands (3,6) and |
| devices (4...) |
| |
- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.
| |
| - The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6) |
| |
| letting the destination CPUs start running. At the end of the |
| ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and |
| is no longer used by migration, while the listen thread carries on servicing |
| page data until the end of migration. |
| |
| Source side page bitmap |
| ----------------------- |
| |
The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.
| |
| Postcopy features |
| ================= |
| |
| Postcopy recovery |
| ----------------- |
| |
Compared to precopy, postcopy is special in its error handling. When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail the migration because VM data resides in both the source and destination
QEMU instances. Instead, when an issue happens, QEMU on both sides will go
into a paused state, and a recovery phase is needed to continue the paused
postcopy migration.
| |
| The recovery phase normally contains a few steps: |
| |
- When a network issue occurs, both QEMU instances will go into the
  **POSTCOPY_PAUSED** migration state.
| |
- When the network is recovered (or a new network is provided), the admin
  can set up the new channel for migration using the QMP command
  'migrate-recover' on the destination node, preparing for a resume (see
  the example below).
| |
- On the source host, the admin can continue the interrupted postcopy
  migration using the QMP command 'migrate' with the resume=true flag set.
  The source QEMU will go into the **POSTCOPY_RECOVER_SETUP** state, trying
  to re-establish the channels.
| |
- When both sides of QEMU successfully reconnect using a new or fixed-up
  channel, they will go into the **POSTCOPY_RECOVER** state, where a
  handshake procedure is used to properly synchronize the VM state between
  the two QEMUs so that the postcopy migration can continue.  For example,
  pages may have been sent right during the window when the network was
  interrupted; the handshake guarantees that pages lost in-flight will be
  resent.
| |
- After a proper handshake synchronization, QEMU on both sides will go back
  to the **POSTCOPY_ACTIVE** state and the postcopy migration will continue.
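
A sketch of the recovery commands (the hostname and port below are only
examples). On the destination::

  { "execute": "migrate-recover",
    "arguments": { "uri": "tcp:0:5556" } }

Then on the source::

  { "execute": "migrate",
    "arguments": { "uri": "tcp:destination.example.com:5556", "resume": true } }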
| |
During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that were already
migrated to the destination VM before the interruption happened. However, if
any of the missing pages is accessed on the destination VM, the accessing VM
thread will be halted waiting for the page to be migrated, which means it can
remain halted until the recovery is complete.
| |
The impact of accessing missing pages depends on the configuration of the
guest. For example, with async page fault enabled, the guest can logically
schedule out the threads that access missing pages.
| |
| Postcopy with hugepages |
| ----------------------- |
| |
| Postcopy now works with hugetlbfs backed memory: |
| |
a) The Linux kernel on the destination must support userfault on hugepages.
| b) The huge-page configuration on the source and destination VMs must be |
| identical; i.e. RAMBlocks on both sides must use the same page size. |
c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
   RAM if it doesn't have enough hugepages, causing (b) to fail.  Using
   ``-mem-prealloc`` enforces the allocation using hugepages (see the example
   after this list).
| d) Care should be taken with the size of hugepage used; postcopy with 2MB |
| hugepages works well, however 1GB hugepages are likely to be problematic |
| since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link, |
| and until the full page is transferred the destination thread is blocked. |
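
For example, a guest whose RAM is preallocated from 2MB hugepages on both
sides might be started with something like the following (the memory size and
hugepage mount point are illustrative)::

  qemu-system-x86_64 -m 4G -mem-path /dev/hugepages -mem-prealloc ...

The same memory backing must be used on the source and destination so that
the RAMBlock page sizes match, as required by (b).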
| |
| Postcopy with shared memory |
| --------------------------- |
| |
Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the
types of shared memory that userfault can support.
| |
| The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs`` |
| (although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)`` |
| for hugetlbfs which may be a problem in some configurations). |
| |
| The vhost-user code in QEMU supports clients that have Postcopy support, |
| and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes |
| to support postcopy. |
| |
| The client needs to open a userfaultfd and register the areas |
| of memory that it maps with userfault. The client must then pass the |
| userfaultfd back to QEMU together with a mapping table that allows |
fault addresses in the client's address space to be converted back to
| RAMBlock/offsets. The client's userfaultfd is added to the postcopy |
| fault-thread and page requests are made on behalf of the client by QEMU. |
| QEMU performs 'wake' operations on the client's userfaultfd to allow it |
| to continue after a page has arrived. |
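
A minimal sketch of the registration step in such a client, assuming Linux
userfaultfd (error handling omitted; real clients such as the vhost-user
bridge and DPDK wrap this in their own helpers)::

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  /* Open a userfaultfd and register one mapped region for missing-page
   * notifications.  The fd is then passed to QEMU (for vhost-user, over
   * the control channel) together with the address-mapping table. */
  static int register_with_userfault(void *addr, size_t len)
  {
      int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

      struct uffdio_api api = { .api = UFFD_API };
      ioctl(ufd, UFFDIO_API, &api);

      struct uffdio_register reg = {
          .range = { .start = (unsigned long)addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      ioctl(ufd, UFFDIO_REGISTER, &reg);

      return ufd;   /* handed to QEMU, which services the faults */
  }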
| |
| .. note:: |
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived
| |
| Retro-fitting postcopy to existing clients is possible: |
| a) A mechanism is needed for the registration with userfault as above, |
| and the registration needs to be coordinated with the phases of |
| postcopy. In vhost-user extra messages are added to the existing |
| control channel. |
| b) Any thread that can block due to guest memory accesses must be |
| identified and the implication understood; for example if the |
| guest memory access is made while holding a lock then all other |
| threads waiting for that lock will also be blocked. |
| |
| Postcopy preemption mode |
| ------------------------ |
| |
Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
allows urgent pages (those for which the destination QEMU has explicitly
sent a page fault request) to be sent over a separate preempt channel,
rather than being queued in the background migration channel. Anyone who
cares about page fault latencies during a postcopy migration should enable
this feature. By default, it is not enabled.
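
Like the other postcopy features it is a migration capability, set on both
sides before the migration starts, e.g.:

``migrate_set_capability postcopy-preempt on``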