|  | ===================== | 
|  | VFIO device Migration | 
|  | ===================== | 
|  |  | 
|  | Migration of a virtual machine involves saving the state of each device that | 
|  | the guest is using on the source host and restoring this saved state on the | 
|  | destination host. This document details how saving and restoring of VFIO | 
|  | devices is done in QEMU. | 
|  |  | 
|  | Migration of VFIO devices consists of two phases: the optional pre-copy phase, | 
|  | and the stop-and-copy phase. The pre-copy phase is iterative and accommodates | 
|  | VFIO devices that have a large amount of data to transfer. It lets the guest | 
|  | continue running while the VFIO device state is transferred to the | 
|  | destination, which helps to reduce the total downtime of the VM. VFIO devices | 
|  | opt in to pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in | 
|  | the VFIO_DEVICE_FEATURE_MIGRATION ioctl. | 
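
As a minimal sketch of the opt-in check described above, the helper below tests
the pre-copy bit in a migration flags word. The ``EX_``-prefixed macros and the
``ex_supports_precopy`` helper are hypothetical names used only for this
illustration; the bit values are assumed to mirror the ``VFIO_MIGRATION_*``
definitions in linux-headers/linux/vfio.h and should be checked against that
header. The ioctl call that would fill in the flags is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-ins for the migration flag bits (assumption: they mirror the
 * VFIO_MIGRATION_* values in linux/vfio.h; verify against the header).
 */
#define EX_VFIO_MIGRATION_STOP_COPY (1u << 0)
#define EX_VFIO_MIGRATION_P2P       (1u << 1)
#define EX_VFIO_MIGRATION_PRE_COPY  (1u << 2)

/* Hypothetical helper: true if the device reported pre-copy support. */
bool ex_supports_precopy(uint64_t mig_flags)
{
    return (mig_flags & EX_VFIO_MIGRATION_PRE_COPY) != 0;
}
```

A device that reports only the stop-and-copy bit would make this helper return
false, in which case the pre-copy phase is skipped for that device.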
|  |  | 
|  | When pre-copy is supported, it's possible to further reduce downtime by | 
|  | enabling the "switchover-ack" migration capability. | 
|  | The VFIO migration uAPI defines "initial bytes" as part of its pre-copy data | 
|  | stream and recommends that the initial bytes are sent and loaded in the | 
|  | destination before stopping the source VM. Enabling this migration capability | 
|  | guarantees that, and can thus potentially reduce downtime even further. | 
|  |  | 
|  | To support migration of multiple devices that might do P2P transactions between | 
|  | themselves, the VFIO migration uAPI defines an intermediate P2P quiescent state. | 
|  | While in the P2P quiescent state, P2P DMA transactions cannot be initiated by | 
|  | the device, but the device can respond to incoming ones. Additionally, all | 
|  | outstanding P2P transactions are guaranteed to have been completed by the time | 
|  | the device enters this state. | 
|  |  | 
|  | All the devices that support P2P migration are first transitioned to the P2P | 
|  | quiescent state and only then are they stopped or started. This makes migration | 
|  | safe P2P-wise, since starting and stopping the devices is not done atomically | 
|  | for all the devices together. | 
|  |  | 
|  | Thus, migration of multiple VFIO devices is allowed only if all the devices | 
|  | support P2P migration. Migration of a single VFIO device is allowed regardless | 
|  | of P2P migration support. | 
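
The rule above can be sketched as a small predicate. The ``ex_vfio_dev``
structure and the ``ex_migration_allowed`` function are hypothetical and exist
only for this illustration; they are not QEMU data structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device summary; not a QEMU structure. */
struct ex_vfio_dev {
    bool supports_p2p;
};

/*
 * Returns true if migration may proceed for the given set of VFIO devices:
 * a single device is always allowed, several devices only if every one of
 * them supports P2P migration.
 */
bool ex_migration_allowed(const struct ex_vfio_dev *devs, size_t n)
{
    if (n <= 1) {
        return true;
    }
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].supports_p2p) {
            return false;
        }
    }
    return true;
}
```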
|  |  | 
|  | A detailed description of the uAPI for VFIO device migration can be found in | 
|  | the comment for the ``vfio_device_mig_state`` enum in the header file | 
|  | linux-headers/linux/vfio.h. | 
|  |  | 
|  | VFIO implements the device hooks for the iterative approach as follows: | 
|  |  | 
|  | * A ``save_setup`` function that sets up migration on the source. | 
|  |  | 
|  | * A ``load_setup`` function that sets the VFIO device on the destination in | 
|  |   _RESUMING state. | 
|  |  | 
|  | * A ``state_pending_estimate`` function that reports an estimate of the | 
|  |   remaining pre-copy data that the vendor driver has yet to save for the VFIO | 
|  |   device. | 
|  |  | 
|  | * A ``state_pending_exact`` function that reads pending_bytes from the vendor | 
|  |   driver, which indicates the amount of data that the vendor driver has yet to | 
|  |   save for the VFIO device. | 
|  |  | 
|  | * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is | 
|  |   active only when the VFIO device is in pre-copy states. | 
|  |  | 
|  | * A ``save_live_iterate`` function that reads the VFIO device's data from the | 
|  |   vendor driver during iterative pre-copy phase. | 
|  |  | 
|  | * A ``switchover_ack_needed`` function that checks if the VFIO device uses | 
|  |   "switchover-ack" migration capability when this capability is enabled. | 
|  |  | 
|  | * A ``save_state`` function to save the device config space if it is present. | 
|  |  | 
|  | * A ``save_live_complete_precopy`` function that sets the VFIO device in | 
|  |   _STOP_COPY state and iteratively copies the data for the VFIO device until | 
|  |   the vendor driver indicates that no data remains. | 
|  |  | 
|  | * A ``load_state`` function that loads the config section and the data | 
|  |   sections that are generated by the save functions above. | 
|  |  | 
|  | * ``cleanup`` functions for both save and load that perform any migration | 
|  |   related cleanup. | 
|  |  | 
|  |  | 
|  | The VFIO migration code uses a VM state change handler to change the VFIO | 
|  | device state when the VM state changes from running to not-running, and | 
|  | vice versa. | 
|  |  | 
|  | Similarly, a migration state change handler is used to trigger a transition of | 
|  | the VFIO device state when certain changes of the migration state occur. For | 
|  | example, the VFIO device state is transitioned back to _RUNNING in case a | 
|  | migration failed or was canceled. | 
|  |  | 
|  | System memory dirty pages tracking | 
|  | ---------------------------------- | 
|  |  | 
|  | The ``log_global_start`` and ``log_global_stop`` memory listener callbacks | 
|  | inform the VFIO dirty tracking module to start and stop dirty page tracking. A | 
|  | ``log_sync`` memory listener callback queries the dirty page bitmap from the | 
|  | dirty tracking module and marks system memory pages which were DMA-ed by the | 
|  | VFIO device as dirty. The dirty page bitmap is queried per container. | 
|  |  | 
|  | Currently there are two ways dirty page tracking can be done: | 
|  | (1) Device dirty tracking: | 
|  | In this method the device is responsible for logging and reporting its DMAs. | 
|  | This method can be used only if the device is capable of tracking its DMAs. | 
|  | Discovering device capability, starting and stopping dirty tracking, and | 
|  | syncing the dirty bitmaps from the device are done using the DMA logging uAPI. | 
|  | More info about the uAPI can be found in the comments of the | 
|  | ``vfio_device_feature_dma_logging_control`` and | 
|  | ``vfio_device_feature_dma_logging_report`` structures in the header file | 
|  | linux-headers/linux/vfio.h. | 
|  |  | 
|  | (2) VFIO IOMMU module: | 
|  | In this method dirty tracking is done by the IOMMU. However, there is currently no | 
|  | IOMMU support for dirty page tracking. For this reason, all pages are | 
|  | perpetually marked dirty, unless the device driver pins pages through external | 
|  | APIs in which case only those pinned pages are perpetually marked dirty. | 
|  |  | 
|  | If the above two methods are not supported, all pages are perpetually marked | 
|  | dirty by QEMU. | 
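
The choice between these methods can be sketched as follows. This is only an
illustration: the enum, the function, and both capability inputs are
hypothetical names, and preferring device dirty tracking over the VFIO IOMMU
module when both are available is an assumption of this sketch rather than
something stated above.

```c
#include <stdbool.h>

/* Hypothetical labels for the tracking outcomes described in the text. */
enum ex_dirty_tracking {
    EX_DIRTY_DEVICE,     /* device logs and reports its own DMAs */
    EX_DIRTY_IOMMU,      /* VFIO IOMMU module reports dirty pages */
    EX_DIRTY_ALL_PAGES   /* fallback: QEMU marks all pages dirty */
};

/*
 * Picks a tracking method from two capability bits. Assumption of this
 * sketch: device tracking is preferred when both are available.
 */
enum ex_dirty_tracking ex_pick_dirty_tracking(bool device_can_log_dma,
                                              bool iommu_module_can_track)
{
    if (device_can_log_dma) {
        return EX_DIRTY_DEVICE;
    }
    if (iommu_module_can_track) {
        return EX_DIRTY_IOMMU;
    }
    return EX_DIRTY_ALL_PAGES;
}
```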
|  |  | 
|  | By default, dirty pages are tracked during the pre-copy phase as well as the | 
|  | stop-and-copy phase, so a page marked as dirty will be copied to the | 
|  | destination in both phases. Copying dirty pages in the pre-copy phase helps | 
|  | QEMU to predict whether it can achieve its downtime tolerances: if QEMU keeps | 
|  | finding dirty pages during the pre-copy phase, it is likely to find dirty | 
|  | pages in the stop-and-copy phase as well and can predict the downtime | 
|  | accordingly. | 
|  |  | 
|  | QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking`` | 
|  | which disables querying the dirty bitmap during the pre-copy phase. If it is | 
|  | set to off, all dirty pages will be copied to the destination in the | 
|  | stop-and-copy phase only. | 
|  |  | 
|  | System memory dirty pages tracking when vIOMMU is enabled | 
|  | --------------------------------------------------------- | 
|  |  | 
|  | With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy | 
|  | phase of migration. In that case, the unmap ioctl returns any dirty pages in | 
|  | that range and QEMU reports the corresponding guest physical pages as dirty. | 
|  | During the stop-and-copy phase, an IOMMU notifier is used to get a callback for | 
|  | mapped pages, and then the dirty page bitmap is fetched from the VFIO IOMMU | 
|  | modules for those mapped ranges. If device dirty tracking is enabled with | 
|  | vIOMMU, live migration will be blocked. | 
|  |  | 
|  | Flow of state changes during live migration | 
|  | =========================================== | 
|  |  | 
|  | Below is the state change flow during live migration for a VFIO device that | 
|  | supports both pre-copy and P2P migration. The flow for devices that don't | 
|  | support them is similar, except that the relevant states for pre-copy and P2P | 
|  | are skipped. | 
|  | The values in the parentheses represent the VM state, the migration state, and | 
|  | the VFIO device state, respectively. | 
|  |  | 
|  | Live migration save path | 
|  | ------------------------ | 
|  |  | 
|  | :: | 
|  |  | 
|  |     QEMU normal running state | 
|  |     (RUNNING, _NONE, _RUNNING) | 
|  |     | | 
|  |     migrate_init spawns migration_thread | 
|  |     Migration thread then calls each device's .save_setup() | 
|  |     (RUNNING, _SETUP, _PRE_COPY) | 
|  |     | | 
|  |     (RUNNING, _ACTIVE, _PRE_COPY) | 
|  |     If device is active, get pending_bytes by .state_pending_{estimate,exact}() | 
|  |     If total pending_bytes >= threshold_size, call .save_live_iterate() | 
|  |     Data of VFIO device for pre-copy phase is copied | 
|  |     Iterate till total pending bytes converge and are less than threshold | 
|  |     | | 
|  |     On migration completion, the vCPUs and the VFIO device are stopped | 
|  |     The VFIO device is first put in P2P quiescent state | 
|  |     (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P) | 
|  |     | | 
|  |     Then the VFIO device is put in _STOP_COPY state | 
|  |     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY) | 
|  |     .save_live_complete_precopy() is called for each active device | 
|  |     For the VFIO device, iterate in .save_live_complete_precopy() until | 
|  |     pending data is 0 | 
|  |     | | 
|  |     (POSTMIGRATE, _COMPLETED, _STOP_COPY) | 
|  |     Migration thread schedules cleanup bottom half and exits | 
|  |     | | 
|  |     .save_cleanup() is called | 
|  |     (POSTMIGRATE, _COMPLETED, _STOP) | 
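
The iterate-until-converged step of the save path can be illustrated with a toy
model. Everything here is hypothetical: the function name, the fixed amount
sent per round, and the fixed amount of new data the device produces per round
are simplifications made for this sketch and do not reflect how QEMU or a
vendor driver computes pending bytes.

```c
#include <stdint.h>

/*
 * Toy model of the pre-copy loop: while pending >= threshold, one
 * "iterate" round sends up to sent_per_round bytes and the still-running
 * device produces new_per_round bytes of fresh state.
 *
 * Returns the number of rounds needed until pending < threshold, or -1 if
 * that has not happened after max_rounds rounds (no convergence).
 */
int ex_precopy_rounds(uint64_t pending, uint64_t sent_per_round,
                      uint64_t new_per_round, uint64_t threshold,
                      int max_rounds)
{
    int rounds = 0;

    while (pending >= threshold) {
        if (rounds == max_rounds) {
            return -1;
        }
        /* Send what we can this round, never more than what is pending. */
        pending -= (sent_per_round < pending) ? sent_per_round : pending;
        /* The device keeps running and generates more state. */
        pending += new_per_round;
        rounds++;
    }
    return rounds;
}
```

In this model the loop converges only when more data is sent per round than the
device generates; otherwise the pending amount never drops below the threshold
and the function reports -1.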
|  |  | 
|  | Live migration resume path | 
|  | -------------------------- | 
|  |  | 
|  | :: | 
|  |  | 
|  |     Incoming migration calls .load_setup() for each device | 
|  |     (RESTORE_VM, _ACTIVE, _STOP) | 
|  |     | | 
|  |     For each device, .load_state() is called for that device section data | 
|  |     (RESTORE_VM, _ACTIVE, _RESUMING) | 
|  |     | | 
|  |     At the end, .load_cleanup() is called for each device and vCPUs are started | 
|  |     The VFIO device is first put in P2P quiescent state | 
|  |     (RUNNING, _ACTIVE, _RUNNING_P2P) | 
|  |     | | 
|  |     (RUNNING, _NONE, _RUNNING) | 
|  |  | 
|  | Postcopy | 
|  | ======== | 
|  |  | 
|  | Postcopy migration is currently not supported for VFIO devices. |