| Multi-process QEMU |
| =================== |
| |
| .. note:: |
| |
| This is the design document for multi-process QEMU. It does not |
| necessarily reflect the status of the current implementation, which |
| may lack features or be considerably different from what is described |
| in this document. This document is still useful as a description of |
| the goals and general direction of this feature. |
| |
| Please refer to the following wiki for latest details: |
| https://wiki.qemu.org/Features/MultiProcessQEMU |
| |
| QEMU is often used as the hypervisor for virtual machines running in the |
| Oracle cloud. Since one of the advantages of cloud computing is the |
| ability to run many VMs from different tenants in the same cloud |
| infrastructure, a guest that compromised its hypervisor could |
| potentially use the hypervisor's access privileges to access data it is |
| not authorized for. |
| |
| QEMU can be susceptible to security attacks because it is a large, |
| monolithic program that provides many features to the VMs it services. |
| Many of these features can be configured out of QEMU, but even a reduced |
| configuration QEMU has a large amount of code a guest can potentially |
attack. Separating QEMU into multiple processes reduces the attack
surface by helping to limit each component in the system to accessing
only the resources it needs to perform its job.
| |
| QEMU services |
| ------------- |
| |
| QEMU can be broadly described as providing three main services. One is a |
| VM control point, where VMs can be created, migrated, re-configured, and |
| destroyed. A second is to emulate the CPU instructions within the VM, |
| often accelerated by HW virtualization features such as Intel's VT |
| extensions. Finally, it provides IO services to the VM by emulating HW |
| IO devices, such as disk and network devices. |
| |
| A multi-process QEMU |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| A multi-process QEMU involves separating QEMU services into separate |
| host processes. Each of these processes can be given only the privileges |
| it needs to provide its service, e.g., a disk service could be given |
| access only to the disk images it provides, and not be allowed to |
| access other files, or any network devices. An attacker who compromised |
| this service would not be able to use this exploit to access files or |
| devices beyond what the disk service was given access to. |
| |
A QEMU control process would remain, but in multi-process mode, it would
have no direct interfaces to the VM. During VM execution, it would still
| provide the user interface to hot-plug devices or live migrate the VM. |
| |
| A first step in creating a multi-process QEMU is to separate IO services |
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
| process. In a later phase, CPU emulation could be separated from the |
| control process. |
| |
| Separating IO services |
| ---------------------- |
| |
| Separating IO services into individual host processes is a good place to |
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and which, indeed, have been a source of
exploits in the past. Another is that the modular nature of QEMU device
emulation code provides
| interface points where the QEMU functions that perform device emulation |
| can be separated from the QEMU functions that manage the emulation of |
| guest CPU instructions. The devices emulated in the separate process are |
| referred to as remote devices. |
| |
| QEMU device emulation |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| QEMU uses an object oriented SW architecture for device emulation code. |
| Configured objects are all compiled into the QEMU binary, then objects |
| are instantiated by name when used by the guest VM. For example, the |
| code to emulate a device named "foo" is always present in QEMU, but its |
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).
| |
| The object model is hierarchical, so device emulation code names its |
| parent object (such as "pci-device" for a PCI device) and QEMU will |
| instantiate a parent object before calling the device's instantiation |
| code. |
| |
| Current separation models |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In order to separate the device emulation code from the CPU emulation |
| code, the device object code must run in a different process. There are |
| a couple of existing QEMU features that can run emulation code |
| separately from the main QEMU process. These are examined below. |
| |
| vhost user model |
| ^^^^^^^^^^^^^^^^ |
| |
| Virtio guest device drivers can be connected to vhost user applications |
| in order to perform their IO operations. This model uses special virtio |
| device drivers in the guest and vhost user device objects in QEMU, but |
| once the QEMU vhost user code has configured the vhost user application, |
| mission-mode IO is performed by the application. The vhost user |
| application is a daemon process that can be contacted via a known UNIX |
| domain socket. |
| |
| vhost socket |
| '''''''''''' |
| |
| As mentioned above, one of the tasks of the vhost device object within |
| QEMU is to contact the vhost application and send it configuration |
| information about this device instance. As part of the configuration |
| process, the application can also be sent other file descriptors over |
| the socket, which then can be used by the vhost user application in |
| various ways, some of which are described below. |
| |
| vhost MMIO store acceleration |
| ''''''''''''''''''''''''''''' |
| |
| VMs are often run using HW virtualization features via the KVM kernel |
| driver. This driver allows QEMU to accelerate the emulation of guest CPU |
| instructions by running the guest in a virtual HW mode. When the guest |
| executes instructions that cannot be executed by virtual HW mode, |
| execution returns to the KVM driver so it can inform QEMU to emulate the |
| instructions in SW. |
| |
| One of the events that can cause a return to QEMU is when a guest device |
| driver accesses an IO location. QEMU then dispatches the memory |
| operation to the corresponding QEMU device object. In the case of a |
| vhost user device, the memory operation would need to be sent over a |
socket to the vhost application. This path is accelerated by the QEMU
virtio code setting up an eventfd file descriptor from which the vhost
application can directly receive MMIO store notifications from the KVM
driver, instead of needing them to be sent to the QEMU process first.
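
As a minimal sketch (not the actual virtio/vhost code), the doorbell
wiring might look like the following; ``event_notifier_init()`` and
``memory_region_add_eventfd()`` are existing QEMU interfaces, while the
function name and parameters here are illustrative:

::

  /* Sketch: create an eventfd and ask KVM (via the memory API) to signal
   * it directly when the guest stores to the queue doorbell, so the
   * notification never has to bounce through the QEMU process. */
  #include "qemu/osdep.h"
  #include "qemu/event_notifier.h"
  #include "exec/memory.h"

  static void setup_store_notifier(MemoryRegion *doorbell_mr, hwaddr offset,
                                   uint16_t vq_index, EventNotifier *e)
  {
      event_notifier_init(e, 0);              /* backing eventfd            */
      memory_region_add_eventfd(doorbell_mr,  /* region containing the bell */
                                offset,       /* doorbell offset            */
                                2,            /* 2-byte store               */
                                true,         /* match the stored value...  */
                                vq_index,     /* ...to select this queue    */
                                e);
      /* The fd inside 'e' is then handed to the vhost user application
       * over the UNIX domain socket (SCM_RIGHTS). */
  }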
| |
| vhost interrupt acceleration |
| '''''''''''''''''''''''''''' |
| |
| Another optimization used by the vhost application is the ability to |
| directly inject interrupts into the VM via the KVM driver, again, |
| bypassing the need to send the interrupt back to the QEMU process first. |
| The QEMU virtio setup code configures the KVM driver with an eventfd |
| that triggers the device interrupt in the guest when the eventfd is |
| written. This irqfd file descriptor is then passed to the vhost user |
| application program. |
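
A hedged sketch of the underlying KVM binding (QEMU's actual code goes
through its irqchip helpers; ``vm_fd`` and ``gsi`` are assumed to be
known already):

::

  #include <sys/ioctl.h>
  #include <sys/eventfd.h>
  #include <linux/kvm.h>

  /* Sketch: bind an eventfd to a guest interrupt line with KVM_IRQFD.  A
   * write to the eventfd by the vhost (or emulation) process then injects
   * the interrupt without a trip through QEMU. */
  static int bind_irqfd(int vm_fd, unsigned int gsi)
  {
      int efd = eventfd(0, EFD_CLOEXEC);
      struct kvm_irqfd req = {
          .fd  = efd,               /* eventfd passed to the application */
          .gsi = gsi,               /* guest interrupt line              */
      };

      if (efd < 0 || ioctl(vm_fd, KVM_IRQFD, &req) < 0) {
          return -1;
      }
      return efd;                   /* send over the vhost user socket   */
  }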
| |
| vhost access to guest memory |
| '''''''''''''''''''''''''''' |
| |
| The vhost application is also allowed to directly access guest memory, |
| instead of needing to send the data as messages to QEMU. This is also |
| done with file descriptors sent to the vhost user application by QEMU. |
| These descriptors can be passed to ``mmap()`` by the vhost application |
| to map the guest address space into the vhost application. |
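
For instance, a back end might map one received region roughly as
follows (a sketch; the descriptor, size and offset come from the vhost
user memory table message):

::

  #include <sys/mman.h>

  /* Sketch: map one guest RAM region from a descriptor received over the
   * vhost user socket into this process's address space. */
  static void *map_guest_region(int fd, size_t size, off_t offset)
  {
      void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, offset);
      return p == MAP_FAILED ? NULL : p;
  }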
| |
| IOMMUs introduce another level of complexity, since the address given to |
| the guest virtio device to DMA to or from is not a guest physical |
| address. This case is handled by having vhost code within QEMU register |
| as a listener for IOMMU mapping changes. The vhost application maintains |
a cache of IOMMU translations: sending translation requests back to
| QEMU on cache misses, and in turn receiving flush requests from QEMU |
| when mappings are purged. |
| |
| applicability to device separation |
| '''''''''''''''''''''''''''''''''' |
| |
| Much of the vhost model can be re-used by separated device emulation. In |
| particular, the ideas of using a socket between QEMU and the device |
| emulation application, using a file descriptor to inject interrupts into |
the VM via KVM, and allowing the application to ``mmap()`` the guest's
memory should be re-used.
| |
| There are, however, some notable differences between how a vhost |
| application works and the needs of separated device emulation. The most |
| basic is that vhost uses custom virtio device drivers which always |
| trigger IO with MMIO stores. A separated device emulation model must |
| work with existing IO device models and guest device drivers. MMIO loads |
| break vhost store acceleration since they are synchronous - guest |
| progress cannot continue until the load has been emulated. By contrast, |
stores are asynchronous; the guest can continue after the store event
| has been sent to the vhost application. |
| |
| Another difference is that in the vhost user model, a single daemon can |
| support multiple QEMU instances. This is contrary to the security regime |
| desired, in which the emulation application should only be allowed to |
| access the files or devices the VM it's running on behalf of can access. |

qemu-io model
^^^^^^^^^^^^^
| |
| ``qemu-io`` is a test harness used to test changes to the QEMU block backend |
| object code (e.g., the code that implements disk images for disk driver |
| emulation). ``qemu-io`` is not a device emulation application per se, but it |
| does compile the QEMU block objects into a separate binary from the main |
| QEMU one. This could be useful for disk device emulation, since its |
| emulation applications will need to include the QEMU block objects. |
| |
| New separation model based on proxy objects |
| ------------------------------------------- |
| |
| A different model based on proxy objects in the QEMU program |
| communicating with remote emulation programs could provide separation |
| while minimizing the changes needed to the device emulation code. The |
| rest of this section is a discussion of how a proxy object model would |
| work. |
| |
| Remote emulation processes |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The remote emulation process will run the QEMU object hierarchy without |
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
| backends that QEMU has. |
| |
| The processes will communicate with the QEMU process over UNIX domain |
| sockets. The processes can be executed either as standalone processes, |
or be executed by QEMU. In both cases, the host backends the emulation
processes will provide are specified on their command lines, as they
would be for QEMU. For example:
| |
| :: |
| |
| disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \ |
| -blockdev driver=qcow2,node-name=drive0,file=file0 |
| |
| would indicate process *disk-proc* uses a qcow2 emulated disk named |
| *file0* as its backend. |
| |
| Emulation processes may emulate more than one guest controller. A common |
| configuration might be to put all controllers of the same device class |
| (e.g., disk, network, etc.) in a single process, so that all backends of |
| the same type can be managed by a single QMP monitor. |
| |
| communication with QEMU |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The first argument to the remote emulation process will be a Unix domain |
| socket that connects with the Proxy object. This is a required argument. |
| |
| :: |
| |
| disk-proc <socket number> <backend list> |
| |
| remote process QMP monitor |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Remote emulation processes can be monitored via QMP, similar to QEMU |
itself. The QMP monitor socket is specified in the same way as for a QEMU
| process: |
| |
| :: |
| |
| disk-proc -qmp unix:/tmp/disk-mon,server |
| |
| can be monitored over the UNIX socket path */tmp/disk-mon*. |
| |
| QEMU command line |
| ~~~~~~~~~~~~~~~~~ |
| |
| Each remote device emulated in a remote process on the host is |
| represented as a *-device* of type *pci-proxy-dev*. A socket |
| sub-option to this option specifies the Unix socket that connects |
| to the remote process. An *id* sub-option is required, and it should |
| be the same id as used in the remote process. |
| |
| :: |
| |
| qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3 |
| |
can be used to add a device emulated in a remote process.
| |
| |
| QEMU management of remote processes |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
QEMU is not aware of the type of the remote PCI device. It is a
pass-through device as far as QEMU is concerned.
| |
| communication with emulation process |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| primary channel |
| ''''''''''''''' |
| |
| The primary channel (referred to as com in the code) is used to bootstrap |
| the remote process. It is also used to pass on device-agnostic commands |
| like reset. |
| |
| per-device channels |
| ''''''''''''''''''' |
| |
| Each remote device communicates with QEMU using a dedicated communication |
| channel. The proxy object sets up this channel using the primary |
| channel during its initialization. |
| |
| QEMU device proxy objects |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| QEMU has an object model based on sub-classes inherited from the |
| "object" super-class. The sub-classes that are of interest here are the |
| "device" and "bus" sub-classes whose child sub-classes make up the |
| device tree of a QEMU emulated system. |
| |
| The proxy object model will use device proxy objects to replace the |
| device emulation code within the QEMU process. These objects will live |
| in the same place in the object and bus hierarchies as the objects they |
| replace. i.e., the proxy object for an LSI SCSI controller will be a |
| sub-class of the "pci-device" class, and will have the same PCI bus |
| parent and the same SCSI bus child objects as the LSI controller object |
| it replaces. |
| |
| It is worth noting that the same proxy object is used to mediate with |
| all types of remote PCI devices. |
| |
| object initialization |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
The Proxy device objects are initialized in the same manner as any
other QEMU device would be.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device in the
  separate process
| |
| class\_init |
| ''''''''''' |
| |
The ``class_init()`` method of a proxy object will, in general, behave
| similarly to the object it replaces, including setting any static |
| properties and methods needed by the proxy. |
| |
| instance\_init / realize |
| '''''''''''''''''''''''' |
| |
| The ``instance_init()`` and ``realize()`` functions would only need to |
perform tasks related to being a proxy, such as registering its own
| MMIO handlers, or creating a child bus that other proxy devices can be |
| attached to later. |
| |
| Other tasks will be device-specific. For example, PCI device objects |
| will initialize the PCI config space in order to make a valid PCI device |
| tree within the QEMU process. |
| |
| address space registration |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Most devices are driven by guest device driver accesses to IO addresses |
| or ports. The QEMU device emulation code uses QEMU's memory region |
| function calls (such as ``memory_region_init_io()``) to add callback |
| functions that QEMU will invoke when the guest accesses the device's |
| areas of the IO address space. When a guest driver does access the |
| device, the VM will exit HW virtualization mode and return to QEMU, |
which will then look up and execute the corresponding callback function.
| |
| A proxy object would need to mirror the memory region calls the actual |
| device emulator would perform in its initialization code, but with its |
| own callbacks. When invoked by QEMU as a result of a guest IO operation, |
| they will forward the operation to the device emulation process. |
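
A sketch of what that mirrored registration could look like;
``memory_region_init_io()``, ``pci_register_bar()`` and the
``MemoryRegionOps`` callbacks are existing QEMU interfaces, while
``PCIProxyDev``, ``PCI_PROXY_DEV()`` and ``send_mmio_request()`` are
assumed proxy types and helpers, not existing APIs:

::

  /* Sketch: a proxy BAR whose callbacks forward guest accesses to the
   * remote emulation process instead of emulating the device locally. */
  static uint64_t proxy_bar_read(void *opaque, hwaddr addr, unsigned size)
  {
      PCIProxyDev *dev = opaque;
      /* Loads are synchronous: wait for the remote reply data. */
      return send_mmio_request(dev, addr, 0, size, false);
  }

  static void proxy_bar_write(void *opaque, hwaddr addr, uint64_t val,
                              unsigned size)
  {
      PCIProxyDev *dev = opaque;
      send_mmio_request(dev, addr, val, size, true);
  }

  static const MemoryRegionOps proxy_bar_ops = {
      .read = proxy_bar_read,
      .write = proxy_bar_write,
      .endianness = DEVICE_NATIVE_ENDIAN,
  };

  static void pci_proxy_dev_realize(PCIDevice *pci_dev, Error **errp)
  {
      PCIProxyDev *dev = PCI_PROXY_DEV(pci_dev);

      /* Mirror the BAR layout of the device being replaced. */
      memory_region_init_io(&dev->bar0, OBJECT(dev), &proxy_bar_ops, dev,
                            "proxy-bar0", dev->bar0_size);
      pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &dev->bar0);
  }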
| |
| PCI config space |
| ^^^^^^^^^^^^^^^^ |
| |
| PCI devices also have a configuration space that can be accessed by the |
guest driver. Guest accesses to this space are not handled by the device
| emulation object, but by its PCI parent object. Much of this space is |
| read-only, but certain registers (especially BAR and MSI-related ones) |
| need to be propagated to the emulation process. |
| |
| PCI parent proxy |
| '''''''''''''''' |
| |
| One way to propagate guest PCI config accesses is to create a |
| "pci-device-proxy" class that can serve as the parent of a PCI device |
| proxy object. This class's parent would be "pci-device" and it would |
| override the PCI parent's ``config_read()`` and ``config_write()`` |
| methods with ones that forward these operations to the emulation |
| program. |
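
A sketch of such an override; the ``config_read``/``config_write`` hooks
on ``PCIDeviceClass`` and ``pci_default_write_config()`` exist in QEMU,
while ``forward_config_read()``/``forward_config_write()`` are assumed
proxy helpers:

::

  /* Sketch: forward guest config space accesses to the emulation process
   * while letting QEMU keep its local view of BAR/MSI programming. */
  static uint32_t proxy_config_read(PCIDevice *d, uint32_t addr, int len)
  {
      return forward_config_read(PCI_PROXY_DEV(d), addr, len);
  }

  static void proxy_config_write(PCIDevice *d, uint32_t addr, uint32_t val,
                                 int len)
  {
      pci_default_write_config(d, addr, val, len);            /* local view */
      forward_config_write(PCI_PROXY_DEV(d), addr, val, len); /* remote dev */
  }

  static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
  {
      PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

      k->config_read  = proxy_config_read;
      k->config_write = proxy_config_write;
  }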
| |
| interrupt receipt |
| ^^^^^^^^^^^^^^^^^ |
| |
| A proxy for a device that generates interrupts will need to create a |
| socket to receive interrupt indications from the emulation process. An |
| incoming interrupt indication would then be sent up to its bus parent to |
| be injected into the guest. For example, a PCI device object may use |
| ``pci_set_irq()``. |
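
For example, the proxy could register a read handler on that socket and
reflect each message with the existing ``pci_set_irq()`` call; ``IrqMsg``
and the ``irq_sock`` field are illustrative:

::

  /* Sketch: deliver an interrupt message from the emulation process.
   * Registered with qemu_set_fd_handler(dev->irq_sock, proxy_intr_handler,
   * NULL, dev) so it runs when the socket becomes readable. */
  static void proxy_intr_handler(void *opaque)
  {
      PCIProxyDev *dev = opaque;
      IrqMsg msg;                        /* illustrative wire format */

      if (recv(dev->irq_sock, &msg, sizeof(msg), 0) != sizeof(msg)) {
          return;
      }
      /* Let the PCI bus parent raise or lower the pin for us. */
      pci_set_irq(PCI_DEVICE(dev), msg.level);
  }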
| |
| live migration |
| ^^^^^^^^^^^^^^ |
| |
| The proxy will register to save and restore any *vmstate* it needs over |
| a live migration event. The device proxy does not need to manage the |
| remote device's *vmstate*; that will be handled by the remote process |
| proxy (see below). |
| |
| QEMU remote device operation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Generic device operations, such as DMA, will be performed by the remote |
| process proxy by sending messages to the remote process. |
| |
| DMA operations |
| ^^^^^^^^^^^^^^ |
| |
| DMA operations would be handled much like vhost applications do. One of |
| the initial messages sent to the emulation process is a guest memory |
| table. Each entry in this table consists of a file descriptor and size |
| that the emulation process can ``mmap()`` to directly access guest |
| memory, similar to ``vhost_user_set_mem_table()``. Note guest memory |
| must be backed by shared file-backed memory, for example, using |
| *-object memory-backend-file,share=on* and setting that memory backend |
| as RAM for the machine. |
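
A sketch of what such a memory table message might contain; the layout
is illustrative, and the ``mmap()`` usage mirrors what vhost user back
ends already do:

::

  #include <stdint.h>
  #include <sys/mman.h>

  #define MAX_RAM_REGIONS 8              /* illustrative limit */

  typedef struct {
      uint64_t gpa;                      /* guest physical base address */
      uint64_t size;                     /* region length               */
      uint64_t fd_offset;                /* offset within the fd        */
  } RamRegion;

  typedef struct {
      uint32_t  nregions;
      RamRegion regions[MAX_RAM_REGIONS];
      /* one file descriptor per region arrives via SCM_RIGHTS */
  } MemTableMsg;

  /* The emulation process maps each region for direct access to guest RAM. */
  static void *map_region(const RamRegion *r, int fd)
  {
      return mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd, r->fd_offset);
  }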
| |
| IOMMU operations |
| ^^^^^^^^^^^^^^^^ |
| |
| When the emulated system includes an IOMMU, the remote process proxy in |
| QEMU will need to create a socket for IOMMU requests from the emulation |
| process. It will handle those requests with an |
| ``address_space_get_iotlb_entry()`` call. In order to handle IOMMU |
| unmaps, the remote process proxy will also register as a listener on the |
| device's DMA address space. When an IOMMU memory region is created |
| within the DMA address space, an IOMMU notifier for unmaps will be added |
| to the memory region that will forward unmaps to the emulation process |
| over the IOMMU socket. |
| |
| device hot-plug via QMP |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
A QMP "device\_add" command can add a device emulated by a remote
process. It will also have a "rid" option, just as the *-device*
command line option does. The remote process may either be one started
at QEMU startup, or be one added by an "add-process" QMP command. In
either case, the remote process proxy will
| forward the new device's JSON description to the corresponding emulation |
| process. |
| |
| live migration |
| ^^^^^^^^^^^^^^ |
| |
| The remote process proxy will also register for live migration |
| notifications with ``vmstate_register()``. When called to save state, |
| the proxy will send the remote process a secondary socket file |
| descriptor to save the remote process's device *vmstate* over. The |
| incoming byte stream length and data will be saved as the proxy's |
| *vmstate*. When the proxy is resumed on its new host, this *vmstate* |
| will be extracted, and a secondary socket file descriptor will be sent |
| to the new remote process through which it receives the *vmstate* in |
| order to restore the devices there. |
| |
| device emulation in remote process |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The parts of QEMU that the emulation program will need include the |
| object model; the memory emulation objects; the device emulation objects |
| of the targeted device, and any dependent devices; and, the device's |
backends. It will also need code to set up the machine environment,
| handle requests from the QEMU process, and route machine-level requests |
| (such as interrupts or IOMMU mappings) back to the QEMU process. |
| |
| initialization |
| ^^^^^^^^^^^^^^ |
| |
| The process initialization sequence will follow the same sequence |
| followed by QEMU. It will first initialize the backend objects, then |
| device emulation objects. The JSON descriptions sent by the QEMU process |
| will drive which objects need to be created. |
| |
| - address spaces |
| |
| Before the device objects are created, the initial address spaces and |
| memory regions must be configured with ``memory_map_init()``. This |
| creates a RAM memory region object (*system\_memory*) and an IO memory |
| region object (*system\_io*). |
| |
| - RAM |
| |
| RAM memory region creation will follow how ``pc_memory_init()`` creates |
| them, but must use ``memory_region_init_ram_from_fd()`` instead of |
| ``memory_region_allocate_system_memory()``. The file descriptors needed |
| will be supplied by the guest memory table from above. Those RAM regions |
| would then be added to the *system\_memory* memory region with |
| ``memory_region_add_subregion()``. |
| |
| - PCI |
| |
| IO initialization will be driven by the JSON descriptions sent from the |
| QEMU process. For a PCI device, a PCI bus will need to be created with |
| ``pci_root_bus_new()``, and a PCI memory region will need to be created |
| and added to the *system\_memory* memory region with |
| ``memory_region_add_subregion_overlap()``. The overlap version is |
| required for architectures where PCI memory overlaps with RAM memory. |
| |
| MMIO handling |
| ^^^^^^^^^^^^^ |
| |
| The device emulation objects will use ``memory_region_init_io()`` to |
| install their MMIO handlers, and ``pci_register_bar()`` to associate |
| those handlers with a PCI BAR, as they do within QEMU currently. |
| |
| In order to use ``address_space_rw()`` in the emulation process to |
| handle MMIO requests from QEMU, the PCI physical addresses must be the |
| same in the QEMU process and the device emulation process. In order to |
| accomplish that, guest BAR programming must also be forwarded from QEMU |
| to the emulation process. |
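
A sketch of servicing one forwarded request; ``address_space_rw()``,
``address_space_memory`` and ``MEMTXATTRS_UNSPECIFIED`` are existing
QEMU interfaces, and ``MMIOMsg`` is an illustrative wire format:

::

  #include "qemu/osdep.h"
  #include "exec/memory.h"

  typedef struct {
      uint64_t gpa;                /* guest physical address of the access */
      uint64_t data;               /* store data in, load data out         */
      uint32_t len;                /* access size in bytes (<= 8)          */
      bool     is_write;
  } MMIOMsg;                       /* illustrative wire format */

  static void handle_mmio_msg(MMIOMsg *msg)
  {
      uint8_t buf[8] = { 0 };

      if (msg->is_write) {
          memcpy(buf, &msg->data, msg->len);
      }
      /* Because guest BAR programming is forwarded, the guest physical
       * address resolves to the same device region here as in QEMU. */
      address_space_rw(&address_space_memory, msg->gpa,
                       MEMTXATTRS_UNSPECIFIED, buf, msg->len, msg->is_write);
      if (!msg->is_write) {
          memcpy(&msg->data, buf, msg->len);    /* reply carries load data */
      }
  }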
| |
| interrupt injection |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| When device emulation wants to inject an interrupt into the VM, the |
| request climbs the device's bus object hierarchy until the point where a |
| bus object knows how to signal the interrupt to the guest. The details |
| depend on the type of interrupt being raised. |
| |
| - PCI pin interrupts |
| |
| On x86 systems, there is an emulated IOAPIC object attached to the root |
| PCI bus object, and the root PCI object forwards interrupt requests to |
| it. The IOAPIC object, in turn, calls the KVM driver to inject the |
| corresponding interrupt into the VM. The simplest way to handle this in |
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
| process, and have the device proxy object reflect it up the PCI tree |
| there. |
| |
| - PCI MSI/X interrupts |
| |
| PCI MSI/X interrupts are implemented in HW as DMA writes to a |
| CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives |
| these DMA writes, then calls into the KVM driver to inject the interrupt |
| into the VM. A simple emulation process implementation would be to send |
| the MSI DMA address from QEMU as a message at initialization, then |
| install an address space handler at that address which forwards the MSI |
| message back to QEMU. |
| |
| DMA operations |
| ^^^^^^^^^^^^^^ |
| |
When an emulation object wants to DMA into or out of guest memory, it
must first use ``dma_memory_map()`` to convert the DMA address to a local
| virtual address. The emulation process memory region objects setup above |
| will be used to translate the DMA address to a local virtual address the |
| device emulation code can access. |
| |
| IOMMU |
| ^^^^^ |
| |
| When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory |
| regions to translate the DMA address to a guest physical address before |
| that physical address can be translated to a local virtual address. The |
| emulation process will need similar functionality. |
| |
| - IOTLB cache |
| |
| The emulation process will maintain a cache of recent IOMMU translations |
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.
| |
| - IOTLB purge |
| |
| The IOMMU emulation will also need to act on unmap requests from QEMU. |
| These happen when the guest IOMMU driver purges an entry from the |
| guest's translation table. |
| |
| live migration |
| ^^^^^^^^^^^^^^ |
| |
| When a remote process receives a live migration indication from QEMU, it |
| will set up a channel using the received file descriptor with |
| ``qio_channel_socket_new_fd()``. This channel will be used to create a |
| *QEMUfile* that can be passed to ``qemu_save_device_state()`` to send |
| the process's device state back to QEMU. This method will be reversed on |
| restore - the channel will be passed to ``qemu_loadvm_state()`` to |
| restore the device state. |
| |
| Accelerating device emulation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The messages that are required to be sent between QEMU and the emulation |
| process can add considerable latency to IO operations. The optimizations |
| described below attempt to ameliorate this effect by allowing the |
| emulation process to communicate directly with the kernel KVM driver. |
| The KVM file descriptors created would be passed to the emulation process |
| via initialization messages, much like the guest memory table is done. |

MMIO acceleration
^^^^^^^^^^^^^^^^^
| |
| Vhost user applications can receive guest virtio driver stores directly |
| from KVM. The issue with the eventfd mechanism used by vhost user is |
| that it does not pass any data with the event indication, so it cannot |
| handle guest loads or guest stores that carry store data. This concept |
| could, however, be expanded to cover more cases. |
| |
| The expanded idea would require a new type of KVM device: |
| *KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master |
| descriptor that QEMU can use for configuration, and a slave descriptor |
| that the emulation process can use to receive MMIO notifications. QEMU |
| would create both descriptors using the KVM driver, and pass the slave |
| descriptor to the emulation process via an initialization message. |
| |
| data structures |
| ^^^^^^^^^^^^^^^ |
| |
| - guest physical range |
| |
| The guest physical range structure describes the address range that a |
| device will respond to. It includes the base and length of the range, as |
well as which bus the range resides on (e.g., on an x86 machine, it can
| specify whether the range refers to memory or IO addresses). |
| |
| A device can have multiple physical address ranges it responds to (e.g., |
| a PCI device can have multiple BARs), so the structure will also include |
| an enumerated identifier to specify which of the device's ranges is |
| being referred to. |
| |
| +--------+----------------------------+ |
| | Name | Description | |
| +========+============================+ |
| | addr | range base address | |
| +--------+----------------------------+ |
| | len | range length | |
| +--------+----------------------------+ |
| | bus | addr type (memory or IO) | |
| +--------+----------------------------+ |
| | id | range ID (e.g., PCI BAR) | |
| +--------+----------------------------+ |
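
An illustrative C encoding of this structure (the field names follow the
table; the struct name and exact types are assumptions):

::

  #include <linux/types.h>

  struct kvm_user_pa_range {
      __u64 addr;      /* range base address              */
      __u64 len;       /* range length                    */
      __u32 bus;       /* address type, e.g. memory or IO */
      __u32 id;        /* range ID, e.g. which PCI BAR    */
  };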
| |
| - MMIO request structure |
| |
| This structure describes an MMIO operation. It includes which guest |
| physical range the MMIO was within, the offset within that range, the |
| MMIO type (e.g., load or store), and its length and data. It also |
| includes a sequence number that can be used to reply to the MMIO, and |
| the CPU that issued the MMIO. |
| |
| +----------+------------------------+ |
| | Name | Description | |
| +==========+========================+ |
| | rid | range MMIO is within | |
| +----------+------------------------+ |
| | offset | offset within *rid* | |
| +----------+------------------------+ |
| | type | e.g., load or store | |
| +----------+------------------------+ |
| | len | MMIO length | |
| +----------+------------------------+ |
| | data | store data | |
| +----------+------------------------+ |
| | seq | sequence ID | |
| +----------+------------------------+ |
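
An illustrative C encoding of this structure, matching the table above
(the struct name and exact types are assumptions):

::

  #include <linux/types.h>

  struct kvm_user_mmio {
      __u32 rid;       /* which registered range the MMIO hit */
      __u32 type;      /* load or store                       */
      __u64 offset;    /* offset within that range            */
      __u32 len;       /* access length in bytes              */
      __u32 seq;       /* sequence ID used to match the reply */
      __u64 data;      /* store data, or load reply data      */
  };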
| |
| - MMIO request queues |
| |
| MMIO request queues are FIFO arrays of MMIO request structures. There |
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
| acknowledged. The main use of the second queue is to validate MMIO |
| replies from the emulation program. |
| |
| - scoreboard |
| |
| Each CPU in the VM is emulated in QEMU by a separate thread, so multiple |
| MMIOs may be waiting to be consumed by an emulation program and multiple |
| threads may be waiting for MMIO replies. The scoreboard would contain a |
| wait queue and sequence number for the per-CPU threads, allowing them to |
| be individually woken when the MMIO reply is received from the emulation |
| program. It also tracks the number of posted MMIO stores to the device |
| that haven't been replied to, in order to satisfy the PCI constraint |
| that a load to a device will not complete until all previous stores to |
| that device have been completed. |
| |
| - device shadow memory |
| |
| Some MMIO loads do not have device side-effects. These MMIOs can be |
| completed without sending a MMIO request to the emulation program if the |
| emulation program shares a shadow image of the device's memory image |
| with the KVM driver. |
| |
| The emulation program will ask the KVM driver to allocate memory for the |
| shadow image, and will then use ``mmap()`` to directly access it. The |
| emulation program can control KVM access to the shadow image by sending |
| KVM an access map telling it which areas of the image have no |
| side-effects (and can be completed immediately), and which require a |
| MMIO request to the emulation program. The access map can also inform |
the KVM driver which size accesses are allowed to the image.
| |
| master descriptor |
| ^^^^^^^^^^^^^^^^^ |
| |
| The master descriptor is used by QEMU to configure the new KVM device. |
| The descriptor would be returned by the KVM driver when QEMU issues a |
| *KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type. |
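
A sketch of how QEMU might create the device; *KVM\_CREATE\_DEVICE* and
``struct kvm_create_device`` exist today, while *KVM\_DEV\_TYPE\_USER*
and *KVM\_DEV\_USER\_SLAVE\_FD* are the values this proposal would add:

::

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: create the proposed user device and fetch both descriptors. */
  static int create_user_device(int vm_fd, int *slave_fd)
  {
      struct kvm_create_device cd = {
          .type = KVM_DEV_TYPE_USER,            /* proposed device type */
      };

      if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0) {
          return -1;
      }
      /* cd.fd is the master descriptor; ask the driver for the slave
       * descriptor that will be passed to the emulation process. */
      *slave_fd = ioctl(cd.fd, KVM_DEV_USER_SLAVE_FD, 0);
      return cd.fd;                             /* master descriptor */
  }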
| |
KVM\_DEV\_TYPE\_USER device ops
''''''''''''''''''''''''''''''''
| |
| The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a |
``kvm_register_device_ops()`` call when the KVM system is initialized by
| ``kvm_init()``. These device ops are called by the KVM driver when QEMU |
| executes certain ``ioctl()`` operations on its KVM file descriptor. They |
| include: |
| |
| - create |
| |
| This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE* |
| ``ioctl()`` on its per-VM file descriptor. It will allocate and |
| initialize a KVM user device specific data structure, and assign the |
| *kvm\_device* private field to it. |
| |
| - ioctl |
| |
| This routine is invoked when QEMU issues an ``ioctl()`` on the master |
| descriptor. The ``ioctl()`` commands supported are defined by the KVM |
| device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands: |
| |
| *KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will |
| be passed to the device emulation program. Only one slave can be created |
| by each master descriptor. The file operations performed by this |
| descriptor are described below. |
| |
| The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical |
| address range that the slave descriptor will receive MMIO notifications |
| for. The range is specified by a guest physical range structure |
| argument. For buses that assign addresses to devices dynamically, this |
| command can be executed while the guest is running, such as the case |
| when a guest changes a device's PCI BAR registers. |
| |
| *KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to |
| register *kvm\_io\_device\_ops* callbacks to be invoked when the guest |
| performs a MMIO operation within the range. When a range is changed, |
| ``kvm_io_bus_unregister_dev()`` is used to remove the previous |
| instantiation. |
| |
| *KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies |
| how long KVM will wait for the emulation process to respond to a MMIO |
| indication. |
| |
| - destroy |
| |
| This routine is called when the VM instance is destroyed. It will need |
to destroy the slave descriptor and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.
| |
| slave descriptor |
| ^^^^^^^^^^^^^^^^ |
| |
| The slave descriptor will have its own file operations vector, which |
| responds to system calls on the descriptor performed by the device |
| emulation program. |
| |
| - read |
| |
| A read returns any pending MMIO requests from the KVM driver as MMIO |
| request structures. Multiple structures can be returned if there are |
| multiple MMIO operations pending. The MMIO requests are moved from the |
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be woken
| here. |
| |
| - write |
| |
| A write also consists of a set of MMIO requests. They are compared to |
| the MMIO requests in the sent queue. Matches are removed from the sent |
| queue, and any threads waiting for the reply are woken. If a store is |
| removed, then the number of posted stores in the per-CPU scoreboard is |
| decremented. When the number is zero, and a non side-effect load was |
| waiting for posted stores to complete, the load is continued. |
| |
| - ioctl |
| |
| There are several ioctl()s that can be performed on the slave |
| descriptor. |
| |
| A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to |
| allocate memory for the shadow image. This memory can later be |
| ``mmap()``\ ed by the emulation process to share the emulation's view of |
| device memory with the KVM driver. |
| |
| A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the |
| shadow image. It will send the KVM driver a shadow control map, which |
| specifies which areas of the image can complete guest loads without |
| sending the load request to the emulation program. It will also specify |
| the size of load operations that are allowed. |
| |
| - poll |
| |
| An emulation program will use the ``poll()`` call with a *POLLIN* flag |
| to determine if there are MMIO requests waiting to be read. It will |
| return if the pending MMIO request queue is not empty. |
| |
| - mmap |
| |
| This call allows the emulation program to directly access the shadow |
| image allocated by the KVM driver. As device emulation updates device |
| memory, changes with no side-effects will be reflected in the shadow, |
| and the KVM driver can satisfy guest loads from the shadow image without |
| needing to wait for the emulation program. |
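
Putting these operations together, an emulation program's MMIO service
loop might be sketched as follows (``struct kvm_user_mmio`` is the
illustrative structure from the data structures section, the descriptor
is assumed to be non-blocking, and ``emulate_mmio()`` is an assumed hook
into the device model):

::

  #include <poll.h>
  #include <unistd.h>

  static void mmio_loop(int slave_fd)
  {
      struct kvm_user_mmio req;

      for (;;) {
          struct pollfd pfd = { .fd = slave_fd, .events = POLLIN };

          poll(&pfd, 1, -1);                       /* wait for pending MMIOs */
          while (read(slave_fd, &req, sizeof(req)) == sizeof(req)) {
              emulate_mmio(&req);                  /* run the device model   */
              write(slave_fd, &req, sizeof(req));  /* reply; matched by seq  */
          }
      }
  }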
| |
| kvm\_io\_device ops |
| ^^^^^^^^^^^^^^^^^^^ |
| |
Each KVM per-CPU thread can handle MMIO operations on behalf of the guest
| VM. KVM will use the MMIO's guest physical address to search for a |
| matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM |
| driver instead of exiting back to QEMU. If a match is found, the |
| corresponding callback will be invoked. |
| |
| - read |
| |
| This callback is invoked when the guest performs a load to the device. |
| Loads with side-effects must be handled synchronously, with the KVM |
| driver putting the QEMU thread to sleep waiting for the emulation |
| process reply before re-starting the guest. Loads that do not have |
| side-effects may be optimized by satisfying them from the shadow image, |
| if there are no outstanding stores to the device by this CPU. PCI memory |
| ordering demands that a load cannot complete before all older stores to |
| the same device have been completed. |
| |
| - write |
| |
| Stores can be handled asynchronously unless the pending MMIO request |
| queue is full. In this case, the QEMU thread must sleep waiting for |
| space in the queue. Stores will increment the number of posted stores in |
| the per-CPU scoreboard, in order to implement the PCI ordering |
| constraint above. |
| |
| interrupt acceleration |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| This performance optimization would work much like a vhost user |
| application does, where the QEMU process sets up *eventfds* that cause |
| the device's corresponding interrupt to be triggered by the KVM driver. |
| These irq file descriptors are sent to the emulation process at |
| initialization, and are used when the emulation code raises a device |
| interrupt. |
| |
| intx acceleration |
| ''''''''''''''''' |
| |
| Traditional PCI pin interrupts are level based, so, in addition to an |
| irq file descriptor, a re-sampling file descriptor needs to be sent to |
| the emulation program. This second file descriptor allows multiple |
| devices sharing an irq to be notified when the interrupt has been |
| acknowledged by the guest, so they can re-trigger the interrupt if their |
| device has not de-asserted its interrupt. |
| |
intx irq descriptor
""""""""""""""""""""
| |
The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
| The interrupt route can be found with |
| ``pci_device_route_intx_to_irq()``. |
| |
intx routing changes
"""""""""""""""""""""
| |
| Intx routing can be changed when the guest programs the APIC the device |
| pin is connected to. The proxy object in QEMU will use |
| ``pci_device_set_intx_routing_notifier()`` to be informed of any guest |
| changes to the route. This handler will broadly follow the VFIO |
| interrupt logic to change the route: de-assigning the existing irq |
| descriptor from its route, then assigning it the new route. (see |
| ``vfio_intx_update()``) |
| |
| MSI/X acceleration |
| '''''''''''''''''' |
| |
| MSI/X interrupts are sent as DMA transactions to the host. The interrupt |
data contains a vector that is programmed by the guest. A device may have
| multiple MSI interrupts associated with it, so multiple irq descriptors |
| may need to be sent to the emulation program. |
| |
MSI/X irq descriptor
"""""""""""""""""""""
| |
| This case will also follow the VFIO example. For each MSI/X interrupt, |
| an *eventfd* is created, a virtual interrupt is allocated by |
| ``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to |
| the eventfd with ``kvm_irqchip_add_irqfd_notifier()``. |
| |
MSI/X config space changes
"""""""""""""""""""""""""""
| |
| The guest may dynamically update several MSI-related tables in the |
| device's PCI config space. These include per-MSI interrupt enables and |
| vector data. Additionally, MSIX tables exist in device memory space, not |
| config space. Much like the BAR case above, the proxy object must look |
| at guest config space programming to keep the MSI interrupt state |
| consistent between QEMU and the emulation program. |
| |
| -------------- |
| |
| Disaggregated CPU emulation |
| --------------------------- |
| |
| After IO services have been disaggregated, a second phase would be to |
| separate a process to handle CPU instruction emulation from the main |
| QEMU control function. There are no object separation points for this |
| code, so the first task would be to create one. |
| |
| Host access controls |
| -------------------- |
| |
| Separating QEMU relies on the host OS's access restriction mechanisms to |
| enforce that the differing processes can only access the objects they |
| are entitled to. There are a couple types of mechanisms usually provided |
| by general purpose OSs. |
| |
| Discretionary access control |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Discretionary access control allows each user to control who can access |
| their files. In Linux, this type of control is usually too coarse for |
| QEMU separation, since it only provides three separate access controls: |
one for the same user ID, the second for user IDs with the same group
| ID, and the third for all other user IDs. Each device instance would |
| need a separate user ID to provide access control, which is likely to be |
| unwieldy for dynamically created VMs. |
| |
| Mandatory access control |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Mandatory access control allows the OS to add an additional set of
controls on top of discretionary access control. It also
| adds other attributes to processes and files such as types, roles, and |
| categories, and can establish rules for how processes and files can |
| interact. |
| |
| Type enforcement |
| ^^^^^^^^^^^^^^^^ |
| |
| Type enforcement assigns a *type* attribute to processes and files, and |
| allows rules to be written on what operations a process with a given |
| type can perform on a file with a given type. QEMU separation could take |
| advantage of type enforcement by running the emulation processes with |
| different types, both from the main QEMU process, and from the emulation |
| processes of different classes of devices. |
| |
| For example, guest disk images and disk emulation processes could have |
| types separate from the main QEMU process and non-disk emulation |
| processes, and the type rules could prevent processes other than disk |
| emulation ones from accessing guest disk images. Similarly, network |
| emulation processes can have a type separate from the main QEMU process |
and non-network emulation processes, and only that type can access the
| host tun/tap device used to provide guest networking. |
| |
| Category enforcement |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| Category enforcement assigns a set of numbers within a given range to |
| the process or file. The process is granted access to the file if the |
| process's set is a superset of the file's set. This enforcement can be |
| used to separate multiple instances of devices in the same class. |
| |
For example, if there are multiple disk devices provided to a guest,
| each device emulation process could be provisioned with a separate |
| category. The different device emulation processes would not be able to |
| access each other's backing disk images. |
| |
| Alternatively, categories could be used in lieu of the type enforcement |
| scheme described above. In this scenario, different categories would be |
| used to prevent device emulation processes in different classes from |
| accessing resources assigned to other classes. |