| .. _vhost_user_proto: |
| |
| =================== |
| Vhost-user Protocol |
| =================== |
| |
| .. |
| Copyright 2014 Virtual Open Systems Sarl. |
| Copyright 2019 Intel Corporation |
| Licence: This work is licensed under the terms of the GNU GPL, |
| version 2 or later. See the COPYING file in the top-level |
| directory. |
| |
| .. contents:: Table of Contents |
| |
| Introduction |
| ============ |
| |
| This protocol aims to complement the ``ioctl`` interface used to |
| control the vhost implementation in the Linux kernel. It implements |
| the control plane needed to establish virtqueue sharing with a user |
| space process on the same host. It uses communication over a Unix |
| domain socket to share file descriptors in the ancillary data of the |
| message. |
| |
| The protocol defines two sides of the communication, *master* and |
| *slave*. *Master* is the application that shares its virtqueues, in |
| our case QEMU. *Slave* is the consumer of the virtqueues. |
| |
| In the current implementation QEMU is the *master*, and the *slave* is |
| the external process consuming the virtio queues, for example a |
| software Ethernet switch running in user space, such as Snabbswitch, |
| or a block device backend processing read and write requests to a virtual |
| disk. In order to facilitate interoperability between various backend |
| implementations, it is recommended to follow the :ref:`Backend program |
| conventions <backend_conventions>`. |
| |
| *Master* and *slave* can be either a client (i.e. connecting) or |
| server (listening) in the socket communication. |
| |
| Support for platforms other than Linux |
| -------------------------------------- |
| |
| While vhost-user was initially developed targeting Linux, nowadays it |
| is supported on any platform that provides the following features: |
| |
| - A way for requesting shared memory represented by a file descriptor |
| so it can be passed over a UNIX domain socket and then mapped by the |
| other process. |
| |
| - AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can |
| exchange messages through it, including ancillary data when needed. |
| |
| - Either eventfd or pipe/pipe2. On platforms where eventfd is not |
| available, QEMU will automatically fall back to pipe2 or, as a last |
| resort, pipe. Each file descriptor will be used for receiving or |
| sending events by reading an 8-byte value from, or writing one to, |
| the corresponding file descriptor. The 8-byte value itself has no |
| meaning and should not be interpreted. |
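The 8-byte event convention described above can be sketched as follows. This is a minimal illustration, not QEMU's implementation: the helper names ``event_notify`` and ``event_wait`` are hypothetical, and the descriptors are assumed to be blocking.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: signal an event by writing one 8-byte value.
 * The same call works on an eventfd or on the write end of a pipe. */
static inline int event_notify(int fd)
{
    uint64_t value = 1; /* the content carries no meaning */

    assert(fd >= 0);
    return write(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Hypothetical helper: consume one event by reading one 8-byte value.
 * The value read is discarded, since it must not be interpreted. */
static inline int event_wait(int fd)
{
    uint64_t value;

    assert(fd >= 0);
    return read(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}
```

With a pipe, ``event_notify`` is called on the write end and ``event_wait`` on the read end; with an eventfd, both use the same descriptor.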
| |
| Message Specification |
| ===================== |
| |
| .. Note:: All numbers are in the machine native byte order. |
| |
| A vhost-user message consists of 3 header fields and a payload. |
| |
| +---------+-------+------+---------+ |
| | request | flags | size | payload | |
| +---------+-------+------+---------+ |
| |
| Header |
| ------ |
| |
| :request: 32-bit type of the request |
| |
| :flags: 32-bit bit field |
| |
| - Lower 2 bits are the version (currently 0x01) |
| - Bit 2 is the reply flag - needs to be sent on each reply from the slave |
| - Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for |
| details. |
| |
| :size: 32-bit size of the payload |
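As an illustration of the ``flags`` layout above, the following sketch decodes the version and the two flag bits. The macro and function names are chosen for this example and are assumptions, not definitions taken from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names for the bit layout of the flags header field. */
#define VHOST_USER_VERSION_MASK    0x3u       /* lower 2 bits: version */
#define VHOST_USER_REPLY_MASK      (1u << 2)  /* bit 2: reply flag */
#define VHOST_USER_NEED_REPLY_MASK (1u << 3)  /* bit 3: need_reply flag */
#define VHOST_USER_VERSION         0x1u       /* current version */

static_assert((VHOST_USER_VERSION_MASK & VHOST_USER_REPLY_MASK) == 0,
              "version bits and reply bit must not overlap");

static inline uint32_t hdr_version(uint32_t flags)
{
    return flags & VHOST_USER_VERSION_MASK;
}

static inline bool hdr_is_reply(uint32_t flags)
{
    return (flags & VHOST_USER_REPLY_MASK) != 0;
}

static inline bool hdr_needs_reply(uint32_t flags)
{
    return (flags & VHOST_USER_NEED_REPLY_MASK) != 0;
}

/* Flags a slave would place on a reply: current version plus reply bit. */
static inline uint32_t hdr_reply_flags(void)
{
    return VHOST_USER_VERSION | VHOST_USER_REPLY_MASK;
}
```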
| |
| Payload |
| ------- |
| |
| Depending on the request type, **payload** can be: |
| |
| A single 64-bit integer |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +-----+ |
| | u64 | |
| +-----+ |
| |
| :u64: a 64-bit unsigned integer |
| |
| A vring state description |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +-------+-----+ |
| | index | num | |
| +-------+-----+ |
| |
| :index: a 32-bit index |
| |
| :num: a 32-bit number |
| |
| A vring address description |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +-------+-------+------+------------+------+-----------+-----+ |
| | index | flags | size | descriptor | used | available | log | |
| +-------+-------+------+------------+------+-----------+-----+ |
| |
| :index: a 32-bit vring index |
| |
| :flags: a 32-bit vring flags |
| |
| :descriptor: a 64-bit ring address of the vring descriptor table |
| |
| :used: a 64-bit ring address of the vring used ring |
| |
| :available: a 64-bit ring address of the vring available ring |
| |
| :log: a 64-bit guest address for logging |
| |
| Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has |
| been negotiated. Otherwise it is a user address. |
| |
| Memory regions description |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +-------------+---------+---------+-----+---------+ |
| | num regions | padding | region0 | ... | region7 | |
| +-------------+---------+---------+-----+---------+ |
| |
| :num regions: a 32-bit number of regions |
| |
| :padding: 32-bit |
| |
| A region is: |
| |
| +---------------+------+--------------+-------------+ |
| | guest address | size | user address | mmap offset | |
| +---------------+------+--------------+-------------+ |
| |
| :guest address: a 64-bit guest address of the region |
| |
| :size: a 64-bit size |
| |
| :user address: a 64-bit user address |
| |
| :mmap offset: 64-bit offset where region starts in the mapped memory |
| |
| Single memory region description |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +---------+---------------+------+--------------+-------------+ |
| | padding | guest address | size | user address | mmap offset | |
| +---------+---------------+------+--------------+-------------+ |
| |
| :padding: 64-bit |
| |
| :guest address: a 64-bit guest address of the region |
| |
| :size: a 64-bit size |
| |
| :user address: a 64-bit user address |
| |
| :mmap offset: 64-bit offset where region starts in the mapped memory |
| |
| Log description |
| ^^^^^^^^^^^^^^^ |
| |
| +----------+------------+ |
| | log size | log offset | |
| +----------+------------+ |
| |
| :log size: size of area used for logging |
| |
| :log offset: offset from start of supplied file descriptor where |
| logging starts (i.e. where guest address 0 would be |
| logged) |
| |
| An IOTLB message |
| ^^^^^^^^^^^^^^^^ |
| |
| +------+------+--------------+-------------------+------+ |
| | iova | size | user address | permissions flags | type | |
| +------+------+--------------+-------------------+------+ |
| |
| :iova: a 64-bit I/O virtual address programmed by the guest |
| |
| :size: a 64-bit size |
| |
| :user address: a 64-bit user address |
| |
| :permissions flags: an 8-bit value: |
| - 0: No access |
| - 1: Read access |
| - 2: Write access |
| - 3: Read/Write access |
| |
| :type: an 8-bit IOTLB message type: |
| - 1: IOTLB miss |
| - 2: IOTLB update |
| - 3: IOTLB invalidate |
| - 4: IOTLB access fail |
| |
| Virtio device config space |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +--------+------+-------+---------+ |
| | offset | size | flags | payload | |
| +--------+------+-------+---------+ |
| |
| :offset: a 32-bit offset of virtio device's configuration space |
| |
| :size: a 32-bit configuration space access size in bytes |
| |
| :flags: a 32-bit value: |
| - 0: Vhost master messages used for writeable fields |
| - 1: Vhost master messages used for live migration |
| |
| :payload: Size bytes array holding the contents of the virtio |
| device's configuration space |
| |
| Vring area description |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| +-----+------+--------+ |
| | u64 | size | offset | |
| +-----+------+--------+ |
| |
| :u64: a 64-bit integer containing the vring index and flags |
| |
| :size: a 64-bit size of this area |
| |
| :offset: a 64-bit offset of this area from the start of the |
| supplied file descriptor |
| |
| Inflight description |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| +-----------+-------------+------------+------------+ |
| | mmap size | mmap offset | num queues | queue size | |
| +-----------+-------------+------------+------------+ |
| |
| :mmap size: a 64-bit size of area to track inflight I/O |
| |
| :mmap offset: a 64-bit offset of this area from the start |
| of the supplied file descriptor |
| |
| :num queues: a 16-bit number of virtqueues |
| |
| :queue size: a 16-bit size of virtqueues |
| |
| C structure |
| ----------- |
| |
| In QEMU the vhost-user message is implemented with the following struct: |
| |
| .. code:: c |
| |
|   typedef struct VhostUserMsg { |
|       VhostUserRequest request; |
|       uint32_t flags; |
|       uint32_t size; |
|       union { |
|           uint64_t u64; |
|           struct vhost_vring_state state; |
|           struct vhost_vring_addr addr; |
|           VhostUserMemory memory; |
|           VhostUserLog log; |
|           struct vhost_iotlb_msg iotlb; |
|           VhostUserConfig config; |
|           VhostUserVringArea area; |
|           VhostUserInflight inflight; |
|       }; |
|   } QEMU_PACKED VhostUserMsg; |
| |
| Communication |
| ============= |
| |
| The protocol for vhost-user is based on the existing implementation of |
| vhost for the Linux Kernel. Most messages that can be sent via the |
| Unix domain socket implementing vhost-user have an equivalent ioctl to |
| the kernel implementation. |
| |
| The communication consists of *master* sending message requests and |
| *slave* sending message replies. Most of the requests don't require |
| replies. Here is a list of the ones that do: |
| |
| * ``VHOST_USER_GET_FEATURES`` |
| * ``VHOST_USER_GET_PROTOCOL_FEATURES`` |
| * ``VHOST_USER_GET_VRING_BASE`` |
| * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) |
| * ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) |
| |
| .. seealso:: |
| |
| :ref:`REPLY_ACK <reply_ack>` |
| The section on ``REPLY_ACK`` protocol extension. |
| |
| There are several messages that the master sends with file descriptors passed |
| in the ancillary data: |
| |
| * ``VHOST_USER_SET_MEM_TABLE`` |
| * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``) |
| * ``VHOST_USER_SET_LOG_FD`` |
| * ``VHOST_USER_SET_VRING_KICK`` |
| * ``VHOST_USER_SET_VRING_CALL`` |
| * ``VHOST_USER_SET_VRING_ERR`` |
| * ``VHOST_USER_SET_SLAVE_REQ_FD`` |
| * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) |
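File descriptors travel in the ancillary data as ``SCM_RIGHTS`` control messages. The sketch below shows one way to send and receive a buffer together with descriptors over an ``AF_UNIX`` socket. The helper names and the ``MAX_FDS`` limit of 8 are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 8 /* assumed upper bound on descriptors per message */

/* Control buffer, aligned for struct cmsghdr. */
union fd_control {
    char buf[CMSG_SPACE(MAX_FDS * sizeof(int))];
    struct cmsghdr align;
};

/* Send len bytes from buf, attaching nfds descriptors as SCM_RIGHTS. */
static inline ssize_t send_with_fds(int sock, const void *buf, size_t len,
                                    const int *fds, size_t nfds)
{
    union fd_control control;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    assert(nfds <= MAX_FDS);
    if (nfds > 0) {
        memset(&control, 0, sizeof(control));
        msg.msg_control = control.buf;
        msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));
    }
    return sendmsg(sock, &msg, 0);
}

/* Receive up to len bytes into buf. Any descriptors found in the
 * ancillary data are copied to fds (room for MAX_FDS) and counted in *nfds. */
static inline ssize_t recv_with_fds(int sock, void *buf, size_t len,
                                    int *fds, size_t *nfds)
{
    union fd_control control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
    };
    ssize_t n = recvmsg(sock, &msg, 0);

    *nfds = 0;
    if (n < 0) {
        return n;
    }
    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            size_t count = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
            if (count > MAX_FDS) {
                count = MAX_FDS;
            }
            memcpy(fds, CMSG_DATA(cmsg), count * sizeof(int));
            *nfds = count;
        }
    }
    return n;
}
```

A real receiver would additionally read the fixed-size header first, validate ``size``, and close any descriptors it did not expect.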
| |
| If *master* is unable to send the full message or receives a wrong |
| reply it will close the connection. An optional reconnection mechanism |
| can be implemented. |
| |
| If *slave* detects some error such as incompatible features, it may also |
| close the connection. This should only happen in exceptional circumstances. |
| |
| Any protocol extensions are gated by protocol feature bits, which |
| allows full backwards compatibility on both master and slave. As |
| older slaves don't support negotiating protocol features, a feature |
| bit was dedicated for this purpose:: |
| |
| #define VHOST_USER_F_PROTOCOL_FEATURES 30 |
| |
| Starting and stopping rings |
| --------------------------- |
| |
| Client must only process each ring when it is started. |
| |
| Client must only pass data between the ring and the backend, when the |
| ring is enabled. |
| |
| If ring is started but disabled, client must process the ring without |
| talking to the backend. |
| |
| For example, for a networking device, in the disabled state client |
| must not supply any new RX packets, but must process and discard any |
| TX packets. |
| |
| If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the |
| ring is initialized in an enabled state. |
| |
| If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is |
| initialized in a disabled state. Client must not pass data to/from the |
| backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with |
| parameter 1, or after it has been disabled by |
| ``VHOST_USER_SET_VRING_ENABLE`` with parameter 0. |
| |
| Each ring is initialized in a stopped state. Client must not process |
| it until ring is started, or after it has been stopped. |
| |
| Client must start ring upon receiving a kick (that is, detecting that |
| file descriptor is readable) on the descriptor specified by |
| ``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message |
| ``VHOST_USER_VRING_KICK`` if negotiated, and stop ring upon receiving |
| ``VHOST_USER_GET_VRING_BASE``. |
| |
| While processing the rings (whether they are enabled or not), client |
| must support changing some configuration aspects on the fly. |
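The started/stopped and enabled/disabled states are independent of each other. A minimal sketch of the rules in this section, using hypothetical type and function names, could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-ring state tracked by the process consuming the ring. */
struct ring_state {
    bool started; /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled; /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

/* Rings begin stopped. They begin enabled only when
 * VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated. */
static inline void ring_init(struct ring_state *r, bool protocol_features)
{
    assert(r != NULL);
    r->started = false;
    r->enabled = !protocol_features;
}

static inline void ring_on_kick(struct ring_state *r)
{
    r->started = true;
}

static inline void ring_on_get_vring_base(struct ring_state *r)
{
    r->started = false;
}

static inline void ring_on_set_vring_enable(struct ring_state *r, unsigned param)
{
    r->enabled = (param != 0);
}

/* The ring may be processed whenever it is started. */
static inline bool ring_may_process(const struct ring_state *r)
{
    return r->started;
}

/* Data may reach the backend only if the ring is started and enabled. */
static inline bool ring_may_pass_data(const struct ring_state *r)
{
    return r->started && r->enabled;
}
```

For a networking device, a ring for which ``ring_may_process`` is true but ``ring_may_pass_data`` is false would have its TX packets processed and discarded.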
| |
| Multiple queue support |
| ---------------------- |
| |
| Many devices have a fixed number of virtqueues. In this case the master |
| already knows the number of available virtqueues without communicating with the |
| slave. |
| |
| Some devices do not have a fixed number of virtqueues. Instead the maximum |
| number of virtqueues is chosen by the slave. The number can depend on host |
| resource availability or slave implementation details. Such devices are called |
| multiple queue devices. |
| |
| Multiple queue support allows the slave to advertise the maximum number of |
| queues. This is treated as a protocol extension, hence the slave has to |
| implement protocol features first. The multiple queues feature is supported |
| only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set. |
| |
| The max number of queues the slave supports can be queried with message |
| ``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested |
| queues is bigger than that. |
| |
| As all queues share one connection, the master uses a unique index for each |
| queue in the sent message to identify a specified queue. |
| |
| The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``. |
| vhost-user-net has historically automatically enabled the first queue pair. |
| |
| Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol |
| feature, even for devices with a fixed number of virtqueues, since it is simple |
| to implement and offers a degree of introspection. |
| |
| Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for |
| devices with a fixed number of virtqueues. Only true multiqueue devices |
| require this protocol feature. |
| |
| Migration |
| --------- |
| |
| During live migration, the master may need to track the modifications |
| the slave makes to the memory mapped regions. The slave should mark |
| the dirty pages in a log. Once it complies with this logging, it may |
| declare the ``VHOST_F_LOG_ALL`` vhost feature. |
| |
| To start/stop logging of data/used ring writes, the master may send |
| messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and |
| ``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's |
| flags set to 1/0, respectively. |
| |
| All the modifications to memory pointed by vring "descriptor" should |
| be marked. Modifications to "used" vring should be marked if |
| ``VHOST_VRING_F_LOG`` is part of ring's flags. |
| |
| Dirty pages are of size:: |
| |
| #define VHOST_LOG_PAGE 0x1000 |
| |
| The log memory fd is provided in the ancillary data of |
| ``VHOST_USER_SET_LOG_BASE`` message when the slave has |
| ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature. |
| |
| The size of the log is supplied as part of ``VhostUserMsg`` which |
| should be large enough to cover all known guest addresses. Log starts |
| at the supplied offset in the supplied file descriptor. The log |
| covers from address 0 to the maximum of guest regions. In pseudo-code, |
| to mark page at ``addr`` as dirty:: |
| |
| page = addr / VHOST_LOG_PAGE |
| log[page / 8] |= 1 << (page % 8) |
| |
| Where ``addr`` is the guest physical address. |
| |
| Use atomic operations, as the log may be concurrently manipulated. |
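One way to express the pseudo-code with an atomic update is shown below. It is a sketch under assumptions: the function name is hypothetical, the log is treated as an array of C11 ``_Atomic`` bytes, and ``log_size`` is the log size in bytes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Mark the page containing guest physical address addr as dirty.
 * dirty_log points at the start of the log (after the supplied offset). */
static inline void log_mark_dirty(_Atomic uint8_t *dirty_log,
                                  uint64_t log_size, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    /* The log must be large enough to cover the address. */
    assert(page / 8 < log_size);

    /* Atomic OR so that concurrent writers do not lose each other's bits. */
    atomic_fetch_or(&dirty_log[page / 8], (uint8_t)(1u << (page % 8)));
}
```

For example, address ``0x9000`` lies in page 9, which is bit 1 of byte 1 of the log.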
| |
| Note that when logging modifications to the used ring (when |
| ``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should |
| be used to calculate the log offset: the write to first byte of the |
| used ring is logged at this offset from log start. Also note that this |
| value might be outside the legal guest physical address range |
| (i.e. does not have to be covered by the ``VhostUserMemory`` table), but |
| the bit offset of the last byte of the ring must fall within the size |
| supplied by ``VhostUserLog``. |
| |
| ``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in |
| ancillary data. It may be used to inform the master that the log has |
| been modified. |
| |
| Once the source has finished migration, rings will be stopped by the |
| source. No further updates may be made before rings are restarted. |
| |
| In postcopy migration the slave is started before all the memory has |
| been received from the source host, and care must be taken to avoid |
| accessing pages that have yet to be received. The slave opens a |
| 'userfault'-fd and registers the memory with it; this fd is then |
| passed back over to the master. The master services requests on the |
| userfaultfd for pages that are accessed and when the page is available |
| it performs WAKE ioctls on the userfaultfd to wake the stalled |
| slave. The slave indicates support for this via the |
| ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. |
| |
| Memory access |
| ------------- |
| |
| The master sends a list of vhost memory regions to the slave using the |
| ``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base |
| addresses: a guest address and a user address. |
| |
| Messages contain guest addresses and/or user addresses to reference locations |
| within the shared memory. The mapping of these addresses works as follows. |
| |
| User addresses map to the vhost memory region containing that user address. |
| |
| When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated: |
| |
| * Guest addresses map to the vhost memory region containing that guest |
| address. |
| |
| When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated: |
| |
| * Guest addresses are also called I/O virtual addresses (IOVAs). They are |
| translated to user addresses via the IOTLB. |
| |
| * The vhost memory region guest address is not used. |
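The region lookup described above can be sketched as follows for the case where ``VIRTIO_F_IOMMU_PLATFORM`` has not been negotiated. The structure and function names are hypothetical, and the sketch assumes that each region's file descriptor was mapped from file offset 0, so the region's data begins ``mmap_offset`` bytes into the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local bookkeeping for one vhost memory region. */
struct mem_region {
    uint64_t guest_addr;  /* guest address of the region */
    uint64_t size;        /* size of the region in bytes */
    uint64_t user_addr;   /* user address in the master's address space */
    uint64_t mmap_offset; /* offset where the region starts in the mapping */
    uint8_t *mmap_base;   /* assumption: local address returned by mmap() */
};

/* Translate a guest address to a local pointer, or NULL if unmapped. */
static inline void *guest_to_local(const struct mem_region *regions, size_t n,
                                   uint64_t guest_addr)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        if (guest_addr >= r->guest_addr &&
            guest_addr - r->guest_addr < r->size) {
            return r->mmap_base + r->mmap_offset + (guest_addr - r->guest_addr);
        }
    }
    return NULL;
}

/* Translate a master user address (e.g. a ring address) the same way. */
static inline void *user_to_local(const struct mem_region *regions, size_t n,
                                  uint64_t user_addr)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        if (user_addr >= r->user_addr && user_addr - r->user_addr < r->size) {
            return r->mmap_base + r->mmap_offset + (user_addr - r->user_addr);
        }
    }
    return NULL;
}
```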
| |
| IOMMU support |
| ------------- |
| |
| When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the |
| master sends IOTLB entry updates and invalidations by sending |
| ``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct |
| vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload |
| has to be filled with the update message type (2), the I/O virtual |
| address, the size, the user virtual address, and the permissions |
| flags. Addresses and size must be within vhost memory regions set via |
| the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the |
| ``iotlb`` payload has to be filled with the invalidation message type |
| (3), the I/O virtual address and the size. On success, the slave is |
| expected to reply with a zero payload, non-zero otherwise. |
| |
| The slave relies on the slave communication channel (see :ref:`Slave |
| communication <slave_communication>` section below) to send IOTLB miss |
| and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG`` |
| requests to the master with a ``struct vhost_iotlb_msg`` as |
| payload. For miss events, the iotlb payload has to be filled with the |
| miss message type (1), the I/O virtual address and the permissions |
| flags. For access failure events, the iotlb payload has to be filled |
| with the access failure message type (4), the I/O virtual address and |
| the permissions flags. For synchronization purposes, the slave may |
| rely on the reply-ack feature: the master sends a reply when the |
| operation is completed if the reply-ack feature is negotiated and the |
| slave requests a reply. For miss events, a completed operation means |
| that either the master sent an update message containing the IOTLB entry |
| that covers the requested address and permission, or the master sent |
| nothing because the IOTLB miss message is invalid (invalid IOVA or permission). |
| |
| The master isn't expected to take the initiative to send IOTLB update |
| messages, as the slave sends IOTLB miss messages for the guest virtual |
| memory areas it needs to access. |
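To make the field requirements for each IOTLB event concrete, the sketch below builds miss, update and invalidate payloads. The structure mirrors the IOTLB message layout described earlier, but the type, enum and function names are hypothetical; real code would use ``struct vhost_iotlb_msg`` from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Permission and type values as listed in the IOTLB message description. */
enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1, IOTLB_PERM_WO = 2, IOTLB_PERM_RW = 3 };
enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2, IOTLB_INVALIDATE = 3, IOTLB_ACCESS_FAIL = 4 };

/* Hypothetical stand-in for the IOTLB payload. */
struct iotlb_msg {
    uint64_t iova;
    uint64_t size;
    uint64_t uaddr;
    uint8_t perm;
    uint8_t type;
};

/* Slave to master: miss events carry the IOVA and the permissions. */
static inline struct iotlb_msg iotlb_make_miss(uint64_t iova, uint8_t perm)
{
    assert(perm <= IOTLB_PERM_RW);
    return (struct iotlb_msg){ .iova = iova, .perm = perm, .type = IOTLB_MISS };
}

/* Master to slave: updates carry IOVA, size, user address and permissions. */
static inline struct iotlb_msg iotlb_make_update(uint64_t iova, uint64_t size,
                                                 uint64_t uaddr, uint8_t perm)
{
    assert(perm <= IOTLB_PERM_RW);
    return (struct iotlb_msg){ .iova = iova, .size = size, .uaddr = uaddr,
                               .perm = perm, .type = IOTLB_UPDATE };
}

/* Master to slave: invalidations carry only the IOVA and the size. */
static inline struct iotlb_msg iotlb_make_invalidate(uint64_t iova, uint64_t size)
{
    return (struct iotlb_msg){ .iova = iova, .size = size,
                               .type = IOTLB_INVALIDATE };
}
```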
| |
| .. _slave_communication: |
| |
| Slave communication |
| ------------------- |
| |
| An optional communication channel is provided if the slave declares |
| ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the |
| slave to make requests to the master. |
| |
| The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data. |
| |
| A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master |
| using this fd communication channel. |
| |
| If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is |
| negotiated, slave can send file descriptors (at most 8 descriptors in |
| each message) to master via ancillary data using this fd communication |
| channel. |
| |
| Inflight I/O tracking |
| --------------------- |
| |
| To support reconnecting after restart or crash, the slave may need to |
| resubmit inflight I/Os. If the virtqueue is processed in order, this is |
| easily achieved by getting the inflight descriptors from the |
| descriptor table (split virtqueue) or descriptor ring (packed |
| virtqueue). However, this does not work when descriptors are processed |
| out of order, because some entries which store the information of |
| inflight descriptors in the available ring (split virtqueue) or descriptor |
| ring (packed virtqueue) might be overwritten by new entries. To solve |
| this problem, the slave needs to allocate an extra buffer to store the |
| information about inflight descriptors and share it with the master for |
| persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and |
| ``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer |
| between master and slave. The format of this buffer is described |
| below: |
| |
| +---------------+---------------+-----+---------------+ |
| | queue0 region | queue1 region | ... | queueN region | |
| +---------------+---------------+-----+---------------+ |
| |
| N is the number of available virtqueues. The slave can get it from the |
| ``num queues`` field of ``VhostUserInflight``. |
| |
| For split virtqueue, queue region can be implemented as: |
| |
| .. code:: c |
| |
| typedef struct DescStateSplit { |
| /* Indicate whether this descriptor is inflight or not. |
| * Only available for head-descriptor. */ |
| uint8_t inflight; |
| |
| /* Padding */ |
| uint8_t padding[5]; |
| |
| /* Maintain a list for the last batch of used descriptors. |
| * Only available when batching is used for submitting */ |
| uint16_t next; |
| |
| /* Used to preserve the order of fetching available descriptors. |
| * Only available for head-descriptor. */ |
| uint64_t counter; |
| } DescStateSplit; |
| |
| typedef struct QueueRegionSplit { |
| /* The feature flags of this region. Now it's initialized to 0. */ |
| uint64_t features; |
| |
| /* The version of this region. It's 1 currently. |
| * Zero value indicates an uninitialized buffer */ |
| uint16_t version; |
| |
| /* The size of DescStateSplit array. It's equal to the virtqueue |
| * size. Slave could get it from queue size field of VhostUserInflight. */ |
| uint16_t desc_num; |
| |
| /* The head of list that track the last batch of used descriptors. */ |
| uint16_t last_batch_head; |
| |
| /* Store the idx value of used ring */ |
| uint16_t used_idx; |
| |
| /* Used to track the state of each descriptor in descriptor table */ |
| DescStateSplit desc[]; |
| } QueueRegionSplit; |
| |
| To track inflight I/O, the queue region should be processed as follows: |
| |
| When receiving available buffers from the driver: |
| |
| #. Get the next available head-descriptor index from available ring, ``i`` |
| |
| #. Set ``desc[i].counter`` to the value of global counter |
| |
| #. Increase global counter by 1 |
| |
| #. Set ``desc[i].inflight`` to 1 |
| |
| When supplying used buffers to the driver: |
| |
| 1. Get corresponding used head-descriptor index, ``i`` |
| |
| 2. Set ``desc[i].next`` to ``last_batch_head`` |
| |
| 3. Set ``last_batch_head`` to ``i`` |
| |
| #. Steps 1,2,3 may be performed repeatedly if batching is possible |
| |
| #. Increase the ``idx`` value of used ring by the size of the batch |
| |
| #. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0 |
| |
| #. Set ``used_idx`` to the ``idx`` value of used ring |
| |
| When reconnecting: |
| |
| #. If the value of ``used_idx`` does not match the ``idx`` value of |
| used ring (which means the ``inflight`` field of ``DescStateSplit`` |
| entries in the last batch may be incorrect), |
| |
| a. Subtract the value of ``used_idx`` from the ``idx`` value of |
| used ring to get last batch size of ``DescStateSplit`` entries |
| |
| #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch |
| list which starts from ``last_batch_head`` |
| |
| #. Set ``used_idx`` to the ``idx`` value of used ring |
| |
| #. Resubmit inflight ``DescStateSplit`` entries in order of their |
| counter value |
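The split-virtqueue procedure above can be sketched in C as follows. This is an illustration under assumptions, not the reference implementation: the function names are hypothetical, the ``desc`` array has a fixed ``QUEUE_SIZE`` instead of being a flexible array member, the global counter is passed in by the caller, and the used ring's ``idx`` value is supplied as a parameter rather than read from shared memory.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 8 /* assumed virtqueue size for this sketch */

typedef struct DescStateSplit {
    uint8_t inflight;
    uint8_t padding[5];
    uint16_t next;
    uint64_t counter;
} DescStateSplit;

typedef struct QueueRegionSplit {
    uint64_t features;
    uint16_t version;
    uint16_t desc_num;
    uint16_t last_batch_head;
    uint16_t used_idx;
    DescStateSplit desc[QUEUE_SIZE]; /* fixed size only for the sketch */
} QueueRegionSplit;

/* Receiving an available buffer whose head-descriptor index is i. */
static inline void inflight_on_avail(QueueRegionSplit *q, uint64_t *counter,
                                     uint16_t i)
{
    assert(i < q->desc_num);
    q->desc[i].counter = (*counter)++;
    q->desc[i].inflight = 1;
}

/* Used path, steps 1-3: link head i into the current batch list. */
static inline void inflight_batch_add(QueueRegionSplit *q, uint16_t i)
{
    assert(i < q->desc_num);
    q->desc[i].next = q->last_batch_head;
    q->last_batch_head = i;
}

/* Used path, final steps: called after the used ring idx has been
 * increased by batch_size; ring_used_idx is the new idx value. */
static inline void inflight_batch_commit(QueueRegionSplit *q,
                                         uint16_t batch_size,
                                         uint16_t ring_used_idx)
{
    uint16_t i = q->last_batch_head;

    while (batch_size-- > 0) {
        q->desc[i].inflight = 0;
        i = q->desc[i].next;
    }
    q->used_idx = ring_used_idx;
}

/* Reconnect: complete a commit that was interrupted by a crash. */
static inline void inflight_recover(QueueRegionSplit *q, uint16_t ring_used_idx)
{
    if (q->used_idx != ring_used_idx) {
        uint16_t batch = (uint16_t)(ring_used_idx - q->used_idx);
        inflight_batch_commit(q, batch, ring_used_idx);
    }
}

/* Return the inflight head with the lowest counter, or -1 if none remain.
 * Calling this repeatedly while resubmitting preserves the fetch order. */
static inline int inflight_next_to_resubmit(const QueueRegionSplit *q)
{
    int best = -1;

    for (uint16_t i = 0; i < q->desc_num; i++) {
        if (q->desc[i].inflight &&
            (best < 0 || q->desc[i].counter < q->desc[best].counter)) {
            best = i;
        }
    }
    return best;
}
```

A crash between advancing the used ring and clearing the ``inflight`` fields leaves ``used_idx`` behind the ring's ``idx``. ``inflight_recover`` detects the difference and clears the entries of the last batch before anything is resubmitted.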
| |
| For packed virtqueue, queue region can be implemented as: |
| |
| .. code:: c |
| |
| typedef struct DescStatePacked { |
| /* Indicate whether this descriptor is inflight or not. |
| * Only available for head-descriptor. */ |
| uint8_t inflight; |
| |
| /* Padding */ |
| uint8_t padding; |
| |
| /* Link to the next free entry */ |
| uint16_t next; |
| |
| /* Link to the last entry of descriptor list. |
| * Only available for head-descriptor. */ |
| uint16_t last; |
| |
| /* The length of descriptor list. |
| * Only available for head-descriptor. */ |
| uint16_t num; |
| |
| /* Used to preserve the order of fetching available descriptors. |
| * Only available for head-descriptor. */ |
| uint64_t counter; |
| |
| /* The buffer id */ |
| uint16_t id; |
| |
| /* The descriptor flags */ |
| uint16_t flags; |
| |
| /* The buffer length */ |
| uint32_t len; |
| |
| /* The buffer address */ |
| uint64_t addr; |
| } DescStatePacked; |
| |
| typedef struct QueueRegionPacked { |
| /* The feature flags of this region. Now it's initialized to 0. */ |
| uint64_t features; |
| |
| /* The version of this region. It's 1 currently. |
| * Zero value indicates an uninitialized buffer */ |
| uint16_t version; |
| |
| /* The size of DescStatePacked array. It's equal to the virtqueue |
| * size. Slave could get it from queue size field of VhostUserInflight. */ |
| uint16_t desc_num; |
| |
| /* The head of free DescStatePacked entry list */ |
| uint16_t free_head; |
| |
| /* The old head of free DescStatePacked entry list */ |
| uint16_t old_free_head; |
| |
| /* The used index of descriptor ring */ |
| uint16_t used_idx; |
| |
| /* The old used index of descriptor ring */ |
| uint16_t old_used_idx; |
| |
| /* Device ring wrap counter */ |
| uint8_t used_wrap_counter; |
| |
| /* The old device ring wrap counter */ |
| uint8_t old_used_wrap_counter; |
| |
| /* Padding */ |
| uint8_t padding[7]; |
| |
| /* Used to track the state of each descriptor fetched from descriptor ring */ |
| DescStatePacked desc[]; |
| } QueueRegionPacked; |
| |
| To track inflight I/O, the queue region should be processed as follows: |
| |
| When receiving available buffers from the driver: |
| |
| #. Get the next available descriptor entry from descriptor ring, ``d`` |
| |
| #. If ``d`` is head descriptor, |
| |
| a. Set ``desc[old_free_head].num`` to 0 |
| |
| #. Set ``desc[old_free_head].counter`` to the value of global counter |
| |
| #. Increase global counter by 1 |
| |
| #. Set ``desc[old_free_head].inflight`` to 1 |
| |
| #. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to |
| ``free_head`` |
| |
| #. Increase ``desc[old_free_head].num`` by 1 |
| |
| #. Set ``desc[free_head].addr``, ``desc[free_head].len``, |
| ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``, |
| ``d.len``, ``d.flags``, ``d.id`` |
| |
| #. Set ``free_head`` to ``desc[free_head].next`` |
| |
| #. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head`` |
| |
| When supplying used buffers to the driver: |
| |
| 1. Get corresponding used head-descriptor entry from descriptor ring, |
| ``d`` |
| |
| 2. Get corresponding ``DescStatePacked`` entry, ``e`` |
| |
| 3. Set ``desc[e.last].next`` to ``free_head`` |
| |
| 4. Set ``free_head`` to the index of ``e`` |
| |
| #. Steps 1,2,3,4 may be performed repeatedly if batching is possible |
| |
| #. Increase ``used_idx`` by the size of the batch and update |
| ``used_wrap_counter`` if needed |
| |
| #. Update ``d.flags`` |
| |
| #. Set the ``inflight`` field of each head ``DescStatePacked`` entry |
| in the batch to 0 |
| |
| #. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter`` |
| to ``free_head``, ``used_idx``, ``used_wrap_counter`` |
| |
| When reconnecting: |
| |
| #. If ``used_idx`` does not match ``old_used_idx`` (means the |
| ``inflight`` field of ``DescStatePacked`` entries in last batch may |
| be incorrect), |
| |
| a. Get the next descriptor ring entry through ``old_used_idx``, ``d`` |
| |
| #. Use ``old_used_wrap_counter`` to calculate the available flags |
| |
| #. If ``d.flags`` is not equal to the calculated flags value (means |
| slave has submitted the buffer to guest driver before crash, so |
| it has to commit the in-progress update), set ``old_free_head``, |
| ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``, |
| ``used_idx``, ``used_wrap_counter`` |
| |
| #. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to |
| ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter`` |
| (roll back any in-progress update) |
| |
| #. Set the ``inflight`` field of each ``DescStatePacked`` entry in |
| free list to 0 |
| |
| #. Resubmit inflight ``DescStatePacked`` entries in order of their |
| counter value |
| |
| In-band notifications |
| --------------------- |
| |
| In some limited situations (e.g. for simulation) it is desirable to |
| have the kick, call and error (if used) signals done via in-band |
| messages instead of asynchronous eventfd notifications. This can be |
| done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` |
| protocol feature. |
| |
| Note that because too many messages on the sockets can |
| cause the sending application(s) to block, it is not advised to use |
| this feature unless absolutely necessary. It is also considered an |
| error to negotiate this feature without also negotiating |
| ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``. |
| The former is necessary for getting a message channel from the slave |
| to the master, while the latter needs to be used with the in-band |
| notification messages to block until they are processed, both to avoid |
| blocking later and for proper processing (at least in the simulation |
| use case). As it has no other way of signalling this error, the slave |
| should close the connection as a response to a |
| ``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band |
| notifications feature flag without the other two. |
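The dependency check a slave might run when it receives ``VHOST_USER_SET_PROTOCOL_FEATURES`` can be sketched as below. The bit numbers are those from the protocol features list; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK            3
#define VHOST_USER_PROTOCOL_F_SLAVE_REQ            5
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

/* Test a single bit in a 64-bit feature mask. */
static inline bool has_feature(uint64_t features, unsigned bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1) != 0;
}

/* Return false when in-band notifications are requested without both
 * SLAVE_REQ and REPLY_ACK, in which case the slave should close the
 * connection. */
static inline bool inband_notifications_valid(uint64_t protocol_features)
{
    if (!has_feature(protocol_features,
                     VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
        return true;
    }
    return has_feature(protocol_features, VHOST_USER_PROTOCOL_F_SLAVE_REQ) &&
           has_feature(protocol_features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
}
```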
| |
| Protocol features |
| ----------------- |
| |
| .. code:: c |
| |
| #define VHOST_USER_PROTOCOL_F_MQ 0 |
| #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 |
| #define VHOST_USER_PROTOCOL_F_RARP 2 |
| #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3 |
| #define VHOST_USER_PROTOCOL_F_MTU 4 |
| #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5 |
| #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6 |
| #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7 |
| #define VHOST_USER_PROTOCOL_F_PAGEFAULT 8 |
| #define VHOST_USER_PROTOCOL_F_CONFIG 9 |
| #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10 |
| #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11 |
| #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12 |
| #define VHOST_USER_PROTOCOL_F_RESET_DEVICE 13 |
| #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14 |
| #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 |
| #define VHOST_USER_PROTOCOL_F_STATUS 16 |
| |
| Master message types |
| -------------------- |
| |
| ``VHOST_USER_GET_FEATURES`` |
| :id: 1 |
| :equivalent ioctl: ``VHOST_GET_FEATURES`` |
| :master payload: N/A |
| :slave payload: ``u64`` |
| |
| Get from the underlying vhost implementation the features bitmask. |
| Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support |
| for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and |
| ``VHOST_USER_SET_PROTOCOL_FEATURES``. |
| |
| ``VHOST_USER_SET_FEATURES`` |
| :id: 2 |
| :equivalent ioctl: ``VHOST_SET_FEATURES`` |
| :master payload: ``u64`` |
| |
| Enable features in the underlying vhost implementation using a |
| bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals |
| slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and |
| ``VHOST_USER_SET_PROTOCOL_FEATURES``. |
| |
| ``VHOST_USER_GET_PROTOCOL_FEATURES`` |
| :id: 15 |
| :equivalent ioctl: ``VHOST_GET_FEATURES`` |
| :master payload: N/A |
| :slave payload: ``u64`` |
| |
| Get the protocol feature bitmask from the underlying vhost |
| implementation. Only legal if feature bit |
| ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in |
| ``VHOST_USER_GET_FEATURES``. |
| |
| .. Note:: |
| A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must |
| support this message even before ``VHOST_USER_SET_FEATURES`` is |
| sent. |
| |
| ``VHOST_USER_SET_PROTOCOL_FEATURES`` |
| :id: 16 |
| :equivalent ioctl: ``VHOST_SET_FEATURES`` |
| :master payload: ``u64`` |
| |
| Enable protocol features in the underlying vhost implementation. |
| |
| Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in |
| ``VHOST_USER_GET_FEATURES``. |
| |
| .. Note:: |
| A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support |
| this message even before ``VHOST_USER_SET_FEATURES`` is sent. |
| |
| ``VHOST_USER_SET_OWNER`` |
| :id: 3 |
| :equivalent ioctl: ``VHOST_SET_OWNER`` |
| :master payload: N/A |
| |
| Issued when a new connection is established. It sets the current |
| *master* as an owner of the session. This can be used on the *slave* |
| as a "session start" flag. |
| |
| ``VHOST_USER_RESET_OWNER`` |
| :id: 4 |
| :master payload: N/A |
| |
| .. admonition:: Deprecated |
| |
| This is no longer used. It used to be sent to request disabling all |
| rings, but some clients interpreted it to also discard connection |
| state (this interpretation would lead to bugs). It is recommended |
| that clients either ignore this message, or use it to disable all |
| rings. |
| |
| ``VHOST_USER_SET_MEM_TABLE`` |
| :id: 5 |
| :equivalent ioctl: ``VHOST_SET_MEM_TABLE`` |
| :master payload: memory regions description |
| :slave payload: (postcopy only) memory regions description |
| |
| Sets the memory map regions on the slave so it can translate the |
| vring addresses. In the ancillary data there is an array of file |
| descriptors for each memory mapped region. The size and ordering of |
| the fds matches the number and ordering of memory regions. |
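| |
| Address translation on the slave side might look like the sketch |
| below. It is illustrative only: ``struct mem_region`` and |
| ``translate_user_addr`` are hypothetical names, and the sketch assumes |
| that ring addresses are expressed in the master's address space and |
| that each fd was mapped from offset 0, so that the region's mmap offset |
| has to be added to the local mapping base. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slave-side bookkeeping for one region of the memory table. */
struct mem_region {
    uint64_t user_addr;   /* region start in the master's address space */
    uint64_t size;        /* region length in bytes */
    uint64_t mmap_offset; /* offset of the region inside the mapped fd */
    uint8_t *mmap_base;   /* local address returned by mmap() on the fd */
};

/* Translate a master user address into a local pointer, or NULL if unmapped. */
static void *translate_user_addr(const struct mem_region *regions,
                                 size_t count, uint64_t addr)
{
    assert(regions != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        const struct mem_region *r = &regions[i];

        /* Written this way to avoid overflow in user_addr + size. */
        if (addr >= r->user_addr && addr - r->user_addr < r->size) {
            return r->mmap_base + r->mmap_offset + (addr - r->user_addr);
        }
    }
    return NULL;
}
```
| |
| A linear scan is enough for a sketch; a real backend may prefer a |
| sorted table if it handles many regions. |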
| |
| When ``VHOST_USER_POSTCOPY_LISTEN`` has been received, |
| ``SET_MEM_TABLE`` replies with the bases of the memory mapped |
| regions to the master. The slave must have mmap'd the regions but |
| not yet accessed them and should not yet generate a userfault |
| event. |
| |
| .. Note:: |
| ``NEED_REPLY_MASK`` is not set in this case. QEMU will then |
| reply back to the list of mappings with an empty |
| ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon |
| reception of this message may the guest start accessing the memory |
| and generating faults. |
| |
| ``VHOST_USER_SET_LOG_BASE`` |
| :id: 6 |
| :equivalent ioctl: ``VHOST_SET_LOG_BASE`` |
| :master payload: ``u64`` |
| :slave payload: N/A |
| |
| Sets logging shared memory space. |
| |
| When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol |
| feature, the log memory fd is provided in the ancillary data of the |
| ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the |
| shared memory area are provided in the message. |
| |
| ``VHOST_USER_SET_LOG_FD`` |
| :id: 7 |
| :equivalent ioctl: ``VHOST_SET_LOG_FD`` |
| :master payload: N/A |
| |
| Sets the logging file descriptor, which is passed as ancillary data. |
| |
| ``VHOST_USER_SET_VRING_NUM`` |
| :id: 8 |
| :equivalent ioctl: ``VHOST_SET_VRING_NUM`` |
| :master payload: vring state description |
| |
| Set the size of the queue. |
| |
| ``VHOST_USER_SET_VRING_ADDR`` |
| :id: 9 |
| :equivalent ioctl: ``VHOST_SET_VRING_ADDR`` |
| :master payload: vring address description |
| :slave payload: N/A |
| |
| Sets the addresses of the different aspects of the vring. |
| |
| ``VHOST_USER_SET_VRING_BASE`` |
| :id: 10 |
| :equivalent ioctl: ``VHOST_SET_VRING_BASE`` |
| :master payload: vring state description |
| |
| Sets the base offset in the available vring. |
| |
| ``VHOST_USER_GET_VRING_BASE`` |
| :id: 11 |
| :equivalent ioctl: ``VHOST_GET_VRING_BASE`` |
| :master payload: vring state description |
| :slave payload: vring state description |
| |
| Get the available vring base offset. |
| |
| ``VHOST_USER_SET_VRING_KICK`` |
| :id: 12 |
| :equivalent ioctl: ``VHOST_SET_VRING_KICK`` |
| :master payload: ``u64`` |
| |
| Set the event file descriptor for adding buffers to the vring. It is |
| passed in the ancillary data. |
| |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. This signals that polling should be used |
| instead of waiting for the kick. Note that if the protocol feature |
| ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated |
| this message isn't necessary as the ring is also started on the |
| ``VHOST_USER_VRING_KICK`` message. It may however still be used to |
| set an event file descriptor (which will be preferred over the |
| message) or to enable polling. |
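| |
| The payload layout described above (vring index in bits 0-7, invalid |
| FD flag in bit 8) can be decoded as in the following sketch. The macro |
| and function names are illustrative assumptions, not names defined by |
| this document. |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; the layout itself is taken from the text above. */
#define VRING_IDX_MASK   UINT64_C(0xff)       /* bits 0-7: vring index */
#define VRING_NOFD_MASK  (UINT64_C(1) << 8)   /* bit 8: invalid FD flag */

/* Extract the vring index from the u64 payload. */
static unsigned int vring_payload_index(uint64_t payload)
{
    return (unsigned int)(payload & VRING_IDX_MASK);
}

/* True when a file descriptor is expected in the ancillary data. */
static bool vring_payload_has_fd(uint64_t payload)
{
    return (payload & VRING_NOFD_MASK) == 0;
}

/* Build the payload on the sending side. */
static uint64_t vring_payload_encode(unsigned int index, bool has_fd)
{
    assert(index <= VRING_IDX_MASK);
    return (uint64_t)index | (has_fd ? 0 : VRING_NOFD_MASK);
}
```
| |
| The same layout is used by ``VHOST_USER_SET_VRING_CALL`` and |
| ``VHOST_USER_SET_VRING_ERR``. |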
| |
| ``VHOST_USER_SET_VRING_CALL`` |
| :id: 13 |
| :equivalent ioctl: ``VHOST_SET_VRING_CALL`` |
| :master payload: ``u64`` |
| |
| Set the event file descriptor to signal when buffers are used. It is |
| passed in the ancillary data. |
| |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. This signals that polling will be used |
| instead of waiting for the call. Note that if the protocol features |
| ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and |
| ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message |
| isn't necessary as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be |
| used. It may however still be used to set an event file descriptor |
| or to enable polling. |
| |
| ``VHOST_USER_SET_VRING_ERR`` |
| :id: 14 |
| :equivalent ioctl: ``VHOST_SET_VRING_ERR`` |
| :master payload: ``u64`` |
| |
| Set the event file descriptor to signal when an error occurs. It is |
| passed in the ancillary data. |
| |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. Note that if the protocol features |
| ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and |
| ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message |
| isn't necessary as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be |
| used. It may however still be used to set an event file descriptor |
| (which will be preferred over the message). |
| |
| ``VHOST_USER_GET_QUEUE_NUM`` |
| :id: 17 |
| :equivalent ioctl: N/A |
| :master payload: N/A |
| :slave payload: ``u64`` |
| |
| Query how many queues the backend supports. |
| |
| This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ`` |
| is set in queried protocol features by |
| ``VHOST_USER_GET_PROTOCOL_FEATURES``. |
| |
| ``VHOST_USER_SET_VRING_ENABLE`` |
| :id: 18 |
| :equivalent ioctl: N/A |
| :master payload: vring state description |
| |
| Signal the slave to enable or disable the corresponding vring. |
| |
| This request should be sent only when |
| ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated. |
| |
| ``VHOST_USER_SEND_RARP`` |
| :id: 19 |
| :equivalent ioctl: N/A |
| :master payload: ``u64`` |
| |
| Ask the vhost-user backend to broadcast a fake RARP to notify that the |
| migration has terminated, for guests that do not support GUEST_ANNOUNCE. |
| |
| Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is |
| present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit |
| ``VHOST_USER_PROTOCOL_F_RARP`` is present in |
| ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the |
| payload contain the MAC address of the guest to allow the vhost user |
| backend to construct and broadcast the fake RARP. |
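| |
| Extracting the MAC address from the payload can be sketched as below. |
| ``rarp_payload_to_mac`` and ``MAC_LEN`` are hypothetical names. The |
| sketch copies the first 6 bytes of the ``u64`` as they are laid out in |
| memory, which is how it reads the phrase "first 6 bytes of the payload". |
| |
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6 /* length of an Ethernet MAC address in bytes */

/* Copy the guest MAC address out of the first 6 bytes of the u64 payload. */
static void rarp_payload_to_mac(const uint64_t *payload, uint8_t mac[MAC_LEN])
{
    assert(payload != NULL && mac != NULL);
    memcpy(mac, payload, MAC_LEN);
}
```
| |
| Using ``memcpy`` avoids shifting and masking, so the result does not |
| depend on how the integer value of the payload is interpreted. |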
| |
| ``VHOST_USER_NET_SET_MTU`` |
| :id: 20 |
| :equivalent ioctl: N/A |
| :master payload: ``u64`` |
| |
| Set host MTU value exposed to the guest. |
| |
| This request should be sent only when ``VIRTIO_NET_F_MTU`` feature |
| has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES`` |
| is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit |
| ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in |
| ``VHOST_USER_GET_PROTOCOL_FEATURES``. |
| |
| If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must |
| respond with zero in case the specified MTU is valid, or non-zero |
| otherwise. |
| |
| ``VHOST_USER_SET_SLAVE_REQ_FD`` |
| :id: 21 |
| :equivalent ioctl: N/A |
| :master payload: N/A |
| |
| Set the socket file descriptor for slave initiated requests. It is passed |
| in the ancillary data. |
| |
| This request should be sent only when |
| ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol |
| feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in |
| ``VHOST_USER_GET_PROTOCOL_FEATURES``. If |
| ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must |
| respond with zero for success, non-zero otherwise. |
| |
| ``VHOST_USER_IOTLB_MSG`` |
| :id: 22 |
| :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type) |
| :master payload: ``struct vhost_iotlb_msg`` |
| :slave payload: ``u64`` |
| |
| Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload. |
| |
| Master sends such requests to update and invalidate entries in the |
| device IOTLB. The slave has to acknowledge the request by sending |
| zero as ``u64`` payload for success, non-zero otherwise. |
| |
| This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM`` |
| feature has been successfully negotiated. |
| |
| ``VHOST_USER_SET_VRING_ENDIAN`` |
| :id: 23 |
| :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN`` |
| :master payload: vring state description |
| |
| Set the endianness of a VQ for legacy devices. Little-endian is |
| indicated with state.num set to 0 and big-endian is indicated with |
| state.num set to 1. Other values are invalid. |
| |
| This request should be sent only when |
| ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated. |
| Backends that negotiated this feature should handle both |
| endiannesses and expect this message once (per VQ) during device |
| configuration (i.e. before the master starts the VQ). |
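| |
| A backend might validate the ``state.num`` value as in this sketch. |
| The enum and function names are hypothetical; only the mapping of 0 to |
| little-endian and 1 to big-endian comes from the description above. |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vq_endian { VQ_LITTLE_ENDIAN, VQ_BIG_ENDIAN };

/*
 * Store the endianness selected by state.num in *out and return true,
 * or return false (leaving *out untouched) for an invalid value.
 */
static bool vq_endian_from_num(unsigned int num, enum vq_endian *out)
{
    assert(out != NULL);

    switch (num) {
    case 0:
        *out = VQ_LITTLE_ENDIAN;
        return true;
    case 1:
        *out = VQ_BIG_ENDIAN;
        return true;
    default:
        return false; /* other values are invalid */
    }
}
```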
| |
| ``VHOST_USER_GET_CONFIG`` |
| :id: 24 |
| :equivalent ioctl: N/A |
| :master payload: virtio device config space |
| :slave payload: virtio device config space |
| |
| When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is |
| submitted by the vhost-user master to fetch the contents of the |
| virtio device configuration space. The vhost-user slave's payload size |
| MUST match the master's request; the slave uses a zero-length |
| payload to indicate an error to the vhost-user master. The vhost-user |
| master may cache the contents to avoid repeated |
| ``VHOST_USER_GET_CONFIG`` calls. |
| |
| ``VHOST_USER_SET_CONFIG`` |
| :id: 25 |
| :equivalent ioctl: N/A |
| :master payload: virtio device config space |
| :slave payload: N/A |
| |
| When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is |
| submitted by the vhost-user master when the guest changes the virtio |
| device configuration space, and can also be used for live migration |
| on the destination host. The vhost-user slave must check the flags |
| field, and slaves MUST NOT accept SET_CONFIG for read-only |
| configuration space fields unless the live migration bit is set. |
| |
| ``VHOST_USER_CREATE_CRYPTO_SESSION`` |
| :id: 26 |
| :equivalent ioctl: N/A |
| :master payload: crypto session description |
| :slave payload: crypto session description |
| |
| Create a session for crypto operation. The server side must return |
| the session id, 0 or positive for success, negative for failure. |
| This request should be sent only when |
| ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been |
| successfully negotiated. It's a required feature for crypto |
| devices. |
| |
| ``VHOST_USER_CLOSE_CRYPTO_SESSION`` |
| :id: 27 |
| :equivalent ioctl: N/A |
| :master payload: ``u64`` |
| |
| Close a session for crypto operation which was previously |
| created by ``VHOST_USER_CREATE_CRYPTO_SESSION``. |
| |
| This request should be sent only when |
| ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been |
| successfully negotiated. It's a required feature for crypto |
| devices. |
| |
| ``VHOST_USER_POSTCOPY_ADVISE`` |
| :id: 28 |
| :master payload: N/A |
| :slave payload: userfault fd |
| |
| When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master |
| advises the slave that a migration with postcopy enabled is underway; |
| the slave must open a userfaultfd for later use. Note that at this |
| stage the migration is still in precopy mode. |
| |
| ``VHOST_USER_POSTCOPY_LISTEN`` |
| :id: 29 |
| :master payload: N/A |
| |
| Master advises slave that a transition to postcopy mode has |
| happened. The slave must ensure that shared memory is registered |
| with userfaultfd to cause faulting of non-present pages. |
| |
| This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``, |
| and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported. |
| |
| ``VHOST_USER_POSTCOPY_END`` |
| :id: 30 |
| :slave payload: ``u64`` |
| |
| Master advises that postcopy migration has now completed. The slave |
| must disable the userfaultfd. The response is an acknowledgement |
| only. |
| |
| When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message |
| is sent at the end of the migration, after |
| ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent. |
| |
| The value returned is an error indication; 0 is success. |
| |
| ``VHOST_USER_GET_INFLIGHT_FD`` |
| :id: 31 |
| :equivalent ioctl: N/A |
| :master payload: inflight description |
| |
| When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has |
| been successfully negotiated, this message is submitted by master to |
| get a shared buffer from slave. The shared buffer will be used to |
| track inflight I/O by slave. QEMU should retrieve a new one when the VM |
| is reset. |
| |
| ``VHOST_USER_SET_INFLIGHT_FD`` |
| :id: 32 |
| :equivalent ioctl: N/A |
| :master payload: inflight description |
| |
| When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has |
| been successfully negotiated, this message is submitted by master to |
| send the shared inflight buffer back to the slave so that the slave can |
| get inflight I/O after a crash or restart. |
| |
| ``VHOST_USER_GPU_SET_SOCKET`` |
| :id: 33 |
| :equivalent ioctl: N/A |
| :master payload: N/A |
| |
| Sets the GPU protocol socket file descriptor, which is passed as |
| ancillary data. The GPU protocol is used to inform the master of |
| rendering state and updates. See vhost-user-gpu.rst for details. |
| |
| ``VHOST_USER_RESET_DEVICE`` |
| :id: 34 |
| :equivalent ioctl: N/A |
| :master payload: N/A |
| :slave payload: N/A |
| |
| Ask the vhost user backend to disable all rings and reset all |
| internal device state to the initial state, ready to be |
| reinitialized. The backend retains ownership of the device |
| throughout the reset operation. |
| |
| Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol |
| feature is set by the backend. |
| |
| ``VHOST_USER_VRING_KICK`` |
| :id: 35 |
| :equivalent ioctl: N/A |
| :master payload: vring state description |
| :slave payload: N/A |
| |
| When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol |
| feature has been successfully negotiated, this message may be |
| submitted by the master to indicate that a buffer was added to |
| the vring instead of signalling it using the vring's kick file |
| descriptor or having the slave rely on polling. |
| |
| The state.num field is currently reserved and must be set to 0. |
| |
| ``VHOST_USER_GET_MAX_MEM_SLOTS`` |
| :id: 36 |
| :equivalent ioctl: N/A |
| :slave payload: ``u64`` |
| |
| When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol |
| feature has been successfully negotiated, this message is submitted |
| by master to the slave. The slave should return the message with a |
| u64 payload containing the maximum number of memory slots for |
| QEMU to expose to the guest. The value returned by the backend |
| will be capped at the maximum number of ram slots which can be |
| supported by the target platform. |
| |
| ``VHOST_USER_ADD_MEM_REG`` |
| :id: 37 |
| :equivalent ioctl: N/A |
| :master payload: single memory region description |
| |
| When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol |
| feature has been successfully negotiated, this message is submitted |
| by the master to the slave. The message payload contains a memory |
| region descriptor struct, describing a region of guest memory which |
| the slave device must map in. Together with the |
| ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and |
| update the memory tables of the slave device. |
| |
| ``VHOST_USER_REM_MEM_REG`` |
| :id: 38 |
| :equivalent ioctl: N/A |
| :master payload: single memory region description |
| |
| When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol |
| feature has been successfully negotiated, this message is submitted |
| by the master to the slave. The message payload contains a memory |
| region descriptor struct, describing a region of guest memory which |
| the slave device must unmap. Together with the |
| ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and |
| update the memory tables of the slave device. |
| |
| ``VHOST_USER_SET_STATUS`` |
| :id: 39 |
| :equivalent ioctl: ``VHOST_VDPA_SET_STATUS`` |
| :slave payload: N/A |
| :master payload: ``u64`` |
| |
| When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been |
| successfully negotiated, this message is submitted by the master to |
| notify the backend with updated device status as defined in the Virtio |
| specification. |
| |
| ``VHOST_USER_GET_STATUS`` |
| :id: 40 |
| :equivalent ioctl: ``VHOST_VDPA_GET_STATUS`` |
| :slave payload: ``u64`` |
| :master payload: N/A |
| |
| When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been |
| successfully negotiated, this message is submitted by the master to |
| query the backend for its device status as defined in the Virtio |
| specification. |
| |
| |
| Slave message types |
| ------------------- |
| |
| ``VHOST_USER_SLAVE_IOTLB_MSG`` |
| :id: 1 |
| :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type) |
| :slave payload: ``struct vhost_iotlb_msg`` |
| :master payload: N/A |
| |
| Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload. |
| Slave sends such requests to notify of an IOTLB miss, or an IOTLB |
| access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is |
| negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master |
| must respond with zero when operation is successfully completed, or |
| non-zero otherwise. This request should be sent only when |
| ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully |
| negotiated. |
| |
| ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG`` |
| :id: 2 |
| :equivalent ioctl: N/A |
| :slave payload: N/A |
| :master payload: N/A |
| |
| When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user |
| slave sends such messages to notify that the virtio device's |
| configuration space has changed. For host devices that support |
| this feature, the host driver can send a ``VHOST_USER_GET_CONFIG`` |
| message to the slave to get the latest content. If |
| ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the slave set the |
| ``VHOST_USER_NEED_REPLY`` flag, the master must respond with zero when |
| the operation is successfully completed, or non-zero otherwise. |
| |
| ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG`` |
| :id: 3 |
| :equivalent ioctl: N/A |
| :slave payload: vring area description |
| :master payload: N/A |
| |
| Sets host notifier for a specified queue. The queue index is |
| contained in the ``u64`` field of the vring area description. The |
| host notifier is described by the file descriptor (typically it's a |
| VFIO device fd) which is passed as ancillary data and the size |
| (which is mmap size and should be the same as host page size) and |
| offset (which is mmap offset) carried in the vring area |
| description. QEMU can mmap the file descriptor based on the size and |
| offset to get a memory range. Registering a host notifier means |
| mapping this memory range to the VM as the specified queue's notify |
| MMIO region. Slave sends this request to tell QEMU to de-register |
| the existing notifier if any and register the new notifier if the |
| request is sent with a file descriptor. |
| |
| This request should be sent only when |
| ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been |
| successfully negotiated. |
| |
| ``VHOST_USER_SLAVE_VRING_CALL`` |
| :id: 4 |
| :equivalent ioctl: N/A |
| :slave payload: vring state description |
| :master payload: N/A |
| |
| When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol |
| feature has been successfully negotiated, this message may be |
| submitted by the slave to indicate that a buffer was used from |
| the vring instead of signalling this using the vring's call file |
| descriptor or having the master rely on polling. |
| |
| The state.num field is currently reserved and must be set to 0. |
| |
| ``VHOST_USER_SLAVE_VRING_ERR`` |
| :id: 5 |
| :equivalent ioctl: N/A |
| :slave payload: vring state description |
| :master payload: N/A |
| |
| When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol |
| feature has been successfully negotiated, this message may be |
| submitted by the slave to indicate that an error occurred on the |
| specific vring, instead of signalling the error file descriptor |
| set by the master via ``VHOST_USER_SET_VRING_ERR``. |
| |
| The state.num field is currently reserved and must be set to 0. |
| |
| .. _reply_ack: |
| |
| VHOST_USER_PROTOCOL_F_REPLY_ACK |
| ------------------------------- |
| |
| The original vhost-user specification only demands replies for certain |
| commands. This differs from the vhost protocol implementation where |
| commands are sent over an ``ioctl()`` call and block until the client |
| has completed. |
| |
| With this protocol extension negotiated, the sender (QEMU) can set the |
| ``need_reply`` [Bit 3] flag on any command. This indicates that the |
| client MUST respond with a Payload ``VhostUserMsg`` indicating success |
| or failure. The payload should be set to zero on success or non-zero |
| on failure, unless the message already has an explicit reply body. |
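| |
| The flag handling described above can be sketched as follows. The |
| names ``NEED_REPLY_FLAG``, ``request_flags``, ``needs_ack`` and |
| ``ack_payload`` are hypothetical; the sketch only assumes that the |
| ``need_reply`` flag is bit 3 of the message flags, as stated above. |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* need_reply is bit 3 of the message flags field. */
#define NEED_REPLY_FLAG (UINT32_C(1) << 3)

/* Sender: ask for an acknowledgement only if REPLY_ACK was negotiated. */
static uint32_t request_flags(uint32_t base_flags, bool reply_ack_negotiated)
{
    assert((base_flags & NEED_REPLY_FLAG) == 0);
    return reply_ack_negotiated ? (base_flags | NEED_REPLY_FLAG) : base_flags;
}

/* Receiver: does this request demand an acknowledgement? */
static bool needs_ack(uint32_t flags)
{
    return (flags & NEED_REPLY_FLAG) != 0;
}

/* Receiver: u64 payload of the acknowledgement, zero meaning success. */
static uint64_t ack_payload(bool success)
{
    return success ? 0 : 1;
}
```
| |
| Messages that already carry an explicit reply body would skip |
| ``ack_payload`` and send their normal reply instead. |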
| |
| The response payload gives QEMU a deterministic indication of the result |
| of the command. Today, QEMU is expected to terminate the main vhost-user |
| loop upon receiving such errors. In the future, QEMU could be taught to be more |
| resilient for selective requests. |
| |
| For the message types that already solicit a reply from the client, |
| the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit |
| being set brings no behavioural change. (See the Communication_ |
| section for details.) |
| |
| .. _backend_conventions: |
| |
| Backend program conventions |
| =========================== |
| |
| vhost-user backends can provide various devices & services and may |
| need to be configured manually depending on the use case. However, it |
| is a good idea to follow the conventions listed here when |
| possible. Users, QEMU or libvirt, can then rely on some common |
| behaviour to avoid heterogeneous configuration and management of the |
| backend programs and facilitate interoperability. |
| |
| Each backend installed on a host system should come with at least one |
| JSON file that conforms to the vhost-user.json schema. Each file |
| informs the management applications about the backend type, and binary |
| location. In addition, it defines rules for management apps for |
| picking the highest priority backend when multiple match the search |
| criteria (see ``@VhostUserBackend`` documentation in the schema file). |
| |
| If the backend is not capable of enabling a requested feature on the |
| host (such as 3D acceleration with virgl), or the initialization |
| failed, the backend should fail to start early and exit with a status |
| != 0. It may also print a message to stderr for further details. |
| |
| The backend program must not daemonize itself, but it may be |
| daemonized by the management layer. It may also have restricted |
| access to the system. |
| |
| File descriptors 0, 1 and 2 will exist, and have regular |
| stdin/stdout/stderr usage (they may have been redirected to /dev/null |
| by the management layer, or to a log handler). |
| |
| The backend program must end (as quickly and cleanly as possible) when |
| the SIGTERM signal is received. Eventually, it may be sent SIGKILL by |
| the management layer after a few seconds. |
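| |
| One common way to meet the SIGTERM requirement is to set a flag in the |
| signal handler and let the main loop exit cleanly when it sees the |
| flag. The sketch below uses only ISO C ``signal()``; the function names |
| are hypothetical and a real backend may prefer ``sigaction()`` or an |
| event-loop integration. |
| |
```c
#include <signal.h>
#include <stdbool.h>

/* Set from the signal handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested;

static void on_sigterm(int signum)
{
    (void)signum;
    stop_requested = 1; /* only async-signal-safe work happens here */
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_sigterm_handler(void)
{
    return signal(SIGTERM, on_sigterm) == SIG_ERR ? -1 : 0;
}

/* The main loop checks this between requests and exits when it is true. */
static bool stop_was_requested(void)
{
    return stop_requested != 0;
}
```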
| |
| The following command line options have an expected behaviour. They |
| are mandatory, unless explicitly said differently: |
| |
| --socket-path=PATH |
| |
| This option specifies the location of the vhost-user Unix domain socket. |
| It is incompatible with --fd. |
| |
| --fd=FDNUM |
| |
| When this argument is given, the backend program is started with the |
| vhost-user socket as file descriptor FDNUM. It is incompatible with |
| --socket-path. |
| |
| --print-capabilities |
| |
| Output to stdout the backend capabilities in JSON format, and then |
| exit successfully. Other options and arguments should be ignored, and |
| the backend program should not perform its normal function. The |
| capabilities can be reported dynamically depending on the host |
| capabilities. |
| |
| The JSON output is described in the ``vhost-user.json`` schema, by |
| ``@VHostUserBackendCapabilities``. Example: |
| |
| .. code:: json |
| |
| { |
| "type": "foo", |
| "features": [ |
| "feature-a", |
| "feature-b" |
| ] |
| } |
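| |
| A backend could produce output of this shape with a small formatting |
| helper, sketched below. ``format_capabilities`` is a hypothetical |
| function, and the sketch assumes that the type and feature strings |
| contain no characters that need JSON escaping. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write the capabilities as one line of JSON into buf.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if buf is too small.
 */
static int format_capabilities(char *buf, size_t len, const char *type,
                               const char *const *features, size_t nfeatures)
{
    size_t off;
    int n;

    assert(buf != NULL && type != NULL);
    assert(features != NULL || nfeatures == 0);

    n = snprintf(buf, len, "{\"type\": \"%s\", \"features\": [", type);
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    off = (size_t)n;

    for (size_t i = 0; i < nfeatures; i++) {
        n = snprintf(buf + off, len - off, "%s\"%s\"",
                     i ? ", " : "", features[i]);
        if (n < 0 || (size_t)n >= len - off) {
            return -1;
        }
        off += (size_t)n;
    }

    n = snprintf(buf + off, len - off, "]}\n");
    if (n < 0 || (size_t)n >= len - off) {
        return -1;
    }
    return (int)(off + (size_t)n);
}
```
| |
| On ``--print-capabilities`` the backend would write the buffer to |
| stdout and exit with status 0 without performing its normal function. |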
| |
| vhost-user-input |
| ---------------- |
| |
| Command line options: |
| |
| --evdev-path=PATH |
| |
| Specify the Linux input device. |
| |
| (optional) |
| |
| --no-grab |
| |
| Do not request exclusive access to the input device. |
| |
| (optional) |
| |
| vhost-user-gpu |
| -------------- |
| |
| Command line options: |
| |
| --render-node=PATH |
| |
| Specify the GPU DRM render node. |
| |
| (optional) |
| |
| --virgl |
| |
| Enable virgl rendering support. |
| |
| (optional) |
| |
| vhost-user-blk |
| -------------- |
| |
| Command line options: |
| |
| --blk-file=PATH |
| |
| Specify block device or file path. |
| |
| (optional) |
| |
| --read-only |
| |
| Enable read-only. |
| |
| (optional) |