| Vhost-user Protocol |
| =================== |
| |
| Copyright (c) 2014 Virtual Open Systems Sarl. |
| |
| This work is licensed under the terms of the GNU GPL, version 2 or later. |
| See the COPYING file in the top-level directory. |
| =================== |
| |
| This protocol is aiming to complement the ioctl interface used to control the |
| vhost implementation in the Linux kernel. It implements the control plane needed |
| to establish virtqueue sharing with a user space process on the same host. It |
| uses communication over a Unix domain socket to share file descriptors in the |
| ancillary data of the message. |
| |
| The protocol defines 2 sides of the communication, master and slave. Master is |
| the application that shares its virtqueues, in our case QEMU. Slave is the |
| consumer of the virtqueues. |
| |
| In the current implementation QEMU is the Master, and the Slave is the |
| external process consuming the virtio queues, for example a software |
| Ethernet switch running in user space, such as Snabbswitch, or a block |
| device backend processing read & write to a virtual disk. In order to |
| facilitate interoperability between various backend implementations, |
| it is recommended to follow the "Backend program conventions" |
| described in this document. |
| |
| Master and slave can be either a client (i.e. connecting) or server (listening) |
| in the socket communication. |
| |
| Message Specification |
| --------------------- |
| |
| Note that all numbers are in the machine native byte order. A vhost-user message |
| consists of 3 header fields and a payload: |
| |
| ------------------------------------ |
| | request | flags | size | payload | |
| ------------------------------------ |
| |
| * Request: 32-bit type of the request |
| * Flags: 32-bit bit field: |
| - Lower 2 bits are the version (currently 0x01) |
| - Bit 2 is the reply flag - needs to be sent on each reply from the slave |
| - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for |
| details. |
| * Size - 32-bit size of the payload |
| |
| |
| Depending on the request type, payload can be: |
| |
| * A single 64-bit integer |
| ------- |
| | u64 | |
| ------- |
| |
| u64: a 64-bit unsigned integer |
| |
| * A vring state description |
| --------------- |
| | index | num | |
| --------------- |
| |
| Index: a 32-bit index |
| Num: a 32-bit number |
| |
| * A vring address description |
| -------------------------------------------------------------- |
| | index | flags | size | descriptor | used | available | log | |
| -------------------------------------------------------------- |
| |
| Index: a 32-bit vring index |
| Flags: a 32-bit vring flags |
| Descriptor: a 64-bit ring address of the vring descriptor table |
| Used: a 64-bit ring address of the vring used ring |
| Available: a 64-bit ring address of the vring available ring |
| Log: a 64-bit guest address for logging |
| |
| Note that a ring address is an IOVA if VIRTIO_F_IOMMU_PLATFORM has been |
| negotiated. Otherwise it is a user address. |
| |
| * Memory regions description |
| --------------------------------------------------- |
| | num regions | padding | region0 | ... | region7 | |
| --------------------------------------------------- |
| |
| Num regions: a 32-bit number of regions |
| Padding: 32-bit |
| |
| A region is: |
| ----------------------------------------------------- |
| | guest address | size | user address | mmap offset | |
| ----------------------------------------------------- |
| |
| Guest address: a 64-bit guest address of the region |
| Size: a 64-bit size |
| User address: a 64-bit user address |
| mmap offset: 64-bit offset where region starts in the mapped memory |
| |
| * Log description |
| --------------------------- |
| | log size | log offset | |
| --------------------------- |
| log size: size of area used for logging |
| log offset: offset from start of supplied file descriptor |
| where logging starts (i.e. where guest address 0 would be logged) |
| |
| * An IOTLB message |
| --------------------------------------------------------- |
| | iova | size | user address | permissions flags | type | |
| --------------------------------------------------------- |
| |
| IOVA: a 64-bit I/O virtual address programmed by the guest |
| Size: a 64-bit size |
| User address: a 64-bit user address |
| Permissions: an 8-bit value: |
| - 0: No access |
| - 1: Read access |
| - 2: Write access |
| - 3: Read/Write access |
| Type: an 8-bit IOTLB message type: |
| - 1: IOTLB miss |
| - 2: IOTLB update |
| - 3: IOTLB invalidate |
| - 4: IOTLB access fail |
| |
| * Virtio device config space |
| ----------------------------------- |
| | offset | size | flags | payload | |
| ----------------------------------- |
| |
| Offset: a 32-bit offset of virtio device's configuration space |
| Size: a 32-bit configuration space access size in bytes |
| Flags: a 32-bit value: |
| - 0: Vhost master messages used for writeable fields |
| - 1: Vhost master messages used for live migration |
| Payload: Size bytes array holding the contents of the virtio |
| device's configuration space |
| |
| * Vring area description |
| ----------------------- |
| | u64 | size | offset | |
| ----------------------- |
| |
| u64: a 64-bit integer contains vring index and flags |
| Size: a 64-bit size of this area |
| Offset: a 64-bit offset of this area from the start of the |
| supplied file descriptor |
| |
| * Inflight description |
| ----------------------------------------------------- |
| | mmap size | mmap offset | num queues | queue size | |
| ----------------------------------------------------- |
| |
| mmap size: a 64-bit size of area to track inflight I/O |
| mmap offset: a 64-bit offset of this area from the start |
| of the supplied file descriptor |
| num queues: a 16-bit number of virtqueues |
| queue size: a 16-bit size of virtqueues |
| |
| In QEMU the vhost-user message is implemented with the following struct: |
| |
| typedef struct VhostUserMsg { |
| VhostUserRequest request; |
| uint32_t flags; |
| uint32_t size; |
| union { |
| uint64_t u64; |
| struct vhost_vring_state state; |
| struct vhost_vring_addr addr; |
| VhostUserMemory memory; |
| VhostUserLog log; |
| struct vhost_iotlb_msg iotlb; |
| VhostUserConfig config; |
| VhostUserVringArea area; |
| VhostUserInflight inflight; |
| }; |
| } QEMU_PACKED VhostUserMsg; |
| |
| Communication |
| ------------- |
| |
| The protocol for vhost-user is based on the existing implementation of vhost |
| for the Linux Kernel. Most messages that can be sent via the Unix domain socket |
| implementing vhost-user have an equivalent ioctl to the kernel implementation. |
| |
| The communication consists of master sending message requests and slave sending |
| message replies. Most of the requests don't require replies. Here is a list of |
| the ones that do: |
| |
| * VHOST_USER_GET_FEATURES |
| * VHOST_USER_GET_PROTOCOL_FEATURES |
| * VHOST_USER_GET_VRING_BASE |
| * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) |
| * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD) |
| |
| [ Also see the section on REPLY_ACK protocol extension. ] |
| |
| There are several messages that the master sends with file descriptors passed |
| in the ancillary data: |
| |
| * VHOST_USER_SET_MEM_TABLE |
| * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) |
| * VHOST_USER_SET_LOG_FD |
| * VHOST_USER_SET_VRING_KICK |
| * VHOST_USER_SET_VRING_CALL |
| * VHOST_USER_SET_VRING_ERR |
| * VHOST_USER_SET_SLAVE_REQ_FD |
| * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD) |
| |
| If Master is unable to send the full message or receives a wrong reply it will |
| close the connection. An optional reconnection mechanism can be implemented. |
| |
| Any protocol extensions are gated by protocol feature bits, |
| which allows full backwards compatibility on both master |
| and slave. |
| As older slaves don't support negotiating protocol features, |
| a feature bit was dedicated for this purpose: |
| #define VHOST_USER_F_PROTOCOL_FEATURES 30 |
| |
| Starting and stopping rings |
| ---------------------- |
| Client must only process each ring when it is started. |
| |
| Client must only pass data between the ring and the |
| backend, when the ring is enabled. |
| |
| If ring is started but disabled, client must process the |
| ring without talking to the backend. |
| |
| For example, for a networking device, in the disabled state |
| client must not supply any new RX packets, but must process |
| and discard any TX packets. |
| |
| If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized |
| in an enabled state. |
| |
| If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized |
| in a disabled state. Client must not pass data to/from the backend until ring is enabled by |
| VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by |
| VHOST_USER_SET_VRING_ENABLE with parameter 0. |
| |
| Each ring is initialized in a stopped state, client must not process it until |
| ring is started, or after it has been stopped. |
| |
| Client must start ring upon receiving a kick (that is, detecting that file |
| descriptor is readable) on the descriptor specified by |
| VHOST_USER_SET_VRING_KICK, and stop ring upon receiving |
| VHOST_USER_GET_VRING_BASE. |
| |
| While processing the rings (whether they are enabled or not), client must |
| support changing some configuration aspects on the fly. |
| |
| Multiple queue support |
| ---------------------- |
| |
| Multiple queue is treated as a protocol extension, hence the slave has to |
| implement protocol features first. The multiple queues feature is supported |
| only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set. |
| |
| The max number of queue pairs the slave supports can be queried with message |
| VHOST_USER_GET_QUEUE_NUM. Master should stop when the number of |
| requested queues is bigger than that. |
| |
| As all queues share one connection, the master uses a unique index for each |
| queue in the sent message to identify a specified queue. One queue pair |
| is enabled initially. More queues are enabled dynamically, by sending |
| message VHOST_USER_SET_VRING_ENABLE. |
| |
| Migration |
| --------- |
| |
| During live migration, the master may need to track the modifications |
| the slave makes to the memory mapped regions. The client should mark |
| the dirty pages in a log. Once it complies to this logging, it may |
| declare the VHOST_F_LOG_ALL vhost feature. |
| |
| To start/stop logging of data/used ring writes, server may send messages |
| VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with |
| VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively. |
| |
| All the modifications to memory pointed by vring "descriptor" should |
| be marked. Modifications to "used" vring should be marked if |
| VHOST_VRING_F_LOG is part of ring's flags. |
| |
| Dirty pages are of size: |
| #define VHOST_LOG_PAGE 0x1000 |
| |
| The log memory fd is provided in the ancillary data of |
| VHOST_USER_SET_LOG_BASE message when the slave has |
| VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature. |
| |
| The size of the log is supplied as part of VhostUserMsg |
| which should be large enough to cover all known guest |
| addresses. Log starts at the supplied offset in the |
| supplied file descriptor. |
| The log covers from address 0 to the maximum of guest |
| regions. In pseudo-code, to mark page at "addr" as dirty: |
| |
| page = addr / VHOST_LOG_PAGE |
| log[page / 8] |= 1 << page % 8 |
| |
| Where addr is the guest physical address. |
| |
| Use atomic operations, as the log may be concurrently manipulated. |
| |
| Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG |
| is set for this ring), log_guest_addr should be used to calculate the log |
| offset: the write to first byte of the used ring is logged at this offset from |
| log start. Also note that this value might be outside the legal guest physical |
| address range (i.e. does not have to be covered by the VhostUserMemory table), |
| but the bit offset of the last byte of the ring must fall within |
| the size supplied by VhostUserLog. |
| |
| VHOST_USER_SET_LOG_FD is an optional message with an eventfd in |
| ancillary data, it may be used to inform the master that the log has |
| been modified. |
| |
| Once the source has finished migration, rings will be stopped by |
| the source. No further update must be done before rings are |
| restarted. |
| |
| In postcopy migration the slave is started before all the memory has been |
| received from the source host, and care must be taken to avoid accessing pages |
| that have yet to be received. The slave opens a 'userfault'-fd and registers |
| the memory with it; this fd is then passed back over to the master. |
| The master services requests on the userfaultfd for pages that are accessed |
| and when the page is available it performs WAKE ioctl's on the userfaultfd |
| to wake the stalled slave. The client indicates support for this via the |
| VHOST_USER_PROTOCOL_F_PAGEFAULT feature. |
| |
| Memory access |
| ------------- |
| |
| The master sends a list of vhost memory regions to the slave using the |
| VHOST_USER_SET_MEM_TABLE message. Each region has two base addresses: a guest |
| address and a user address. |
| |
| Messages contain guest addresses and/or user addresses to reference locations |
| within the shared memory. The mapping of these addresses works as follows. |
| |
| User addresses map to the vhost memory region containing that user address. |
| |
| When the VIRTIO_F_IOMMU_PLATFORM feature has not been negotiated: |
| |
| * Guest addresses map to the vhost memory region containing that guest |
| address. |
| |
| When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated: |
| |
| * Guest addresses are also called I/O virtual addresses (IOVAs). They are |
| translated to user addresses via the IOTLB. |
| |
| * The vhost memory region guest address is not used. |
| |
| IOMMU support |
| ------------- |
| |
| When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated, the master |
| sends IOTLB entries update & invalidation by sending VHOST_USER_IOTLB_MSG |
| requests to the slave with a struct vhost_iotlb_msg as payload. For update |
| events, the iotlb payload has to be filled with the update message type (2), |
| the I/O virtual address, the size, the user virtual address, and the |
| permissions flags. Addresses and size must be within vhost memory regions set |
| via the VHOST_USER_SET_MEM_TABLE request. For invalidation events, the iotlb |
| payload has to be filled with the invalidation message type (3), the I/O virtual |
| address and the size. On success, the slave is expected to reply with a zero |
| payload, non-zero otherwise. |
| |
| The slave relies on the slave communcation channel (see "Slave communication" |
| section below) to send IOTLB miss and access failure events, by sending |
| VHOST_USER_SLAVE_IOTLB_MSG requests to the master with a struct vhost_iotlb_msg |
| as payload. For miss events, the iotlb payload has to be filled with the miss |
| message type (1), the I/O virtual address and the permissions flags. For access |
| failure event, the iotlb payload has to be filled with the access failure |
| message type (4), the I/O virtual address and the permissions flags. |
| For synchronization purpose, the slave may rely on the reply-ack feature, |
| so the master may send a reply when operation is completed if the reply-ack |
| feature is negotiated and slaves requests a reply. For miss events, completed |
| operation means either master sent an update message containing the IOTLB entry |
| containing requested address and permission, or master sent nothing if the IOTLB |
| miss message is invalid (invalid IOVA or permission). |
| |
| The master isn't expected to take the initiative to send IOTLB update messages, |
| as the slave sends IOTLB miss messages for the guest virtual memory areas it |
| needs to access. |
| |
| Slave communication |
| ------------------- |
| |
| An optional communication channel is provided if the slave declares |
| VHOST_USER_PROTOCOL_F_SLAVE_REQ protocol feature, to allow the slave to make |
| requests to the master. |
| |
| The fd is provided via VHOST_USER_SET_SLAVE_REQ_FD ancillary data. |
| |
| A slave may then send VHOST_USER_SLAVE_* messages to the master |
| using this fd communication channel. |
| |
| If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated, |
| slave can send file descriptors (at most 8 descriptors in each message) |
| to master via ancillary data using this fd communication channel. |
| |
| Inflight I/O tracking |
| --------------------- |
| |
| To support reconnecting after restart or crash, slave may need to resubmit |
| inflight I/Os. If virtqueue is processed in order, we can easily achieve |
| that by getting the inflight descriptors from descriptor table (split virtqueue) |
| or descriptor ring (packed virtqueue). However, it can't work when we process |
| descriptors out-of-order because some entries which store the information of |
| inflight descriptors in available ring (split virtqueue) or descriptor |
| ring (packed virtqueue) might be overrided by new entries. To solve this |
| problem, slave need to allocate an extra buffer to store this information of inflight |
| descriptors and share it with master for persistent. VHOST_USER_GET_INFLIGHT_FD and |
| VHOST_USER_SET_INFLIGHT_FD are used to transfer this buffer between master |
| and slave. And the format of this buffer is described below: |
| |
| ------------------------------------------------------- |
| | queue0 region | queue1 region | ... | queueN region | |
| ------------------------------------------------------- |
| |
| N is the number of available virtqueues. Slave could get it from num queues |
| field of VhostUserInflight. |
| |
| For split virtqueue, queue region can be implemented as: |
| |
| typedef struct DescStateSplit { |
| /* Indicate whether this descriptor is inflight or not. |
| * Only available for head-descriptor. */ |
| uint8_t inflight; |
| |
| /* Padding */ |
| uint8_t padding[5]; |
| |
| /* Maintain a list for the last batch of used descriptors. |
| * Only available when batching is used for submitting */ |
| uint16_t next; |
| |
| /* Used to preserve the order of fetching available descriptors. |
| * Only available for head-descriptor. */ |
| uint64_t counter; |
| } DescStateSplit; |
| |
| typedef struct QueueRegionSplit { |
| /* The feature flags of this region. Now it's initialized to 0. */ |
| uint64_t features; |
| |
| /* The version of this region. It's 1 currently. |
| * Zero value indicates an uninitialized buffer */ |
| uint16_t version; |
| |
| /* The size of DescStateSplit array. It's equal to the virtqueue |
| * size. Slave could get it from queue size field of VhostUserInflight. */ |
| uint16_t desc_num; |
| |
| /* The head of list that track the last batch of used descriptors. */ |
| uint16_t last_batch_head; |
| |
| /* Store the idx value of used ring */ |
| uint16_t used_idx; |
| |
| /* Used to track the state of each descriptor in descriptor table */ |
| DescStateSplit desc[0]; |
| } QueueRegionSplit; |
| |
| To track inflight I/O, the queue region should be processed as follows: |
| |
| When receiving available buffers from the driver: |
| |
| 1. Get the next available head-descriptor index from available ring, i |
| |
| 2. Set desc[i].counter to the value of global counter |
| |
| 3. Increase global counter by 1 |
| |
| 4. Set desc[i].inflight to 1 |
| |
| When supplying used buffers to the driver: |
| |
| 1. Get corresponding used head-descriptor index, i |
| |
| 2. Set desc[i].next to last_batch_head |
| |
| 3. Set last_batch_head to i |
| |
| 4. Steps 1,2,3 may be performed repeatedly if batching is possible |
| |
| 5. Increase the idx value of used ring by the size of the batch |
| |
| 6. Set the inflight field of each DescStateSplit entry in the batch to 0 |
| |
| 7. Set used_idx to the idx value of used ring |
| |
| When reconnecting: |
| |
| 1. If the value of used_idx does not match the idx value of used ring (means |
| the inflight field of DescStateSplit entries in last batch may be incorrect), |
| |
| (a) Subtract the value of used_idx from the idx value of used ring to get |
| last batch size of DescStateSplit entries |
| |
| (b) Set the inflight field of each DescStateSplit entry to 0 in last batch |
| list which starts from last_batch_head |
| |
| (c) Set used_idx to the idx value of used ring |
| |
| 2. Resubmit inflight DescStateSplit entries in order of their counter value |
| |
| For packed virtqueue, queue region can be implemented as: |
| |
| typedef struct DescStatePacked { |
| /* Indicate whether this descriptor is inflight or not. |
| * Only available for head-descriptor. */ |
| uint8_t inflight; |
| |
| /* Padding */ |
| uint8_t padding; |
| |
| /* Link to the next free entry */ |
| uint16_t next; |
| |
| /* Link to the last entry of descriptor list. |
| * Only available for head-descriptor. */ |
| uint16_t last; |
| |
| /* The length of descriptor list. |
| * Only available for head-descriptor. */ |
| uint16_t num; |
| |
| /* Used to preserve the order of fetching available descriptors. |
| * Only available for head-descriptor. */ |
| uint64_t counter; |
| |
| /* The buffer id */ |
| uint16_t id; |
| |
| /* The descriptor flags */ |
| uint16_t flags; |
| |
| /* The buffer length */ |
| uint32_t len; |
| |
| /* The buffer address */ |
| uint64_t addr; |
| } DescStatePacked; |
| |
| typedef struct QueueRegionPacked { |
| /* The feature flags of this region. Now it's initialized to 0. */ |
| uint64_t features; |
| |
| /* The version of this region. It's 1 currently. |
| * Zero value indicates an uninitialized buffer */ |
| uint16_t version; |
| |
| /* The size of DescStatePacked array. It's equal to the virtqueue |
| * size. Slave could get it from queue size field of VhostUserInflight. */ |
| uint16_t desc_num; |
| |
| /* The head of free DescStatePacked entry list */ |
| uint16_t free_head; |
| |
| /* The old head of free DescStatePacked entry list */ |
| uint16_t old_free_head; |
| |
| /* The used index of descriptor ring */ |
| uint16_t used_idx; |
| |
| /* The old used index of descriptor ring */ |
| uint16_t old_used_idx; |
| |
| /* Device ring wrap counter */ |
| uint8_t used_wrap_counter; |
| |
| /* The old device ring wrap counter */ |
| uint8_t old_used_wrap_counter; |
| |
| /* Padding */ |
| uint8_t padding[7]; |
| |
| /* Used to track the state of each descriptor fetched from descriptor ring */ |
| DescStatePacked desc[0]; |
| } QueueRegionPacked; |
| |
| To track inflight I/O, the queue region should be processed as follows: |
| |
| When receiving available buffers from the driver: |
| |
| 1. Get the next available descriptor entry from descriptor ring, d |
| |
| 2. If d is head descriptor, |
| |
| (a) Set desc[old_free_head].num to 0 |
| |
| (b) Set desc[old_free_head].counter to the value of global counter |
| |
| (c) Increase global counter by 1 |
| |
| (d) Set desc[old_free_head].inflight to 1 |
| |
| 3. If d is last descriptor, set desc[old_free_head].last to free_head |
| |
| 4. Increase desc[old_free_head].num by 1 |
| |
| 5. Set desc[free_head].addr, desc[free_head].len, desc[free_head].flags, |
| desc[free_head].id to d.addr, d.len, d.flags, d.id |
| |
| 6. Set free_head to desc[free_head].next |
| |
| 7. If d is last descriptor, set old_free_head to free_head |
| |
| When supplying used buffers to the driver: |
| |
| 1. Get corresponding used head-descriptor entry from descriptor ring, d |
| |
| 2. Get corresponding DescStatePacked entry, e |
| |
| 3. Set desc[e.last].next to free_head |
| |
| 4. Set free_head to the index of e |
| |
| 5. Steps 1,2,3,4 may be performed repeatedly if batching is possible |
| |
| 6. Increase used_idx by the size of the batch and update used_wrap_counter if needed |
| |
| 7. Update d.flags |
| |
| 8. Set the inflight field of each head DescStatePacked entry in the batch to 0 |
| |
| 9. Set old_free_head, old_used_idx, old_used_wrap_counter to free_head, used_idx, |
| used_wrap_counter |
| |
| When reconnecting: |
| |
| 1. If used_idx does not match old_used_idx (means the inflight field of DescStatePacked |
| entries in last batch may be incorrect), |
| |
| (a) Get the next descriptor ring entry through old_used_idx, d |
| |
| (b) Use old_used_wrap_counter to calculate the available flags |
| |
| (c) If d.flags is not equal to the calculated flags value (means slave has |
| submitted the buffer to guest driver before crash, so it has to commit the |
| in-progres update), set old_free_head, old_used_idx, old_used_wrap_counter |
| to free_head, used_idx, used_wrap_counter |
| |
| 2. Set free_head, used_idx, used_wrap_counter to old_free_head, old_used_idx, |
| old_used_wrap_counter (roll back any in-progress update) |
| |
| 3. Set the inflight field of each DescStatePacked entry in free list to 0 |
| |
| 4. Resubmit inflight DescStatePacked entries in order of their counter value |
| |
| Protocol features |
| ----------------- |
| |
| #define VHOST_USER_PROTOCOL_F_MQ 0 |
| #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 |
| #define VHOST_USER_PROTOCOL_F_RARP 2 |
| #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3 |
| #define VHOST_USER_PROTOCOL_F_MTU 4 |
| #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5 |
| #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6 |
| #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7 |
| #define VHOST_USER_PROTOCOL_F_PAGEFAULT 8 |
| #define VHOST_USER_PROTOCOL_F_CONFIG 9 |
| #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10 |
| #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11 |
| #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12 |
| |
| Master message types |
| -------------------- |
| |
| * VHOST_USER_GET_FEATURES |
| |
| Id: 1 |
| Equivalent ioctl: VHOST_GET_FEATURES |
| Master payload: N/A |
| Slave payload: u64 |
| |
| Get from the underlying vhost implementation the features bitmask. |
| Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for |
| VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. |
| |
| * VHOST_USER_SET_FEATURES |
| |
| Id: 2 |
| Ioctl: VHOST_SET_FEATURES |
| Master payload: u64 |
| |
| Enable features in the underlying vhost implementation using a bitmask. |
| Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for |
| VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. |
| |
| * VHOST_USER_GET_PROTOCOL_FEATURES |
| |
| Id: 15 |
| Equivalent ioctl: VHOST_GET_FEATURES |
| Master payload: N/A |
| Slave payload: u64 |
| |
| Get the protocol feature bitmask from the underlying vhost implementation. |
| Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in |
| VHOST_USER_GET_FEATURES. |
| Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support |
| this message even before VHOST_USER_SET_FEATURES was called. |
| |
| * VHOST_USER_SET_PROTOCOL_FEATURES |
| |
| Id: 16 |
| Ioctl: VHOST_SET_FEATURES |
| Master payload: u64 |
| |
| Enable protocol features in the underlying vhost implementation. |
| Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in |
| VHOST_USER_GET_FEATURES. |
| Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support |
| this message even before VHOST_USER_SET_FEATURES was called. |
| |
| * VHOST_USER_SET_OWNER |
| |
| Id: 3 |
| Equivalent ioctl: VHOST_SET_OWNER |
| Master payload: N/A |
| |
| Issued when a new connection is established. It sets the current Master |
| as an owner of the session. This can be used on the Slave as a |
| "session start" flag. |
| |
| * VHOST_USER_RESET_OWNER |
| |
| Id: 4 |
| Master payload: N/A |
| |
| This is no longer used. Used to be sent to request disabling |
| all rings, but some clients interpreted it to also discard |
| connection state (this interpretation would lead to bugs). |
| It is recommended that clients either ignore this message, |
| or use it to disable all rings. |
| |
| * VHOST_USER_SET_MEM_TABLE |
| |
| Id: 5 |
| Equivalent ioctl: VHOST_SET_MEM_TABLE |
| Master payload: memory regions description |
| Slave payload: (postcopy only) memory regions description |
| |
| Sets the memory map regions on the slave so it can translate the vring |
| addresses. In the ancillary data there is an array of file descriptors |
| for each memory mapped region. The size and ordering of the fds matches |
| the number and ordering of memory regions. |
| |
| When VHOST_USER_POSTCOPY_LISTEN has been received, SET_MEM_TABLE replies with |
| the bases of the memory mapped regions to the master. The slave must |
| have mmap'd the regions but not yet accessed them and should not yet generate |
| a userfault event. Note NEED_REPLY_MASK is not set in this case. |
| QEMU will then reply back to the list of mappings with an empty |
| VHOST_USER_SET_MEM_TABLE as an acknowledgment; only upon reception of this |
| message may the guest start accessing the memory and generating faults. |
| |
| * VHOST_USER_SET_LOG_BASE |
| |
| Id: 6 |
| Equivalent ioctl: VHOST_SET_LOG_BASE |
| Master payload: u64 |
| Slave payload: N/A |
| |
| Sets logging shared memory space. |
| When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol |
| feature, the log memory fd is provided in the ancillary data of |
| VHOST_USER_SET_LOG_BASE message, the size and offset of shared |
| memory area provided in the message. |
| |
| |
| * VHOST_USER_SET_LOG_FD |
| |
| Id: 7 |
| Equivalent ioctl: VHOST_SET_LOG_FD |
| Master payload: N/A |
| |
| Sets the logging file descriptor, which is passed as ancillary data. |
| |
| * VHOST_USER_SET_VRING_NUM |
| |
| Id: 8 |
| Equivalent ioctl: VHOST_SET_VRING_NUM |
| Master payload: vring state description |
| |
| Set the size of the queue. |
| |
| * VHOST_USER_SET_VRING_ADDR |
| |
| Id: 9 |
| Equivalent ioctl: VHOST_SET_VRING_ADDR |
| Master payload: vring address description |
| Slave payload: N/A |
| |
| Sets the addresses of the different aspects of the vring. |
| |
| * VHOST_USER_SET_VRING_BASE |
| |
| Id: 10 |
| Equivalent ioctl: VHOST_SET_VRING_BASE |
| Master payload: vring state description |
| |
| Sets the base offset in the available vring. |
| |
| * VHOST_USER_GET_VRING_BASE |
| |
| Id: 11 |
| Equivalent ioctl: VHOST_USER_GET_VRING_BASE |
| Master payload: vring state description |
| Slave payload: vring state description |
| |
| Get the available vring base offset. |
| |
| * VHOST_USER_SET_VRING_KICK |
| |
| Id: 12 |
| Equivalent ioctl: VHOST_SET_VRING_KICK |
| Master payload: u64 |
| |
| Set the event file descriptor for adding buffers to the vring. It |
| is passed in the ancillary data. |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. This signals that polling should be used |
| instead of waiting for a kick. |
| |
| * VHOST_USER_SET_VRING_CALL |
| |
| Id: 13 |
| Equivalent ioctl: VHOST_SET_VRING_CALL |
| Master payload: u64 |
| |
| Set the event file descriptor to signal when buffers are used. It |
| is passed in the ancillary data. |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. This signals that polling will be used |
| instead of waiting for the call. |
| |
| * VHOST_USER_SET_VRING_ERR |
| |
| Id: 14 |
| Equivalent ioctl: VHOST_SET_VRING_ERR |
| Master payload: u64 |
| |
| Set the event file descriptor to signal when error occurs. It |
| is passed in the ancillary data. |
| Bits (0-7) of the payload contain the vring index. Bit 8 is the |
| invalid FD flag. This flag is set when there is no file descriptor |
| in the ancillary data. |
| |
| * VHOST_USER_GET_QUEUE_NUM |
| |
| Id: 17 |
| Equivalent ioctl: N/A |
| Master payload: N/A |
| Slave payload: u64 |
| |
| Query how many queues the backend supports. This request should be |
| sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol |
| features by VHOST_USER_GET_PROTOCOL_FEATURES. |
| |
| * VHOST_USER_SET_VRING_ENABLE |
| |
| Id: 18 |
| Equivalent ioctl: N/A |
| Master payload: vring state description |
| |
| Signal slave to enable or disable corresponding vring. |
| This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES |
| has been negotiated. |
| |
| * VHOST_USER_SEND_RARP |
| |
| Id: 19 |
| Equivalent ioctl: N/A |
| Master payload: u64 |
| |
| Ask vhost user backend to broadcast a fake RARP to notify the migration |
| is terminated for guest that does not support GUEST_ANNOUNCE. |
| Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in |
| VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP |
| is present in VHOST_USER_GET_PROTOCOL_FEATURES. |
| The first 6 bytes of the payload contain the mac address of the guest to |
| allow the vhost user backend to construct and broadcast the fake RARP. |
| |
| * VHOST_USER_NET_SET_MTU |
| |
| Id: 20 |
| Equivalent ioctl: N/A |
| Master payload: u64 |
| |
| Set host MTU value exposed to the guest. |
| This request should be sent only when VIRTIO_NET_F_MTU feature has been |
| successfully negotiated, VHOST_USER_F_PROTOCOL_FEATURES is present in |
| VHOST_USER_GET_FEATURES and protocol feature bit |
| VHOST_USER_PROTOCOL_F_NET_MTU is present in |
| VHOST_USER_GET_PROTOCOL_FEATURES. |
| If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond |
| with zero in case the specified MTU is valid, or non-zero otherwise. |
| |
| * VHOST_USER_SET_SLAVE_REQ_FD |
| |
| Id: 21 |
| Equivalent ioctl: N/A |
| Master payload: N/A |
| |
| Set the socket file descriptor for slave initiated requests. It is passed |
| in the ancillary data. |
| This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES |
| has been negotiated, and protocol feature bit VHOST_USER_PROTOCOL_F_SLAVE_REQ |
| bit is present in VHOST_USER_GET_PROTOCOL_FEATURES. |
| If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond |
| with zero for success, non-zero otherwise. |
| |
| * VHOST_USER_IOTLB_MSG |
| |
| Id: 22 |
| Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type) |
| Master payload: struct vhost_iotlb_msg |
| Slave payload: u64 |
| |
| Send IOTLB messages with struct vhost_iotlb_msg as payload. |
| Master sends such requests to update and invalidate entries in the device |
| IOTLB. The slave has to acknowledge the request with sending zero as u64 |
| payload for success, non-zero otherwise. |
| This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature |
| has been successfully negotiated. |
| |
| * VHOST_USER_SET_VRING_ENDIAN |
| |
| Id: 23 |
| Equivalent ioctl: VHOST_SET_VRING_ENDIAN |
| Master payload: vring state description |
| |
| Set the endianness of a VQ for legacy devices. Little-endian is indicated |
| with state.num set to 0 and big-endian is indicated with state.num set |
| to 1. Other values are invalid. |
| This request should be sent only when VHOST_USER_PROTOCOL_F_CROSS_ENDIAN |
| has been negotiated. |
| Backends that negotiated this feature should handle both endiannesses |
| and expect this message once (per VQ) during device configuration |
| (ie. before the master starts the VQ). |
| |
| * VHOST_USER_GET_CONFIG |
| |
| Id: 24 |
| Equivalent ioctl: N/A |
| Master payload: virtio device config space |
| Slave payload: virtio device config space |
| |
| When VHOST_USER_PROTOCOL_F_CONFIG is negotiated, this message is |
| submitted by the vhost-user master to fetch the contents of the virtio |
| device configuration space, vhost-user slave's payload size MUST match |
| master's request, vhost-user slave uses zero length of payload to |
| indicate an error to vhost-user master. The vhost-user master may |
| cache the contents to avoid repeated VHOST_USER_GET_CONFIG calls. |
| |
| * VHOST_USER_SET_CONFIG |
| |
| Id: 25 |
| Equivalent ioctl: N/A |
| Master payload: virtio device config space |
| Slave payload: N/A |
| |
| When VHOST_USER_PROTOCOL_F_CONFIG is negotiated, this message is |
| submitted by the vhost-user master when the Guest changes the virtio |
| device configuration space and also can be used for live migration |
| on the destination host. The vhost-user slave must check the flags |
| field, and slaves MUST NOT accept SET_CONFIG for read-only |
| configuration space fields unless the live migration bit is set. |
| |
| * VHOST_USER_CREATE_CRYPTO_SESSION |
| |
| Id: 26 |
| Equivalent ioctl: N/A |
| Master payload: crypto session description |
| Slave payload: crypto session description |
| |
| Create a session for crypto operation. The server side must return the |
| session id, 0 or positive for success, negative for failure. |
| This request should be sent only when VHOST_USER_PROTOCOL_F_CRYPTO_SESSION |
| feature has been successfully negotiated. |
| It's a required feature for crypto devices. |
| |
| * VHOST_USER_CLOSE_CRYPTO_SESSION |
| |
| Id: 27 |
| Equivalent ioctl: N/A |
| Master payload: u64 |
| |
| Close a session for crypto operation which was previously |
| created by VHOST_USER_CREATE_CRYPTO_SESSION. |
| This request should be sent only when VHOST_USER_PROTOCOL_F_CRYPTO_SESSION |
| feature has been successfully negotiated. |
| It's a required feature for crypto devices. |
| |
| * VHOST_USER_POSTCOPY_ADVISE |
| Id: 28 |
| Master payload: N/A |
| Slave payload: userfault fd |
| |
| When VHOST_USER_PROTOCOL_F_PAGEFAULT is supported, the |
| master advises slave that a migration with postcopy enabled is underway, |
| the slave must open a userfaultfd for later use. |
| Note that at this stage the migration is still in precopy mode. |
| |
| * VHOST_USER_POSTCOPY_LISTEN |
| Id: 29 |
| Master payload: N/A |
| |
| Master advises slave that a transition to postcopy mode has happened. |
| The slave must ensure that shared memory is registered with userfaultfd |
| to cause faulting of non-present pages. |
| |
| This is always sent sometime after a VHOST_USER_POSTCOPY_ADVISE, and |
| thus only when VHOST_USER_PROTOCOL_F_PAGEFAULT is supported. |
| |
| * VHOST_USER_POSTCOPY_END |
| Id: 30 |
| Slave payload: u64 |
| |
| Master advises that postcopy migration has now completed. The |
| slave must disable the userfaultfd. The response is an acknowledgement |
| only. |
| When VHOST_USER_PROTOCOL_F_PAGEFAULT is supported, this message |
| is sent at the end of the migration, after VHOST_USER_POSTCOPY_LISTEN |
| was previously sent. |
| The value returned is an error indication; 0 is success. |
| |
| * VHOST_USER_GET_INFLIGHT_FD |
| Id: 31 |
| Equivalent ioctl: N/A |
| Master payload: inflight description |
| |
| When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been |
| successfully negotiated, this message is submitted by master to get |
| a shared buffer from slave. The shared buffer will be used to track |
| inflight I/O by slave. QEMU should retrieve a new one when vm reset. |
| |
| * VHOST_USER_SET_INFLIGHT_FD |
| Id: 32 |
| Equivalent ioctl: N/A |
| Master payload: inflight description |
| |
| When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been |
| successfully negotiated, this message is submitted by master to send |
| the shared inflight buffer back to slave so that slave could get |
| inflight I/O after a crash or restart. |
| |
| Slave message types |
| ------------------- |
| |
| * VHOST_USER_SLAVE_IOTLB_MSG |
| |
| Id: 1 |
| Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type) |
| Slave payload: struct vhost_iotlb_msg |
| Master payload: N/A |
| |
| Send IOTLB messages with struct vhost_iotlb_msg as payload. |
| Slave sends such requests to notify of an IOTLB miss, or an IOTLB |
| access failure. If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, |
| and slave set the VHOST_USER_NEED_REPLY flag, master must respond with |
| zero when operation is successfully completed, or non-zero otherwise. |
| This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature |
| has been successfully negotiated. |
| |
| * VHOST_USER_SLAVE_CONFIG_CHANGE_MSG |
| |
| Id: 2 |
| Equivalent ioctl: N/A |
| Slave payload: N/A |
| Master payload: N/A |
| |
| When VHOST_USER_PROTOCOL_F_CONFIG is negotiated, vhost-user slave sends |
| such messages to notify that the virtio device's configuration space has |
| changed, for those host devices which can support such feature, host |
| driver can send VHOST_USER_GET_CONFIG message to slave to get the latest |
| content. If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, and slave set |
| the VHOST_USER_NEED_REPLY flag, master must respond with zero when |
| operation is successfully completed, or non-zero otherwise. |
| |
| * VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG |
| |
| Id: 3 |
| Equivalent ioctl: N/A |
| Slave payload: vring area description |
| Master payload: N/A |
| |
| Sets host notifier for a specified queue. The queue index is contained |
| in the u64 field of the vring area description. The host notifier is |
| described by the file descriptor (typically it's a VFIO device fd) which |
| is passed as ancillary data and the size (which is mmap size and should |
| be the same as host page size) and offset (which is mmap offset) carried |
| in the vring area description. QEMU can mmap the file descriptor based |
| on the size and offset to get a memory range. Registering a host notifier |
| means mapping this memory range to the VM as the specified queue's notify |
| MMIO region. Slave sends this request to tell QEMU to de-register the |
| existing notifier if any and register the new notifier if the request is |
| sent with a file descriptor. |
| This request should be sent only when VHOST_USER_PROTOCOL_F_HOST_NOTIFIER |
| protocol feature has been successfully negotiated. |
| |
| VHOST_USER_PROTOCOL_F_REPLY_ACK: |
| ------------------------------- |
| The original vhost-user specification only demands replies for certain |
| commands. This differs from the vhost protocol implementation where commands |
| are sent over an ioctl() call and block until the client has completed. |
| |
| With this protocol extension negotiated, the sender (QEMU) can set the |
| "need_reply" [Bit 3] flag to any command. This indicates that |
| the client MUST respond with a Payload VhostUserMsg indicating success or |
| failure. The payload should be set to zero on success or non-zero on failure, |
| unless the message already has an explicit reply body. |
| |
| The response payload gives QEMU a deterministic indication of the result |
| of the command. Today, QEMU is expected to terminate the main vhost-user |
| loop upon receiving such errors. In future, qemu could be taught to be more |
| resilient for selective requests. |
| |
| For the message types that already solicit a reply from the client, the |
| presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings |
| no behavioural change. (See the 'Communication' section for details.) |
| |
| Backend program conventions |
| --------------------------- |
| |
| vhost-user backends can provide various devices & services and may |
| need to be configured manually depending on the use case. However, it |
| is a good idea to follow the conventions listed here when |
| possible. Users, QEMU or libvirt, can then rely on some common |
| behaviour to avoid heterogenous configuration and management of the |
| backend programs and facilitate interoperability. |
| |
| Each backend installed on a host system should come with at least one |
| JSON file that conforms to the vhost-user.json schema. Each file |
| informs the management applications about the backend type, and binary |
| location. In addition, it defines rules for management apps for |
| picking the highest priority backend when multiple match the search |
| criteria (see @VhostUserBackend documentation in the schema file). |
| |
| If the backend is not capable of enabling a requested feature on the |
| host (such as 3D acceleration with virgl), or the initialization |
| failed, the backend should fail to start early and exit with a status |
| != 0. It may also print a message to stderr for further details. |
| |
| The backend program must not daemonize itself, but it may be |
| daemonized by the management layer. It may also have a restricted |
| access to the system. |
| |
| File descriptors 0, 1 and 2 will exist, and have regular |
| stdin/stdout/stderr usage (they may have been redirected to /dev/null |
| by the management layer, or to a log handler). |
| |
| The backend program must end (as quickly and cleanly as possible) when |
| the SIGTERM signal is received. Eventually, it may receive SIGKILL by |
| the management layer after a few seconds. |
| |
| The following command line options have an expected behaviour. They |
| are mandatory, unless explicitly said differently: |
| |
| * --socket-path=PATH |
| |
| This option specify the location of the vhost-user Unix domain socket. |
| It is incompatible with --fd. |
| |
| * --fd=FDNUM |
| |
| When this argument is given, the backend program is started with the |
| vhost-user socket as file descriptor FDNUM. It is incompatible with |
| --socket-path. |
| |
| * --print-capabilities |
| |
| Output to stdout the backend capabilities in JSON format, and then |
| exit successfully. Other options and arguments should be ignored, and |
| the backend program should not perform its normal function. The |
| capabilities can be reported dynamically depending on the host |
| capabilities. |
| |
| The JSON output is described in the vhost-user.json schema, by |
| @VHostUserBackendCapabilities. Example: |
| { |
| "type": "foo", |
| "features": [ |
| "feature-a", |
| "feature-b" |
| ] |
| } |
| |
| vhost-user-input |
| ---------------- |
| |
| Command line options: |
| |
| * --evdev-path=PATH (optional) |
| |
| Specify the linux input device. |
| |
| * --no-grab (optional) |
| |
| Do no request exclusive access to the input device. |
| |
| vhost-user-gpu |
| -------------- |
| |
| Command line options: |
| |
| * --render-node=PATH (optional) |
| |
| Specify the GPU DRM render node. |
| |
| * --virgl (optional) |
| |
| Enable virgl rendering support. |