| |
| Device Specification for Inter-VM shared memory device |
| ------------------------------------------------------ |
| |
| The Inter-VM shared memory device is designed to share a memory region (created |
| on the host via the POSIX shared memory API) between multiple QEMU processes |
| running different guests. In order for all guests to be able to pick up the |
| shared memory area, it is modeled by QEMU as a PCI device exposing said memory |
| to the guest as a PCI BAR. |
| The memory region does not belong to any guest, but is a POSIX memory object on |
| the host. The host can access this shared memory if needed. |
| |
| The device also provides an optional communication mechanism between guests |
| sharing the same memory object. This mechanism is described in the 'Guest to |
| guest communication' section. |
| |
| |
| The Inter-VM PCI device |
| ----------------------- |
| |
| From the VM point of view, the ivshmem PCI device supports three BARs. |
| |
| - BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is |
| not used. |
| - BAR1 is used for MSI-X when it is enabled in the device. |
| - BAR2 is used to access the shared memory object. |
| |
| How you use the device is your choice, but you must choose between two |
| behaviors: |
| |
| - If you only need the shared memory part, map BAR2. This gives the guest |
| access to the shared memory, which it can use as it sees fit (memnic, for |
| example, uses it in userland: http://dpdk.org/browse/memnic). |
| |
| - BAR0 and BAR1 are used to implement an optional communication mechanism |
| through interrupts in the guests. If you need an event mechanism between the |
| guests accessing the shared memory, you will most likely want to write a |
| kernel driver that handles the interrupts. See the 'Guest to guest |
| communication' section for details. |
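| |
| As a rough illustration of the first behavior, the sketch below maps a whole |
| file read/write and shared from guest userland. It assumes a Linux guest, |
| where a PCI BAR is typically exposed as a sysfs file such as |
| /sys/bus/pci/devices/<BDF>/resource2, and that fstat() reports the region |
| size for that file. The path and the helper name map_region are assumptions |
| made for illustration; they are not part of this specification. |
| |
```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the whole file at `path` read/write and shared.
 * Returns the mapping and stores its size in *len, or NULL on failure. */
void *map_region(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return p;
}
```
| |
| A caller would pass the path of the BAR2 resource file and release the |
| mapping with munmap() once it is no longer needed. |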
| |
| The behavior is chosen when starting your QEMU processes: |
| |
| - If no communication mechanism is needed, the first QEMU process to start |
| creates the shared memory on the host, and subsequent QEMU processes use it. |
| |
| - If the communication mechanism is needed, an ivshmem server must be started |
| before any QEMU process; each QEMU process then connects to the server's unix |
| socket. |
| |
| For more details on the QEMU ivshmem parameters, see the qemu-doc |
| documentation. |
| |
| |
| Guest to guest communication |
| ---------------------------- |
| |
| This section details the communication mechanism between the guests accessing |
| the ivshmem shared memory. |
| |
| *ivshmem server* |
| |
| The server code is available in qemu.git/contrib/ivshmem-server. |
| |
| The server must be started on the host before any guest. |
| It creates a shared memory object, then waits for clients to connect on a unix |
| socket. All messages are little-endian int64_t integers. |
| |
| For each client (QEMU process) that connects to the server: |
| - the server sends a protocol version; if the client does not support it, the |
| client closes the communication, |
| - the server assigns an ID to this client and sends the ID to it as the next |
| message, |
| - the server sends a fd to the shared memory object to this client, |
| - the server creates a new set of host eventfds associated with the new client |
| and sends this set to all already connected clients, |
| - finally, the server sends the eventfd sets of all clients to the new |
| client. |
| |
| The server signals all clients when one of them disconnects. |
| |
| Client IDs are limited to 16 bits by the current implementation (see the |
| Doorbell register in the 'PCI device registers' subsection). Hence at most |
| 65536 clients are supported. |
| |
| All the file descriptors (fd to the shared memory, eventfds for each client) |
| are passed to clients using SCM_RIGHTS over the server unix socket. |
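| |
| The message format and descriptor passing described above can be sketched as |
| follows. The helper names msg_send and msg_recv are hypothetical, and this is |
| not the code used by the ivshmem server or by QEMU. It only shows one |
| little-endian int64_t payload travelling with an optional file descriptor |
| attached as SCM_RIGHTS ancillary data, and it assumes each 8-byte message |
| arrives in a single recvmsg() call. |
| |
```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer large enough for one descriptor, suitably aligned. */
union fd_ctl {
    char raw[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send one little-endian int64_t message, attaching
 * `fd` as SCM_RIGHTS ancillary data when fd >= 0. Returns 0 on success. */
int msg_send(int sock, int64_t value, int fd)
{
    unsigned char buf[8];
    union fd_ctl ctl;
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct msghdr msg;
    int i;

    for (i = 0; i < 8; i++) /* encode the payload as little-endian */
        buf[i] = (unsigned char)((uint64_t)value >> (8 * i));

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (fd >= 0) {
        struct cmsghdr *cmsg;

        memset(&ctl, 0, sizeof(ctl));
        msg.msg_control = ctl.raw;
        msg.msg_controllen = sizeof(ctl.raw);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    }
    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(buf) ? 0 : -1;
}

/* Hypothetical helper: receive one message. *fd is set to -1 when the
 * message carries no descriptor. Returns 0 on success. */
int msg_recv(int sock, int64_t *value, int *fd)
{
    unsigned char buf[8];
    union fd_ctl ctl;
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct msghdr msg;
    struct cmsghdr *cmsg;
    uint64_t v = 0;
    int i;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.raw;
    msg.msg_controllen = sizeof(ctl.raw);
    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(buf))
        return -1;

    for (i = 0; i < 8; i++) /* decode the little-endian payload */
        v |= (uint64_t)buf[i] << (8 * i);
    *value = (int64_t)v;

    *fd = -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS &&
            cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return 0;
}
```
| |
| The payload is encoded byte by byte so that the sketch does not depend on the |
| byte order of the host. |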
| |
| Apart from the current ivshmem implementation in QEMU, an ivshmem client has |
| been provided in qemu.git/contrib/ivshmem-client for debugging. |
| |
| *QEMU as an ivshmem client* |
| |
| At initialisation, when creating the ivshmem device, QEMU first receives a |
| protocol version and closes the connection to the server if it does not match. |
| QEMU then gets its ID from the server and makes it available through the BAR0 |
| IVPosition register for the VM to use (see 'PCI device registers' subsection). |
| Next, QEMU uses the fd to the shared memory to map it to BAR2. |
| The eventfds received from the server for all other clients are stored to |
| implement the BAR0 Doorbell register (see 'PCI device registers' subsection). |
| Finally, the eventfds assigned to this QEMU process are used to send interrupts |
| in this VM. |
| |
| *PCI device registers* |
| |
| From the VM point of view, the ivshmem PCI device supports four 32-bit |
| registers. |
| |
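| enum ivshmem_registers { |
|     IntrMask = 0,     /* interrupt mask, pin-based interrupts only */ |
|     IntrStatus = 4,   /* interrupt status, pin-based interrupts only */ |
|     IVPosition = 8,   /* read-only guest ID */ |
|     Doorbell = 12     /* written to interrupt another guest */ |
| }; |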
| |
| The first two registers are the interrupt mask and status registers. Mask and |
| status are only used with pin-based interrupts. They are unused with MSI |
| interrupts. |
| |
| Status Register: The status register is set to 1 when an interrupt occurs. |
| |
| Mask Register: The mask register is bitwise ANDed with the interrupt status, |
| and an interrupt is raised if the result is non-zero. However, since 1 is |
| the only value the status will be set to, only the least significant bit of |
| the mask has any effect. Therefore interrupts can be masked by setting that |
| bit to 0 and unmasked by setting it to 1. |
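| |
| The rule above amounts to a single bitwise test. The function name |
| intx_should_raise is hypothetical; this is a sketch of the rule as stated |
| here, not the device implementation. |
| |
```c
#include <stdbool.h>
#include <stdint.h>

/* An interrupt is raised when the bitwise AND of the status and mask
 * registers is non-zero. Since the status is only ever 0 or 1, only
 * bit 0 of the mask matters in practice. */
bool intx_should_raise(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}
```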
| |
| IVPosition Register: The IVPosition register is read-only and reports the |
| guest's ID number. The guest IDs are non-negative integers. When using the |
| server, since the server is a separate process, the VM ID will only be set when |
| the device is ready (shared memory is received from the server and accessible |
| via the device). If the device is not ready, reading IVPosition returns -1. |
| Applications should ensure that they have a valid VM ID before accessing the |
| shared memory. |
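| |
| A guest could check readiness along the following lines before touching the |
| shared memory. This is a minimal sketch that assumes bar0 points to an |
| already mapped BAR0 and that -1 is read back as the 32-bit pattern |
| 0xFFFFFFFF. The names reg_read32, ivposition_get and IVSHMEM_REG_IVPOSITION |
| are hypothetical. |
| |
```c
#include <stdbool.h>
#include <stdint.h>

#define IVSHMEM_REG_IVPOSITION 8 /* byte offset of IVPosition in BAR0 */

/* Read a 32-bit register at byte offset `off` from the mapped BAR0 base. */
uint32_t reg_read32(const volatile void *bar0, unsigned off)
{
    return *(const volatile uint32_t *)((const volatile char *)bar0 + off);
}

/* Returns true and stores the guest ID in *id if the device is ready,
 * false if IVPosition still reads as -1. */
bool ivposition_get(const volatile void *bar0, uint16_t *id)
{
    int32_t pos = (int32_t)reg_read32(bar0, IVSHMEM_REG_IVPOSITION);

    if (pos < 0)
        return false;
    *id = (uint16_t)pos;
    return true;
}
```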
| |
| Doorbell Register: To interrupt another guest, a guest must write to the |
| Doorbell register. The Doorbell register is 32 bits wide, logically divided |
| into two 16-bit fields. The high 16 bits are the ID of the guest to interrupt |
| and the low 16 bits are the interrupt vector to trigger. The semantics of the |
| value written to the doorbell depend on whether the device is using MSI or a |
| regular pin-based interrupt. In short, MSI uses vectors while regular |
| interrupts set the status register. |
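| |
| The layout of the value written to the Doorbell register can be sketched as |
| below. The names doorbell_value, doorbell_ring and IVSHMEM_REG_DOORBELL are |
| hypothetical, and bar0 is assumed to point to an already mapped BAR0. |
| |
```c
#include <stdint.h>

#define IVSHMEM_REG_DOORBELL 12 /* byte offset of Doorbell in BAR0 */

/* High 16 bits: ID of the guest to interrupt; low 16 bits: vector. */
uint32_t doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Write the combined value to the Doorbell register of a mapped BAR0. */
void doorbell_ring(volatile void *bar0, uint16_t peer_id, uint16_t vector)
{
    *(volatile uint32_t *)((volatile char *)bar0 + IVSHMEM_REG_DOORBELL) =
        doorbell_value(peer_id, vector);
}
```
| |
| For example, doorbell_value(3, 1) yields 0x00030001, which targets vector 1 |
| of the guest whose ID is 3. |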
| |
| Regular Interrupts |
| |
| If regular interrupts are used (because a guest does not support MSI or the |
| user chose not to use MSI at startup), the value written to the lower 16 bits |
| of the Doorbell register is arbitrary: any write triggers an interrupt in the |
| destination guest. |
| |
| Message Signalled Interrupts |
| |
| An ivshmem device may support multiple MSI vectors. If so, the lower 16 bits |
| written to the Doorbell register select the MSI vector that will be raised in |
| the destination guest. The value must be a valid vector index, that is, at |
| least 0 and less than the number of vectors the destination guest supports. |
| The number of MSI vectors is configurable, but it is fixed when the VM is |
| started. |
| |
| The important thing to remember with MSI is that it is only a signal: no status |
| is set (since MSI interrupts are not shared). All information other than the |
| interrupt itself should be communicated via the shared memory region. Devices |
| supporting multiple MSI vectors can use different vectors to indicate different |
| events have occurred. The semantics of interrupt vectors are left to the |
| user's discretion. |