======================================================
Device Specification for Inter-VM shared memory device
======================================================

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.

For information on configuring the ivshmem device on the QEMU
command line, see :doc:`../system/devices/ivshmem`.

The ivshmem PCI device's guest interface
========================================

The device has vendor ID 1af4, device ID 1110, revision 1. Before
QEMU 2.6.0, it had revision 0.

PCI BARs
--------

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256 Byte MMIO)
- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices. This way,
  you have access to the shared memory in the guest and can use it as
  you see fit.

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and BAR1. You will most likely want to write a
  kernel driver to handle interrupts. This requires the device to be
  configured for interrupts.

Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
configured for interrupts. It becomes safely accessible only after
the ivshmem server has provided the shared memory. These devices have
PCI revision 0 rather than 1. On these devices, guest software should
wait for the IVPosition register (described below) to become
non-negative before accessing BAR2.

Revision 0 of the device cannot tell guest software whether
it is configured for interrupts.

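The wait for a non-negative IVPosition on revision 0 devices could be
sketched as follows. This is only an illustration: ``read_bar0`` is a
hypothetical callback standing in for whatever MMIO access the guest
software has, the register is assumed to be read as a little-endian
32-bit value, and the timeout and polling interval are arbitrary.

```python
import struct
import time

IVPOSITION_OFFSET = 8  # BAR0 offset of IVPosition, per the register table


def wait_for_ivposition(read_bar0, timeout=5.0, interval=0.01):
    """Poll IVPosition until it is non-negative, then return it.

    read_bar0(offset, size) -> bytes is a hypothetical accessor for BAR0;
    it is not part of any real API.
    """
    deadline = time.monotonic() + timeout
    while True:
        raw = read_bar0(IVPOSITION_OFFSET, 4)
        # Interpret the register as signed so that -1 is recognizable.
        (value,) = struct.unpack("<i", raw)
        if value >= 0:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("IVPosition is still negative")
        time.sleep(interval)
```

Only once this returns would the caller go on to touch BAR2.
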
PCI device registers
--------------------

BAR 0 contains the following registers:

::

    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt (rev 0)
                                               reserved (rev 1)
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt (rev 0)
                                               reserved (rev 1)
                                        bit 1..31: reserved
        8     4   read-only   0 or ID   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access". Reserved bits should be ignored on read, and preserved on
write.

In revision 0 of the device, the Interrupt Status and Interrupt Mask
registers together control the legacy INTx interrupt when the device
has no MSI-X capability: INTx is asserted when the bit-wise AND of
Status and Mask is non-zero. Interrupt Status Register bit 0 becomes 1
when an interrupt request from a peer is received. Reading the
register clears it.

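The revision 0 INTx behaviour can be modelled in a few lines. The
class below is a toy model for reasoning about the rules in the
paragraph above, not QEMU's implementation; the class and method names
are made up for this sketch, and it assumes a device without MSI-X
capability.

```python
class Rev0InterruptModel:
    """Toy model of revision 0 Interrupt Mask / Status (no MSI-X)."""

    PEER_INTERRUPT = 1 << 0  # bit 0 of both registers

    def __init__(self):
        # Both registers read 0 after reset.
        self.mask = 0
        self.status = 0

    def peer_interrupt(self):
        """A peer requested an interrupt: Status bit 0 becomes 1."""
        self.status |= self.PEER_INTERRUPT

    def intx_asserted(self):
        """INTx is asserted while (Status AND Mask) is non-zero."""
        return (self.status & self.mask) != 0

    def read_status(self):
        """Reading the Status register returns it and clears it."""
        value = self.status
        self.status = 0
        return value
```

With the mask left at 0, a peer interrupt sets the status bit but does
not assert INTx; setting mask bit 0 asserts it until the status
register is read.
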
IVPosition Register: if the device is not configured for interrupts,
this is zero. Else, it is the device's ID (between 0 and 65535).

Before QEMU 2.6.0, the register may read -1 for a short while after
reset. These devices have PCI revision 0 rather than 1.

There is no good way for software to find out whether the device is
configured for interrupts. A positive IVPosition means interrupts,
but zero could be either.

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.

If the device is not configured for interrupts, the write is ignored.

If the interrupt hasn't completed setup, the write is ignored. The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected than the write requires, the write is
ignored. The device cannot tell guest software what peers are
connected, or how many interrupt vectors are connected.

Otherwise, the peer's interrupt for this vector becomes pending. There
is no way for software to clear the pending bit, and a polling mode of
operation is therefore impossible.

If the peer is a revision 0 device without MSI-X capability, its
Interrupt Status register is set to 1. This asserts INTx unless
masked by the Interrupt Mask register. The device cannot
communicate the interrupt vector to guest software then.

With multiple MSI-X vectors, different vectors can be used to indicate
that different events have occurred. The semantics of interrupt vectors
are left to the application.

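As an illustration of the Doorbell layout described above, the value
to write can be composed like this. The function name and the range
checks are assumptions of this sketch; only the bit layout and the
BAR0 offset come from the register table.

```python
DOORBELL_OFFSET = 12  # BAR0 offset of the Doorbell register


def doorbell_value(peer_id, vector):
    """Compose a Doorbell value: peer ID in bits 16..31, vector in bits 0..15."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector
```

For example, interrupting peer 3 on vector 1 means writing
``doorbell_value(3, 1)``, that is ``0x00030001``, to offset 12 of BAR0.
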
Interrupt infrastructure
========================

When configured for interrupts, the peers share eventfd objects in
addition to shared memory. The shared resources are managed by an
ivshmem server.

The ivshmem server
------------------

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server

- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

The first client to connect to the server receives ID zero.

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue. They can
communicate with each other normally, but won't receive disconnect
notifications when a peer disconnects, and no new clients can connect.
There is no way for the clients to connect to a restarted server. The
device cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/. It is not meant for
production use. It assumes all clients use the same number of
interrupt vectors.

A standalone client is in contrib/ivshmem-client/. It can be useful
for debugging.

The ivshmem Client-Server Protocol
----------------------------------

An ivshmem device configured for interrupts connects to an ivshmem
server. This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8 byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS. Both
client and server close the connection on error.

Note: QEMU currently doesn't close the connection right on error, but
only when the character device is destroyed.

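The message framing can be sketched with Python's ``socket.send_fds``
and ``socket.recv_fds`` (Python 3.9 or later, Unix only). The helper
names ``send_msg`` and ``recv_msg`` are made up for this example, and
the sketch assumes each 8-byte message arrives in a single read, which
a robust client would not rely on.

```python
import os
import socket
import struct

MSG = struct.Struct("<q")  # one 8-byte little-endian signed number


def send_msg(sock, value, fd=None):
    """Send one message, optionally accompanied by a file descriptor."""
    payload = MSG.pack(value)
    if fd is None:
        sock.sendall(payload)
    else:
        # send_fds attaches the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(sock, [payload], [fd])


def recv_msg(sock):
    """Receive one message; return (value, fd), with fd None if absent."""
    data, fds, _flags, _addr = socket.recv_fds(sock, MSG.size, 1)
    if len(data) != MSG.size:
        # Short read or closed connection: treat as a protocol error.
        for fd in fds:
            os.close(fd)
        raise ConnectionError("short read or connection closed")
    (value,) = MSG.unpack(data)
    return value, (fds[0] if fds else None)
```

A pair created with ``socket.socketpair(socket.AF_UNIX,
socket.SOCK_STREAM)`` is enough to try the two helpers against each
other.
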
On connect, the server sends the following messages in order:

1. The protocol version number, currently zero. The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID. This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any. This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times. Each repetition is accompanied by one file
   descriptor. These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order. If the client is configured for fewer
   vectors, it closes the extra file descriptors. If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup. This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor. These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order. If the client is configured for fewer vectors, it closes
   the extra file descriptors. If it is configured for more, the
   extra vectors remain unconnected.

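From the server's side, the connect sequence in steps 1 to 5 could be
planned as below. This is a sketch, not the code in
contrib/ivshmem-server/: ``plan_connect`` is a hypothetical helper, the
file descriptors are plain integers, and it assumes that the same
eventfd is handed to peers for sending and to its owner for receiving.

```python
PROTOCOL_VERSION = 0


def plan_connect(new_id, shm_fd, new_fds, existing):
    """Plan the messages sent when a client connects.

    new_fds:  the new client's eventfds, one per vector, in vector order.
    existing: dict mapping each connected peer's ID to its eventfd list.
    Returns (to_new, to_others): to_new is a list of (value, fd) pairs for
    the new client; to_others maps each existing peer ID to its own list.
    """
    to_new = [
        (PROTOCOL_VERSION, None),  # step 1: protocol version
        (new_id, None),            # step 2: the client's ID
        (-1, shm_fd),              # step 3: shared memory descriptor
    ]
    # Step 4: connect notifications for the peers that already exist.
    for peer_id, peer_fds in existing.items():
        to_new.extend((peer_id, fd) for fd in peer_fds)
    # Step 5: interrupt setup, the client's own ID once per vector.
    to_new.extend((new_id, fd) for fd in new_fds)
    # The existing peers are told about the new client.
    to_others = {
        peer_id: [(new_id, fd) for fd in new_fds] for peer_id in existing
    }
    return to_new, to_others
```

Each ``(value, fd)`` pair corresponds to one 8-byte message, with the
descriptor passed via SCM_RIGHTS when ``fd`` is not ``None``.
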
From then on, the server sends these kinds of messages:

6. Connection / disconnection notification. This is a peer ID.

   - If the number comes with a file descriptor, it's a connection
     notification, exactly like in step 4.

   - Else, it's a disconnection notification for the peer with that ID.

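A client's handling of steps 1 to 6 might look roughly like the
following. The sketch works on already decoded ``(value, fd)`` pairs,
with ``fd`` set to ``None`` when no descriptor came along; ``ClientState``
and ``consume`` are names invented here, and a real client would also
close descriptors it drops and cap the number of vectors it accepts.

```python
from dataclasses import dataclass, field


@dataclass
class ClientState:
    own_id: int = -1
    shm_fd: int = -1
    rx_fds: list = field(default_factory=list)  # own vectors, in order
    peers: dict = field(default_factory=dict)   # peer ID -> fds, in order


def consume(messages):
    """Apply a sequence of (value, fd) messages and return the state."""
    it = iter(messages)
    state = ClientState()

    version, fd = next(it)                      # step 1
    if version != 0 or fd is not None:
        raise ValueError("unsupported protocol version")
    own_id, fd = next(it)                       # step 2
    if not 0 <= own_id <= 0xFFFF or fd is not None:
        raise ValueError("bad client ID")
    state.own_id = own_id
    marker, fd = next(it)                       # step 3
    if marker != -1 or fd is None:
        raise ValueError("expected -1 with shared memory descriptor")
    state.shm_fd = fd

    for value, fd in it:                        # steps 4, 5 and 6
        if not 0 <= value <= 0xFFFF:
            raise ValueError("bad peer ID")
        if fd is None:
            state.peers.pop(value, None)        # disconnection notification
        elif value == own_id:
            state.rx_fds.append(fd)             # interrupt setup
        else:
            state.peers.setdefault(value, []).append(fd)  # peer connect
    return state
```

The position of a descriptor in ``rx_fds`` or in a peer's list is its
vector number, since the server sends them in vector order.
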
Known bugs:

* The protocol changed incompatibly in QEMU 2.5. Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.

The ivshmem Client-Client Protocol
----------------------------------

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.