Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver AS IS, with no need for any special
guest modifications.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit; and, although not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf


2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
https://github.com/linux-rdma/rdma-core.git
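
A minimal build sketch, assuming a typical development environment (the
authoritative dependency list and steps are in the repository's README):
  $ git clone https://github.com/linux-rdma/rdma-core.git
  $ cd rdma-core
  $ bash build.sh
The resulting libraries can then be used from the build directory or
installed system-wide.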


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
  rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.
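
One way to verify the result, assuming the libibverbs utilities are
installed, is to list the available ibdevices:
  $ rxe_cfg status
  $ ibv_devices
rxe0 should appear in the output of both commands.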


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
this will be something like mlx5_6, which can be used as the backend.
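
To check for an active port one can use, for example, the standard
ibv_devinfo tool (the device name here is illustrative):
  $ ibv_devinfo -d mlx5_6
A port reported with "state: PORT_ACTIVE" can serve as the backend.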


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag; the required RDMA
libraries must be installed first.
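
For example (additional configure options depend on the local setup):
  $ ./configure --enable-rdma
  $ make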



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backend RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface for VMs in
order to overcome the limitation where only a single entity can register
with the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded; otherwise the rdmacm-mux service will fail to start.
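
A quick way to check for them, and to unload them if needed (rdma_cm must
be removed before ib_cm, and other modules may hold references to both):
  $ lsmod | grep -E 'rdma_cm|ib_cm'
  # rmmod rdma_cm ib_cm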

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so,
for example, for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.
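
For example, a hypothetical invocation serving device rxe0 on its default
port:
  # ./rdmacm-mux -d rxe0 -p 1
This listens on /var/run/rdmacm-mux-rxe0-1, which matches the chardev
socket path used in the example of section 3.6.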

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also applies: whenever an
address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.
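
As an illustration, adding an IP address to the backend Ethernet function
on the host creates a corresponding GID entry, which can be inspected
through sysfs (the device name, port and GID index are examples):
  # ip addr add 192.168.1.10/24 dev enp175s0f0
  $ cat /sys/class/infiniband/mlx5_0/ports/1/gids/2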

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of
the backend Ethernet device.

pvrdma requires the libvirt service to be up.


3.4 PCI devices settings
========================
A RoCE device exposes two functions: an Ethernet function and an RDMA
function.
To support this, the pvrdma device is composed of two PCI functions, an
Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
PCI function 1. The Ethernet function can be used for other Ethernet
purposes such as IP.
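
On a raw QEMU command line this pairing looks roughly as follows (the slot
number and backend names are illustrative; see the complete libvirt example
in section 3.6):
  -device vmxnet3,addr=10.0,multifunction=on \
  -device pvrdma,addr=10.1,ibdev=rxe0,netdev=eth0,mad-chardev=mads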


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, port 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp: Maximum number of QPs.
- dev-caps-max-cq: Maximum number of CQs.
- dev-caps-max-mr: Maximum number of MRs.
- dev-caps-max-pd: Maximum number of PDs.
- dev-caps-max-ah: Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have defaults.
- The dev-caps-* parameters define the upper limits, but the final values
  are adjusted according to the backend device's limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)
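
For example (the device names are illustrative):
  $ ls /sys/class/infiniband/mlx5_0/device/net/
  enp175s0f0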


3.6 Example
===========
Define a bridge device with the vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define the pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge1,mad-chardev=mads'/>
</qemu:commandline>



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
- For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
  request a resource from the backend interface, maintaining a 1-1 mapping
  between the guest and the host.
On the data path:
- Every post_send/receive received from the guest will be converted into
  a post_send/receive for the backend. The buffers' data will not be touched
  or copied, resulting in near bare-metal performance for large enough
  buffers.
- Completions from the backend interface will result in completions for
  the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in
            the device's CQ ring.
BAR 1 - Registers
        --------------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        --------------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
              - General info such as driver version
              - Address of 'command' and 'response'
              - Address of async ring
              - Address of device's CQ ring
              - Device capabilities
        CTL - Device control operations (activate, reset etc.)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        ---------------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc.)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the CQ object and initializes the CQ ring based on pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc.)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - Send and recv rings based on pdir
    - Send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts the QEMU _cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the features of the VMware device API
  that the guest Linux driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size; otherwise it will fail to initialize.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if enough huge pages are available, QEMU will use them.
  QEMU will fail to initialize if these requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance. (Tested with 2 VMs, with the pvrdma devices connected to 2 VFs
of the same device.)

All the above assumes no memory registration is done on the data path.
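
For reference, such a measurement can be reproduced with the standard
perftest tools, e.g. ib_send_bw (the guest ibdevice name and message size
are examples; ibv_devices in the guest shows the actual name):
  $ ib_send_bw -d <guest_ibdev> -s 1048576              (on the server VM)
  $ ib_send_bw -d <guest_ibdev> -s 1048576 <server_ip>  (on the client VM)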