Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as-is; no special guest modifications
are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines, and it does not require an RDMA HCA in the
host: it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit, and, while not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions require
updating the kernel to 4.14 to get the pvrdma driver.

However, the libpvrdma library needed by user-level software is still not
available as part of the distributions, so the rdma-core library needs to
be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git
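
As a rough sketch (the build.sh helper and the ./build output directory
reflect how rdma-core builds at the time of writing; the upstream
instructions above are authoritative):
  git clone https://github.com/linux-rdma/rdma-core.git
  cd rdma-core
  bash build.sh      # binaries and libraries are placed under ./build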


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed either by a
Soft-RoCE (rxe) device on machines with no RDMA device, or by an HCA SRIOV
function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices; each
one requires a separate instance (rxe or SRIOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux kernel 4.14+ is
preferred.

The rdma_rxe module is part of the Linux kernel but is not loaded by default.
Install the user-level library (librxe) following the instructions at:
  https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
  rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.
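
To double-check that the backend is usable, list the ibdevices with the
rdma-core user-level utilities (assuming they are installed):
  ibv_devices        # rxe0 should appear in the device list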


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can serve as the backend.
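
Before using it as the backend, it can help to confirm that the port is
active, e.g. with rdma-core's ibv_devinfo:
  ibv_devinfo -d mlx5_6      # look for "state: PORT_ACTIVE"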


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing the required
RDMA libraries.
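
For example, from the QEMU source tree (the target list is illustrative):
  ./configure --enable-rdma --target-list=x86_64-softmmu
  make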



3. Usage
========
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \

The pvrdma device is composed of two functions:
 - Function 0 is a vmxnet3 Ethernet device which is redundant in the guest
   but is required to pass the ibdevice GID using its MAC.
   Examples:
     For an rxe backend using the eth0 interface, use its MAC:
       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
     For an SRIOV VF, take the MAC of the Ethernet interface exposed by it:
       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
 - Function 1 is the actual device:
     -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
   where the ibdevice can be rxe or an RDMA VF (e.g. mlx5_4)
Note: Pay special attention that the GID at backend-gid-idx matches the
vmxnet3 device's MAC.
The rules of conversion are part of the RoCE spec (the MAC is expanded into
an EUI-64: ff:fe is inserted in the middle and the universal/local bit of
the first byte is flipped), but since manual conversion is not required,
spotting problems is not hard:
  Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
           MAC: 7c:fe:90:cb:74:3a
  Note the difference between the first byte of the MAC and the GID.
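
Putting it all together, an illustrative command line for an rxe backend
(the MAC, GID index and port below are examples and must match the actual
setup; the backend's GID table can be inspected with 'ibv_devinfo -v'):
   qemu-system-x86_64 -m 1G \
     -object memory-backend-ram,id=mb1,size=1G,share \
     -numa node,memdev=mb1 \
     -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC> \
     -device pvrdma,addr=<slot>.1,backend-dev=rxe0,backend-gid-idx=0,backend-port=1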



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
- For every hardware resource request (PD/QP/CQ/...) the pvrdma device
  requests a resource from the backend interface, maintaining a 1-1 mapping
  between the guest and the host.
On the data path:
- Every post_send/receive received from the guest is converted into
  a post_send/receive for the backend. The buffer data is not touched or
  copied, resulting in near bare-metal performance for large enough buffers.
- Completions from the backend interface result in completions for
  the pvrdma device.
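
The 1-1 mapping can be observed from the host side; for example, the
iproute2 'rdma' tool (if available) lists the backend resources, which
should grow by one QP for every QP created in the guest:
  rdma resource show qp link rxe0/1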


4.2 PCI BARs
============
BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in
            the device's CQ ring.
BAR 1 - Registers
        -----------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        -----------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
                - General info such as driver version
                - Address of 'command' and 'response'
                - Address of async ring
                - Address of device's CQ ring
                - Device capabilities
        CTL - Device control operations (activate, reset etc.)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

BAR 2 - UAR
        -----------------------------------------------------
        | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
        -----------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc.)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the CQ object and initializes the CQ ring based on the pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc.)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - The send and recv rings based on the pdir
    - The send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (bit 31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge, prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts emu_cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend



5. Limitations
==============
- The device is limited by the guest Linux driver's implementation of the
  VMware device API features.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since this
  is not on the data path it should not matter much. If the default max MR
  size is increased, be aware that memory registration can take up to 0.5
  seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it gets
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
physical device.)

All of the above assumes no memory registration is done on the data path.
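
For example, bandwidth between two guests can be measured with the perftest
tools (assuming the package is installed in both guests; the flags and the
1MB message size are illustrative):
  ib_send_bw -s 1048576                  # in the server guest
  ib_send_bw -s 1048576 <server IP>      # in the client guest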