 Paravirtualized RDMA Device (PVRDMA)
 ====================================


 1. Description
 ===============
 PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
 It works with its Linux kernel driver as is; no special guest
 modifications are needed.

 While it complies with the VMware device, it can also communicate with bare
 metal RDMA-enabled machines as peers.

 It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

 It does not require the whole guest RAM to be pinned, which allows memory
 over-commit. Migration is not implemented yet, but it should be possible
 with some HW assistance.

 A project presentation accompanies this document:
 - http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf


 2. Setup
 ========


 2.1 Guest setup
 ===============
 Fedora 27+ kernels work out of the box; older distributions
 require updating the kernel to 4.14 to include the pvrdma driver.

 However, the libpvrdma library needed by user-level software is still
 not available as part of the distributions, so the rdma-core library
 needs to be compiled and optionally installed.

 Please follow the instructions at:
   https://github.com/linux-rdma/rdma-core.git


 2.2 Host Setup
 ==============
 The pvrdma backend is an ibdevice interface that can be exposed
 either by a Soft-RoCE (rxe) device on machines with no RDMA device,
 or by an HCA SRIOV function (VF/PF).
 Note that ibdevice interfaces can't be shared between pvrdma devices;
 each device requires a separate instance (rxe or SRIOV VF).


 2.2.1 Soft-RoCE backend (rxe)
 =============================
 A stable version of rxe is required; Fedora 27+ or a Linux
 kernel 4.14+ is preferred.

 The rdma_rxe module is part of the Linux Kernel but not loaded by default.
 Install the User Level library (librxe) following the instructions from:
 https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

 Associate an Ethernet interface with rxe by running:
    rxe_cfg add eth0
 An rxe0 ibdevice interface will be created and can be used as the pvrdma
 backend.


 2.2.2 RDMA device Virtual Function backend
 ==========================================
 Nothing special is required; the pvrdma device can work not only with
 Ethernet links, but also with InfiniBand links.
 All that is needed is an ibdevice with an active port. For Mellanox cards
 this will be something like mlx5_6, which can be used as the backend.


 2.2.3 QEMU setup
 ================
 Install the required RDMA libraries, then configure QEMU with the
 --enable-rdma flag.


 3. Usage
 ========


 3.1 VM Memory settings
 ======================
 Currently the device works only with memory-backed RAM,
 and it must be marked as "shared":
    -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \


 3.2 MAD Multiplexer
 ===================
 MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
 order to overcome the limitation that only a single entity can register with
 the MAD layer to send and receive RDMA-CM MAD packets.

 To build rdmacm-mux run
 # make rdmacm-mux

 The application accepts 3 command line arguments and exposes a UNIX socket
 used to pass control and data to it.
 -d rdma-device-name  Name of RDMA device to register with
 -s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
 -p rdma-device-port  Port number of RDMA device to register with (default 1)
 The final UNIX socket file name is a concatenation of the 3 arguments; for
 example, for device mlx5_0 on port 2 the socket /var/run/rdmacm-mux-mlx5_0-2
 will be created.
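
 The naming rule above can be sketched in Python as follows. The function
 name is hypothetical and the '-' separator is inferred from the mlx5_0
 example; this is an illustration, not code taken from rdmacm-mux.

```python
def mux_socket_path(device, port=1, base="/var/run/rdmacm-mux"):
    """Return the socket path rdmacm-mux is expected to create.

    Assumption: the three parts are joined with '-', as in the example
    /var/run/rdmacm-mux-mlx5_0-2 given in this document.
    """
    return f"{base}-{device}-{port}"


if __name__ == "__main__":
    print(mux_socket_path("mlx5_0", 2))
```

 With the defaults, device rxe0 yields /var/run/rdmacm-mux-rxe0-1, which is
 the path used for the chardev in the example of section 3.6.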

 pvrdma requires this service.

 Please refer to contrib/rdmacm-mux for more details.


 3.3 Service exposed by libvirt daemon
 =====================================
 The RDMA device's GID table is controlled by updating the addresses of the
 device's Ethernet function.
 Usually the first GID entry is determined by the MAC address, the second by
 the first IPv6 address and the third by the IPv4 address. Other entries can
 be added by adding more IP addresses. The reverse also holds: whenever an
 address is removed, the corresponding GID entry is removed.
 The process is done by the network and RDMA stacks. Whenever an address is
 added, the ib_core driver is notified and calls the device driver's add_gid
 function, which in turn updates the device.
 To support this, the pvrdma device hooks into the create_bind and
 destroy_bind HW commands triggered by the pvrdma driver in the guest.

 Whenever a change is made to the pvrdma port's GID table, a special QMP
 message is sent to be processed by libvirt, which updates the address of the
 backend Ethernet device.

 pvrdma requires the libvirt service to be running.
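
 To make the relationship between addresses and GID entries in this section
 more concrete, here is a minimal Python sketch. It assumes the usual RoCE
 conventions, which this document does not spell out: the MAC-based entry is
 a link-local address built with modified EUI-64, and an IPv4 address is
 stored as an IPv4-mapped IPv6 address. The function names and the sample
 IPv4 address are hypothetical.

```python
import ipaddress


def gid_from_mac(mac):
    """Link-local GID built from a MAC address using modified EUI-64.

    Assumption: fe80::/64 prefix, universal/local bit flipped and
    ff:fe inserted between the two halves of the MAC address.
    """
    b = bytes(int(part, 16) for part in mac.split(":"))
    if len(b) != 6:
        raise ValueError("expected a 6-byte MAC address")
    iface_id = bytes([b[0] ^ 0x02]) + b[1:3] + b"\xff\xfe" + b[3:6]
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + iface_id)


def gid_from_ipv4(addr):
    """GID for an IPv4 address, encoded as an IPv4-mapped IPv6 address."""
    packed = ipaddress.IPv4Address(addr).packed
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + packed)


if __name__ == "__main__":
    # MAC address taken from the example in section 3.6.
    print(gid_from_mac("56:b4:44:e9:62:dc"))
    # Hypothetical IPv4 address, printed as the raw 16-byte GID in hex.
    print(gid_from_ipv4("192.168.1.10").packed.hex())
```

 The actual table contents on a given host can differ from this sketch, so
 treat it as an aid to reading the GID table rather than a specification.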


 3.4 PCI devices settings
 ========================
 A RoCE device exposes two functions: an Ethernet function and an RDMA function.
 To support this, the pvrdma device is composed of two PCI functions: an
 Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on PCI
 function 1. The Ethernet function can be used for other Ethernet purposes
 such as IP.


 3.5 Device parameters
 =====================
 - netdev: Specifies the Ethernet device function name on the host, for
   example enp175s0f0. For a Soft-RoCE device (rxe) this would be the Ethernet
   device used to create it.
 - ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
 - mad-chardev: The name of the MAD multiplexer char device.
 - ibport: In case of a multi-port device (such as Mellanox's HCA) this
   specifies the port to use. If not set, 1 will be used.
 - dev-caps-max-mr-size: The maximum size of MR.
 - dev-caps-max-qp:      Maximum number of QPs.
 - dev-caps-max-sge:     Maximum number of SGE elements in WR.
 - dev-caps-max-cq:      Maximum number of CQs.
 - dev-caps-max-mr:      Maximum number of MRs.
 - dev-caps-max-pd:      Maximum number of PDs.
 - dev-caps-max-ah:      Maximum number of AHs.

 Notes:
 - The first 3 parameters are mandatory; the rest have defaults.
 - The parameters prefixed by dev-caps define upper limits, but the final
   values are adjusted to the backend device's limitations.
 - netdev can be extracted from ibdev's sysfs
   (/sys/class/infiniband/<ibdev>/device/net/)
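
 The sysfs lookup mentioned in the last note can be sketched in Python as
 follows. The function name and the sysfs_root parameter are hypothetical;
 the latter exists only so the lookup can be tried against a test directory.

```python
import os


def netdevs_for_ibdev(ibdev, sysfs_root="/sys/class/infiniband"):
    """List the network interfaces found under <ibdev>/device/net/.

    Returns an empty list when the directory does not exist, for
    example when the IB device name is wrong.
    """
    net_dir = os.path.join(sysfs_root, ibdev, "device", "net")
    try:
        return sorted(os.listdir(net_dir))
    except (FileNotFoundError, NotADirectoryError):
        return []


if __name__ == "__main__":
    print(netdevs_for_ibdev("mlx5_0"))
```

 An empty result means no candidate was found at that path, in which case
 the netdev value has to be determined some other way.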


 3.6 Example
 ===========
 Define bridge device with vmxnet3 network backend:
 <interface type='bridge'>
   <mac address='56:b4:44:e9:62:dc'/>
   <source bridge='bridge1'/>
   <model type='vmxnet3'/>
   <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
 </interface>

 Define pvrdma device:
 <qemu:commandline>
   <qemu:arg value='-object'/>
   <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
   <qemu:arg value='-numa'/>
   <qemu:arg value='node,memdev=mb1'/>
   <qemu:arg value='-chardev'/>
   <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
   <qemu:arg value='-device'/>
   <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
 </qemu:commandline>
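
 When the command line is generated by a script, the value passed after
 '-device' can be assembled as in the Python sketch below. The helper name
 is hypothetical, and writing optional parameters with '_' in place of '-'
 is only a convenience of this sketch.

```python
def pvrdma_device_arg(netdev, ibdev, mad_chardev, addr=None, **optional):
    """Build the value passed after '-device' for a pvrdma device.

    netdev, ibdev and mad_chardev are the mandatory parameters listed
    in section 3.5; 'optional' holds further parameters such as ibport
    or the dev-caps limits, with '_' written in place of '-'.
    """
    for name, value in (("netdev", netdev), ("ibdev", ibdev),
                        ("mad-chardev", mad_chardev)):
        if not value:
            raise ValueError(f"{name} is mandatory")
    parts = ["pvrdma"]
    if addr is not None:
        parts.append(f"addr={addr}")
    parts += [f"ibdev={ibdev}", f"netdev={netdev}",
              f"mad-chardev={mad_chardev}"]
    parts += [f"{k.replace('_', '-')}={v}" for k, v in optional.items()]
    return ",".join(parts)


if __name__ == "__main__":
    print(pvrdma_device_arg("bridge0", "rxe0", "mads", addr="10.1"))
```

 Called with the values of the example above, the helper reproduces the
 '-device' string shown in the XML.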


 4. Implementation details
 =========================


 4.1 Overview
 ============
 The device acts as a proxy between the guest driver and the host
 ibdevice interface.
 On configuration path:
  - For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
    request a resource from the backend interface, maintaining a 1-1 mapping
    between the guest and host.
 On data path:
  - Every post_send/receive received from the guest will be converted into
    a post_send/receive for the backend. The buffer data will not be touched
    or copied, resulting in near bare-metal performance for large enough buffers.
  - Completions from the backend interface will result in completions for
    the pvrdma device.


 4.2 PCI BARs
 ============
 PCI BARs:
     BAR 0 - MSI-X
         MSI-X vectors:
             (0) Command - used when execution of a command is completed.
             (1) Async - not in use.
             (2) Completion - used when a completion event is placed in
                 the device's CQ ring.
     BAR 1 - Registers
         --------------------------------------------------------
         | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
         --------------------------------------------------------
             DSR - Address of driver/device shared memory used
                   for the command channel, used for passing:
                     - General info such as driver version
                     - Address of 'command' and 'response'
                     - Address of async ring
                     - Address of device's CQ ring
                     - Device capabilities
             CTL - Device control operations (activate, reset etc)
             IMR - Set interrupt mask
             REQ - Command execution register
             ERR - Operation status

     BAR 2 - UAR
         ---------------------------------------------------------
         | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
         ---------------------------------------------------------
             - Offset 0 used for QP operations (send and recv)
             - Offset 4 used for CQ operations (arm and poll)


 4.3 Major flows
 ===============

 4.3.1 Create CQ
 ===============
     - Guest driver
         - Allocates pages for CQ ring
         - Creates page directory (pdir) to hold CQ ring's pages
         - Initializes CQ ring
         - Initializes 'Create CQ' command object (cqe, pdir etc)
         - Copies the command to 'command' address
         - Writes 0 into REQ register
     - Device
         - Reads the request object from the 'command' address
         - Allocates CQ object and initializes CQ ring based on pdir
         - Creates the backend CQ
         - Writes operation status to ERR register
         - Posts command-interrupt to guest
     - Guest driver
         - Reads the HW response code from ERR register
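
 The handshake above can be modelled with a small Python toy. It is not the
 QEMU implementation: the class, the dictionary-based command object and the
 status values (0 for success, 1 for failure) are assumptions made for
 illustration, and the interrupt is reduced to an entry in a list.

```python
class ToyDevice:
    """Toy model of the command channel; not the QEMU implementation."""

    def __init__(self):
        self.command = None    # stands in for the 'command' address
        self.err = None        # stands in for the ERR register
        self.interrupts = []   # interrupts posted to the guest
        self.cqs = {}          # CQ handle -> ring description

    def write_req(self, value):
        """Guest write to REQ: execute the command held in 'command'."""
        if value != 0:
            return
        cmd = self.command
        if cmd and cmd.get("op") == "create_cq" and cmd.get("cqe", 0) > 0:
            handle = len(self.cqs)
            self.cqs[handle] = {"cqe": cmd["cqe"], "pdir": cmd["pdir"]}
            self.err = 0       # assumption: 0 means success
        else:
            self.err = 1       # hypothetical non-zero failure code
        self.interrupts.append("command")


def guest_create_cq(dev, cqe, pdir):
    """Guest side: fill the command, write 0 to REQ, read the status."""
    dev.command = {"op": "create_cq", "cqe": cqe, "pdir": pdir}
    dev.write_req(0)
    return dev.err


if __name__ == "__main__":
    dev = ToyDevice()
    print(guest_create_cq(dev, 64, 0x1000), dev.interrupts)
```

 The toy runs synchronously, whereas in the flow above the guest driver
 reads ERR only after the command interrupt has been posted.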

 4.3.2 Create QP
 ===============
     - Guest driver
         - Allocates pages for send and receive rings
         - Creates page directory(pdir) to hold the ring's pages
         - Initializes 'Create QP' command object (max_send_wr,
           send_cq_handle, recv_cq_handle, pdir etc)
         - Copies the object to 'command' address
         - Writes 0 into REQ register
     - Device
         - Reads the request object from 'command' address
         - Allocates the QP object and initializes
             - Send and recv rings based on pdir
             - Send and recv ring state
         - Creates the backend QP
         - Writes the operation status to ERR register
         - Posts command-interrupt to guest
     - Guest driver
         - Reads the HW response code from ERR register

 4.3.3 Post receive
 ==================
     - Guest driver
         - Initializes a wqe and places it on recv ring
         - Writes qpn|qp_recv_bit (31) to QP offset in UAR
     - Device
         - Extracts qpn from UAR
         - Walks through the ring and does the following for each wqe
             - Prepares the backend CQE context to be used when
               receiving completion from backend (wr_id, op_code, emu_cq_num)
             - For each sge prepares backend sge
             - Calls backend's post_recv
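
 The doorbell value written to the UAR in this flow can be illustrated with
 the Python sketch below. Bit 31 as the receive flag comes from the flow
 above; the 24-bit mask for the QP number and the function names are
 assumptions of this sketch.

```python
QP_RECV_BIT = 1 << 31      # bit 31, as stated in the flow above
HANDLE_MASK = 0x00FFFFFF   # assumption: low 24 bits carry the QP number


def encode_recv_doorbell(qpn):
    """Value the guest writes to the QP offset of the UAR for post-recv."""
    if not 0 <= qpn <= HANDLE_MASK:
        raise ValueError("qpn does not fit in the handle field")
    return qpn | QP_RECV_BIT


def decode_doorbell(value):
    """Device side: split a doorbell value into (qpn, is_recv)."""
    return value & HANDLE_MASK, bool(value & QP_RECV_BIT)


if __name__ == "__main__":
    value = encode_recv_doorbell(5)
    print(hex(value), decode_doorbell(value))
```

 Packing the QP number and the operation flag into one 32-bit value lets the
 guest trigger the operation with a single register write.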

 4.3.4 Process backend events
 ============================
     - Done by a dedicated thread used to process backend events;
       at initialization it is attached to the device and creates
       the communication channel.
     - Thread main loop:
         - Polls for completions
         - Extracts emu_cq_num, wr_id and op_code from context
         - Writes CQE to CQ ring
         - Writes CQ number to device CQ
         - Sends completion-interrupt to guest
         - Deallocates context
         - Acks the event to backend
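
 One iteration of the main loop above can be sketched in Python as follows.
 This is a toy under stated assumptions: rings are plain lists, the
 interrupt is an entry in a list, and the function and key names are
 hypothetical apart from emu_cq_num, wr_id and op_code, which mirror the
 context fields named in section 4.3.3. Polling and acking the backend are
 left out.

```python
def handle_completion(ctx, status, cq_rings, device_cq_ring, interrupts):
    """Turn one backend completion into a guest-visible completion.

    ctx is the per-WQE context saved when the request was posted.
    """
    cq_num = ctx["emu_cq_num"]
    cqe = {"wr_id": ctx["wr_id"], "opcode": ctx["op_code"], "status": status}
    cq_rings.setdefault(cq_num, []).append(cqe)  # write CQE to CQ ring
    device_cq_ring.append(cq_num)                # write CQ number to device CQ
    interrupts.append("completion")              # completion interrupt
    ctx.clear()                                  # stands in for deallocation


if __name__ == "__main__":
    rings, dev_ring, irqs = {}, [], []
    ctx = {"emu_cq_num": 3, "wr_id": 42, "op_code": "recv"}
    handle_completion(ctx, 0, rings, dev_ring, irqs)
    print(rings, dev_ring, irqs)
```

 Keeping the guest's wr_id and CQ number in the context is what allows the
 backend completion to be matched to the right guest CQ.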


 5. Limitations
 ==============
 - The device is limited to the features of the VMware device API that the
   guest Linux driver implements.
 - The memory registration mechanism requires an mremap call for every page in
   the buffer in order to map it to a contiguous virtual address range. Since
   this is not the data path it should not matter much. If the default max MR
   size is increased, be aware that memory registration can take up to 0.5
   seconds for 1GB of memory.
 - The device requires the target page size to be the same as the host page
   size, otherwise it will fail to init.
 - QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
   attached, so it can't work with huge pages. The limitation will be addressed
   in the future; however, QEMU allocates guest RAM with MADV_HUGEPAGE, so if
   there are enough huge pages available, QEMU will use them. QEMU will fail to
   init if the requirements are not met.


 6. Performance
 ==============
 By design the pvrdma device exits on each post-send/receive, so for small buffers
 the performance is affected; however, for medium buffers it becomes close to
 bare metal, and from 1MB buffers and up it reaches bare-metal performance.
 (Tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device.)

 All the above assumes no memory registration is done on data path.