(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010) with additional performance details
is linked on the QEMU wiki above.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of its significantly lower latency and higher throughput compared to TCP/IP.
This is because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take an
unpredictable amount of time to complete if the amount of memory transmitted
during each live migration iteration round cannot keep pace with the rate of
dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA over
Converged Ethernet) and Infiniband-based. This implementation of migration
using RDMA is capable of using both technologies because it uses the
OpenFabrics OFED software stack, which abstracts out the programming model
irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor to learn
how to verify that you have the OFED software stack installed in your
environment. A working build of QEMU must be able to link against the
"librdmacm" and "libibverbs" libraries and their development headers in
order to use RDMA migration.
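
For example, on a typical Linux host you might verify that the libraries
are present and build QEMU with RDMA support roughly as follows (a sketch
only: package names and the availability of the --enable-rdma configure
option depend on your distribution and QEMU version):

$ ldconfig -p | grep -E 'librdmacm|libibverbs'   # both should be listed
$ ./configure --enable-rdma
$ make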

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If that is the case,
then the use of RDMA is discouraged and standard TCP migration is
recommended instead.

Experimental: Next, decide whether you want dynamic page registration
(the default) or to pin all memory up front. For example, if you have
an 8GB RAM virtual machine, but only 1GB is in active use, then enabling
the rdma-pin-all capability will cause all 8GB to be pinned and resident
in memory. This capability mostly affects the bulk-phase round of the
migration and can be enabled for extremely high-performance RDMA hardware
using the following command:

QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this (using an idle VM as in the previous example)
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning *all*
of the memory of your virtual machine in the kernel is very expensive and
may extend the initial bulk iteration time by many seconds, thus extending
the total migration time. However, this will not affect the determinism
or predictability of your migration: you will still gain the benefits of
pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d rdma:host:port
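
Putting these steps together, a complete (hypothetical) run might look
like the following, assuming the destination host is reachable as
"dest-host" and port 4444 was chosen arbitrarily:

Destination (command line):
$ qemu-system-x86_64 ... -incoming rdma:dest-host:4444

Source (QEMU Monitor):
$ migrate_set_parameter max-bandwidth 40g
$ migrate_set_capability rdma-pin-all on    # optional
$ migrate -d rdma:dest-host:4444
$ info migrate                              # check progress and throughput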

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA:
Using a 40gbps infiniband link, performing a worst-case stress test on
an 8GB RAM virtual machine:

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, with the same 8GB RAM virtual machine, all 8GB of memory in
active use, the VM itself completely idle, and the same 40 gbps infiniband
link:

1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, even without this feature, all of
the memory will already have been registered in advance during the bulk
round and does not need to be re-registered during the successive
iteration rounds.

RDMA Migration Protocol Description:
====================================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c); a librdmacm sketch of the
receiver side follows the list:

1. Receiver and Sender are started (command line or libvirt):
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)
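
A rough sketch of the receiver-side portion of this setup using librdmacm
is shown below (illustrative only: error handling and the queue-pair/RQ
setup of step 2 are omitted, and the function name rdma_migration_listen
is hypothetical):

    #include <rdma/rdma_cma.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    static int rdma_migration_listen(uint16_t port)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listen_id;
        struct rdma_cm_event *event;
        struct rdma_conn_param cp;
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);

        rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
        rdma_bind_addr(listen_id, (struct sockaddr *)&addr);
        rdma_listen(listen_id, 1);              /* step 3: listen()           */

        rdma_get_cm_event(ec, &event);          /* wait for CONNECT_REQUEST   */
        /* step 2: create the QP and post two RQ work requests on event->id  */
        memset(&cp, 0, sizeof(cp));
        rdma_accept(event->id, &cp);            /* step 5: accept()           */
        rdma_ack_cm_event(event);
        return 0;
    }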

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).

Header:
    * Length               (of the data portion, uint32, network byte order)
    * Type                 (what command to perform, uint32, network byte order)
    * Repeat               (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself,
so that the protocol remains compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol must
check this field and register all requests found in the array of commands located
in the data portion and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.
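
For illustration only, this header could be expressed as a C structure
such as the following (the type and field names here are a sketch and not
necessarily those used in QEMU's RDMA migration code):

    #include <stdint.h>

    typedef struct {
        uint32_t len;     /* Length of the data portion (network byte order)  */
        uint32_t type;    /* Command to perform (network byte order)          */
        uint32_t repeat;  /* Number of same-type commands in the data portion */
    } RDMAControlHeader;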

The 'type' field has 12 different command values (a possible C mapping
is sketched after this list):
     1. Unused
     2. Error                      (sent to the source during bad things)
     3. Ready                      (control-channel is available)
     4. QEMU File                  (for sending non-live device state)
     5. RAM Blocks request         (used right after connection setup)
     6. RAM Blocks result          (used right after connection setup)
     7. Compress page              (zap zero page and skip registration)
     8. Register request           (dynamic chunk registration)
     9. Register result            ('rkey' to be used by sender)
    10. Register finished          (registration for current iteration finished)
    11. Unregister request         (unpin previously registered memory)
    12. Unregister finished        (confirmation that unpin completed)
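
One possible C mapping of these values is sketched below, with 'Unused'
taking the value 0; the enumerator names are illustrative and may not
match the actual QEMU source exactly:

    enum {
        RDMA_CONTROL_NONE = 0,              /*  1. Unused              */
        RDMA_CONTROL_ERROR,                 /*  2. Error               */
        RDMA_CONTROL_READY,                 /*  3. Ready               */
        RDMA_CONTROL_QEMU_FILE,             /*  4. QEMU File           */
        RDMA_CONTROL_RAM_BLOCKS_REQUEST,    /*  5. RAM Blocks request  */
        RDMA_CONTROL_RAM_BLOCKS_RESULT,     /*  6. RAM Blocks result   */
        RDMA_CONTROL_COMPRESS,              /*  7. Compress page       */
        RDMA_CONTROL_REGISTER_REQUEST,      /*  8. Register request    */
        RDMA_CONTROL_REGISTER_RESULT,       /*  9. Register result     */
        RDMA_CONTROL_REGISTER_FINISHED,     /* 10. Register finished   */
        RDMA_CONTROL_UNREGISTER_REQUEST,    /* 11. Unregister request  */
        RDMA_CONTROL_UNREGISTER_FINISHED,   /* 12. Unregister finished */
    };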

A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 and 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically (a combined pseudocode sketch of both functions follows the
two lists below):

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match what we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands (#9 above) back to the sender, which
   hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection (described below).)
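
The pseudocode below summarizes the two flows side by side (this is an
illustrative sketch, not the actual QEMU implementation; helpers such as
send_ready(), post_recv(), post_send() and wait_for_completion() are
hypothetical stand-ins for the real ibverbs calls):

    /* Receiver side: qemu_rdma_exchange_recv() */
    send_ready();                       /* 1. tell the sender we are ready   */
    post_recv();                        /* 2. replace the used RQ entry      */
    wait_for_completion();              /* 3-4. block until the SEND arrives */
    verify(header.type == expected);    /* 5. check type (and version)       */

    /* Sender side: qemu_rdma_exchange_send() */
    wait_for_ready();                   /* 1. wait for READY from receiver   */
    if (want_response)
        post_recv();                    /* 2. reserve space for the reply    */
    post_recv();                        /* 3. replace the used RQ entry      */
    post_send(header, data);            /* 4. SEND the command itself        */
    if (want_response)
        wait_for_completion();          /* 5. e.g. a 'Register result'       */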

All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. Also, the QEMUFile interfaces call these functions (described below)
   when transmitting non-live state, such as device state, or to send
   its own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side.
   (A sketch of this decision logic follows this list.)
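
The zero-page decision described in item 4 can be sketched as follows
(illustrative only; should_send_compress() and the simple buffer_is_zero()
here are stand-ins for the real QEMU routines):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool buffer_is_zero(const uint8_t *p, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (p[i]) {
                return false;
            }
        }
        return true;
    }

    /* Returns true if a "Compress" (zap zero page) command should be sent
     * instead of performing an RDMA Write for this page. */
    static bool should_send_compress(bool chunk_reg_enabled, bool chunk_registered,
                                     const uint8_t *page, size_t page_size,
                                     const uint8_t *chunk, size_t chunk_size)
    {
        if (!chunk_reg_enabled) {
            return false;                      /* written unconditionally     */
        }
        if (!chunk_registered) {
            return buffer_is_zero(page, page_size);  /* check this page only */
        }
        /* Chunk already registered: zap only if the *entire* chunk is zero. */
        return buffer_is_zero(chunk, chunk_size);
    }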

Versioning and Capabilities
===========================
Current version of the protocol is version #1.

The same version applies both to protocol traffic and to capabilities
negotiation. (i.e. there is only one version number that is referred to
by all communication).


librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
                                               uint32, network byte order
    * Flags   (bitwise OR of each capability),
                                               uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.
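
As a sketch, the private data area could therefore be as small as this
(the type and field names are illustrative; both values are stored in
network byte order):

    #include <stdint.h>

    typedef struct {
        uint32_t version;   /* protocol version, checked before any send/recv */
        uint32_t flags;     /* bitwise OR of the capability bits              */
    } RDMACapabilities;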

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page registration

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.
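
In other words, the destination simply masks the requested flags against
what it supports, roughly like this (a sketch; the capability bit name is
illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define RDMA_CAPABILITY_PIN_ALL 0x01   /* the only capability in version #1 */

    /* Destination: answer only with the capability bits we actually support. */
    static uint32_t negotiate_caps(uint32_t requested, uint32_t supported)
    {
        return requested & supported;
    }

    /* Source: a capability stays enabled only if its bit came back set. */
    static bool pin_all_enabled(uint32_t negotiated)
    {
        return negotiated & RDMA_CAPABILITY_PIN_ALL;
    }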

QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we handoff the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.
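
A simplified sketch of that holding-area logic is shown below
(illustrative only: the buffer size, the HoldingArea type and
refill_from_control_channel() are hypothetical, and the real logic lives
in the QEMUFileRDMA implementation):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t data[32 * 1024];   /* filled from "QEMU File" SEND messages */
        size_t  len;               /* bytes currently held                  */
        size_t  pos;               /* bytes already handed to get_buffer()  */
    } HoldingArea;

    /* Issues another "QEMU File" command and blocks for the next SEND. */
    extern void refill_from_control_channel(HoldingArea *h);

    static size_t rdma_get_buffer(HoldingArea *h, uint8_t *buf, size_t want)
    {
        if (h->pos == h->len) {
            refill_from_control_channel(h);    /* buffer empty: ask for more */
        }
        size_t n = want < (h->len - h->pos) ? want : (h->len - h->pos);
        memcpy(buf, h->data + h->pos, n);
        h->pos += n;                           /* leftovers stay for later   */
        return n;
    }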

Migration of VM's ram:
======================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, their
virtual addresses, and, if dynamic page registration was disabled
on the server-side, pre-registered RDMA keys.
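
As an illustration, the per-RAMBlock description exchanged here might look
roughly like the following (a sketch only; the actual wire format and
field names are defined in QEMU's RDMA migration code):

    #include <stdint.h>

    typedef struct {
        uint64_t remote_host_addr;  /* virtual address of the block on the server */
        uint64_t offset;            /* offset of the block within guest ram       */
        uint64_t length;            /* length of the RAMBlock in bytes            */
        uint32_t remote_rkey;       /* pre-registered key; only meaningful when   */
                                    /* rdma-pin-all was negotiated                */
        uint32_t padding;
    } RDMARemoteBlock;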

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.
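
The chunk arithmetic and batching can be sketched as follows (the
constants and helper names are illustrative; the real values are defined
in QEMU's RDMA migration code):

    #include <stdbool.h>
    #include <stdint.h>

    #define RDMA_CHUNK_SIZE      (1ULL << 20)   /* 1 MB chunks                  */
    #define RDMA_SIGNALED_BATCH  64ULL          /* signal roughly every 64th WR */

    /* Which chunk of its RAMBlock does this guest address fall into? */
    static uint64_t chunk_index(uint64_t block_start, uint64_t addr)
    {
        return (addr - block_start) / RDMA_CHUNK_SIZE;
    }

    /* Only the last RDMA Write in a batch asks the hardware to post a
     * completion to the CQ; all the others are left unsignaled. */
    static bool write_needs_signal(uint64_t writes_posted)
    {
        return (writes_posted % RDMA_SIGNALED_BATCH) == (RDMA_SIGNALED_BATCH - 1);
    }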

Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use for
RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely and
cleanup all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket is broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Also, some form of balloon-device usage tracking would help
   alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.