re-work SG dirty tracking (#672)

Move SG dirtying to vfu_unmap_sg(): as we don't want to track SGs
ourselves, doing this in vfu_map_sg() is no longer the right place.

Note that the lack of tracking implies that any SGs must be unmapped
before the final stop and copy phase. To avoid the need for this, add
vfu_mark_sg_dirty(): this allows a consumer to mark a region as dirty
explicitly without needing to unmap it. Currently it's the same as
vfu_unmap_sg(), but that's an implementation detail.

Note this still marks current maps after a get operation; that will
change subsequently.

Signed-off-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Thanos Makatos <thanos.makatos@nutanix.com>
4 files changed
tree: 726c6f38b8eeb23a28cf194afca0f39117aa034a
  1. .github/
  2. docs/
  3. include/
  4. lib/
  5. samples/
  6. test/
  7. .ctags
  8. .gitignore
  9. .gitmodules
  10. LICENSE
  11. Makefile
  12. meson.build
  13. meson_options.txt
  14. README.md
README.md

libvfio-user

vfio-user is a framework that allows implementing PCI devices in userspace. Clients (such as qemu) talk the vfio-user protocol over a UNIX socket to a server. This library, libvfio-user, provides an API for implementing such servers.

vfio-user example block diagram

VFIO is a kernel facility for providing secure access to PCI devices in userspace (including pass-through to a VM). With vfio-user, instead of talking to the kernel, all interactions are done in userspace, without requiring any kernel component; the kernel VFIO implementation is not used at all for a vfio-user device.

Put another way, vfio-user is to VFIO as vhost-user is to vhost.

The vfio-user protocol is intentionally modelled after the VFIO ioctl() interface, and shares many of its definitions. However, there is not an exact equivalence: for example, IOMMU groups are not represented in vfio-user.

There many different purposes you might put this library to, such as prototyping novel devices, testing frameworks, implementing alternatives to qemu's device emulation, adapting a device class to work over a network, etc.

The library abstracts most of the complexity around representing the device. Applications using libvfio-user provide a description of the device (eg. region and IRQ information) and as set of callbacks which are invoked by libvfio-user when those regions are accessed.

Memory Mapping the Device

The device driver can allow parts of the virtual device to be memory mapped by the virtual machine (e.g. the PCI BARs). The business logic needs to implement the mmap callback and reply to the request passing the memory address whose backing pages are then used to satisfy the original mmap call. Currently reading and writing of the memory mapped memory by the client goes undetected by libvfio-user, the business logic needs to poll. In the future we plan to implement a mechanism in order to provide notifications to libvfio-user whenever a page is written to.

Interrupts

Interrupts are implemented by passing the event file descriptor to libvfio-user and then notifying it about it. libvfio-user can then trigger interrupts simply by writing to it. This can be much more expensive compared to triggering interrupts from the kernel, however this performance penalty is perfectly acceptable when prototyping the functional aspect of a device driver.

Building libvfio-user

Build requirements:

  • meson (v0.53.0 or above)
  • apt install libjson-c-dev libcmocka-dev or
  • yum install json-c-devel libcmocka-devel

To build:

meson build
ninja -C build

The kernel headers are necessary because VFIO structs and defines are reused.

Finally build your program and link with libvfio-user.so.

Coverity

make coverity automatically uploads a new coverity build.

Supported features

With the client support found in cloud-hypervisor or the in-development qemu support, most guest VM use cases will work. See below for some details on how to try this out.

However, guests with an IOMMU (vIOMMU) will not currently work: the number of DMA regions is strictly limited, and there are also issues with some server implementations such as SPDK's virtual NVMe controller.

Currently, libvfio-user has explicit support for PCI devices only. In addition, only PCI endpoints are supported (no bridges etc.).

API

Currently there is one, single-threaded, library context per device, however the application can employ any form of concurrency needed. In the future we plan to make libvfio-user multi-thread safe.

The library (and the protocol) are actively under development, and should not yet be considered a stable API or interface.

The API is currently documented via the libvfio-user header file.

Mailing List & Chat

libvfio-user development is discussed in libvfio-user-devel@nongnu.org. Subscribe here: https://lists.gnu.org/mailman/listinfo/libvfio-user-devel.

We are on Slack at libvfio-user.slack.com (invite link); or IRC at #qemu on OFTC.

Contributing

Contributions are welcome; please file an issue or open a PR. Anything substantial is worth discussing with us first.

Please make sure to mark any commits with Signed-off-by (git commit -s), which signals agreement with the Developer Certificate of Origin v1.1.

Examples

The samples directory contains various libvfio-user samples.

lspci

lspci implements an example of how to dump the PCI header of a libvfio-user device and examine it with lspci(8):

# lspci -vv -F <(build/dbg/samples/lspci)
00:00.0 Non-VGA unclassified device: Device 0000:0000
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: I/O ports at <unassigned> [disabled]
        Region 1: I/O ports at <unassigned> [disabled]
        Region 2: I/O ports at <unassigned> [disabled]
        Region 3: I/O ports at <unassigned> [disabled]
        Region 4: I/O ports at <unassigned> [disabled]
        Region 5: I/O ports at <unassigned> [disabled]
        Capabilities: [40] Power Management version 0
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

The above sample implements a very simple PCI device that supports the Power Management PCI capability. The sample can be trivially modified to change the PCI configuration space header and add more PCI capabilities.

Client/Server Implementation

Client/server implements a basic client/server model where basic tasks are performed.

The server implements a device that can be programmed to trigger interrupts (INTx) to the client. This is done by writing the desired time in seconds since Epoch to BAR0. The server then triggers an eventfd-based IRQ and then a message-based one (in order to demonstrate how it‘s done when passing of file descriptors isn’t possible/desirable). The device also works as memory storage: BAR1 can be freely written to/read from by the host.

Since this is a completely made up device, there's no kernel driver (yet). Client implements a client that knows how to drive this particular device (that would normally be QEMU + guest VM + kernel driver).

The client excercises all commands in the vfio-user protocol, and then proceeds to perform live migration. The client spawns the destination server (this would be normally done by libvirt) and then migrates the device state, before switching entirely to the destination server. We re-use the source client instead of spawning a destination one as this is something libvirt/QEMU would normally do.

To spice things up, the client programs the source server to trigger an interrupt and then migrates to the destination server; the programmed interrupt is delivered by the destination server. Also, while the device is being live migrated, the client spawns a thread that constantly writes to BAR1 in a tight loop. This thread emulates the guest VM accessing the device while the main thread (what would normally be QEMU) is driving the migration.

Start the source server as follows (pick whatever you like for /tmp/vfio-user.sock):

rm -f /tmp/vfio-user.sock* ; build/dbg/samples/server -v /tmp/vfio-user.sock

And then the client:

build/dbg/samples/client /tmp/vfio-user.sock

After a couple of seconds the client will start live migration. The source server will exit and the destination server will start, watch the client terminal for destination server messages.

gpio

A gpio server implements a very simple GPIO device that can be used with a Linux VM.

First, download and build this branch of qemu.

Start the gpio server process:

rm /tmp/vfio-user.sock
./build/dbg/samples/gpio-pci-idio-16 -v /tmp/vfio-user.sock &

Create a Linux install image, or use a pre-made one. You‘ll probably also need to build the gpio-pci-idio-16 kernel module yourself - it’s part of the standard Linux kernel, but not usually built and shipped on x86. Start your guest VM:

./x86_64-softmmu/qemu-system-x86_64  -mem-prealloc -m 256 \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/gpio,share=yes,size=256M \
-numa node,memdev=ram-node0 \
-kernel ~/vmlinuz -initrd ~/initrd -nographic \
-append "console=ttyS0 root=/dev/sda1 single" \
-hda ~/bionic-server-cloudimg-amd64-0.raw \
-device vfio-user-pci,socket=/tmp/vfio-user.sock

Log in to your guest VM, and you should be able to load the module and observe the emulated GPIO device's pins:

insmod gpio-pci-idio-16.ko
cat /sys/class/gpio/gpiochip480/base > /sys/class/gpio/export
for ((i=0;i<12;i++)); do cat /sys/class/gpio/OUT0/value; done

Other devices based on libvfio-user:

  • nvmf/vfio-user: a virtual NVMe controller using SPDK, see docs/spdk.md for more details.

libvirt

  1. Add xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' to the domain element.

  2. Enable sharing of the guest's RAM:

  3. Pass the vfio-user device:

    qemu:commandline <qemu:arg value=‘-device’/> <qemu:arg value=‘vfio-user-pci,socket=/var/run/vfio-user.sock,x-enable-migration=on’/> </qemu:commandline>

Testing

See Testing.

History

This project was formerly known as “muser”, short for “Mediated Userspace Device”. It implemented a proof-of-concept VFIO mediated device in userspace. Normally, VFIO mdev devices require a kernel module; muser implemented a small kernel module that forwarded onto userspace. The old kernel-module-based implementation can be found in the kmod branch.

License

Copyright © 2019-2020 Nutanix Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of Nutanix nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.