| ============== |
| NVMe Emulation |
| ============== |
| |
| QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and |
| ``nvme-subsys`` devices. |
| |
| See the following sections for specific information on |
| |
| * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_. |
| * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_, |
| `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data |
| Protection`_. |
| |
| Adding NVMe Devices |
| =================== |
| |
| Controller Emulation |
| -------------------- |
| |
| The QEMU emulated NVMe controller implements version 1.4 of the NVM Express |
| specification. All mandatory features are implemented with a couple of exceptions |
| and limitations: |
| |
| * Accounting numbers in the SMART/Health log page are reset when the device |
| is power cycled. |
| * Interrupt Coalescing is not supported and is disabled by default. |
| |
| The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the |
| following parameters: |
| |
| .. code-block:: console |
| |
| -drive file=nvm.img,if=none,id=nvm |
| -device nvme,serial=deadbeef,drive=nvm |
| |
| There are a number of optional general parameters for the ``nvme`` device. Some |
| are mentioned here, but see ``-device nvme,help`` to list all possible |
| parameters. |
| |
| ``max_ioqpairs=UINT32`` (default: ``64``) |
| Set the maximum number of allowed I/O queue pairs. This replaces the |
| deprecated ``num_queues`` parameter. |
| |
| ``msix_qsize=UINT16`` (default: ``65``) |
| The number of MSI-X vectors that the device should support. |
| |
| ``mdts=UINT8`` (default: ``7``) |
| Set the Maximum Data Transfer Size of the device. |
| |
| ``use-intel-id`` (default: ``off``) |
| Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and |
| Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID |
| previously used. |
| |
| Additional Namespaces |
| --------------------- |
| |
| In the simplest possible invocation sketched above, the device only supports a |
| single namespace with the namespace identifier ``1``. To support multiple |
| namespaces and additional features, the ``nvme-ns`` device must be used. |
| |
| .. code-block:: console |
| |
| -device nvme,id=nvme-ctrl-0,serial=deadbeef |
| -drive file=nvm-1.img,if=none,id=nvm-1 |
| -device nvme-ns,drive=nvm-1 |
| -drive file=nvm-2.img,if=none,id=nvm-2 |
| -device nvme-ns,drive=nvm-2 |
| |
| The namespaces defined by the ``nvme-ns`` device will attach to the most |
| recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace |
| identifiers are allocated automatically, starting from ``1``. |
| |
| There are a number of parameters available: |
| |
| ``nsid`` (default: ``0``) |
| Explicitly set the namespace identifier. |
| |
| ``uuid`` (default: *autogenerated*) |
| Set the UUID of the namespace. This will be reported as a "Namespace UUID" |
| descriptor in the Namespace Identification Descriptor List. |
| |
| ``eui64`` |
| Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended |
| Unique Identifier" descriptor in the Namespace Identification Descriptor List. |
| Since machine type 6.1 a non-zero default value is used if the parameter |
| is not provided. For earlier machine types the field defaults to 0. |
| |
| ``bus`` |
| If more than one ``nvme`` device is defined, this parameter may be used to |
| attach the namespace to a specific ``nvme`` device (identified by an ``id`` |
| parameter on the controller device). |
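| |
| As a sketch of the ``bus`` parameter, the following shell snippet prints a |
| command-line fragment with two controllers, each owning one namespace. The |
| second controller id (``nvme-ctrl-1``), its serial, the image names and the |
| helper name ``qemu_nvme_args`` are made up for illustration. |
| |
```shell
# Hypothetical controller ids, serials and image names; adjust to your setup.
# Each namespace selects its controller through bus=<controller id>.
qemu_nvme_args() {
    printf '%s\n' \
        '-device nvme,id=nvme-ctrl-0,serial=deadbeef' \
        '-device nvme,id=nvme-ctrl-1,serial=cafebabe' \
        '-drive file=nvm-1.img,if=none,id=nvm-1' \
        '-device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0' \
        '-drive file=nvm-2.img,if=none,id=nvm-2' \
        '-device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1'
}

# Print the options, one per line, for inspection.
qemu_nvme_args
```
| |
| Without the ``bus`` parameters, both namespaces would attach to the most |
| recently defined controller instead. |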
| |
| NVM Subsystems |
| -------------- |
| |
| Additional features become available if the controller device (``nvme``) is |
| linked to an NVM Subsystem device (``nvme-subsys``). |
| |
| The NVM Subsystem emulation allows features such as shared namespaces and |
| multipath I/O. |
| |
| .. code-block:: console |
| |
| -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0 |
| -device nvme,serial=deadbeef,subsys=nvme-subsys-0 |
| -device nvme,serial=deadbeef,subsys=nvme-subsys-0 |
| |
| This will create an NVM subsystem with two controllers. Having controllers |
| linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters: |
| |
| ``shared`` (default: ``on`` since 6.2) |
| Specifies that the namespace will be attached to all controllers in the |
| subsystem. If set to ``off``, the namespace will remain a private namespace |
| and may only be attached to a single controller at a time. Shared namespaces |
| are always automatically attached to all controllers (also when controllers |
| are hotplugged). |
| |
| ``detached`` (default: ``off``) |
| If set to ``on``, the namespace will be available in the subsystem, but |
| not attached to any controllers initially. A shared namespace with this set |
| to ``on`` will never be automatically attached to controllers. |
| |
| Thus, adding |
| |
| .. code-block:: console |
| |
| -drive file=nvm-1.img,if=none,id=nvm-1 |
| -device nvme-ns,drive=nvm-1,nsid=1 |
| -drive file=nvm-2.img,if=none,id=nvm-2 |
| -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on |
| |
| will cause NSID 1 to be a shared namespace that is initially attached to both |
| controllers. NSID 3 will be a private namespace due to ``shared=off`` and only |
| attachable to a single controller at a time. Additionally, it will not be |
| attached to any controller initially (due to ``detached=on``) or to hotplugged |
| controllers. |
| |
| Optional Features |
| ================= |
| |
| Controller Memory Buffer |
| ------------------------ |
| |
| ``nvme`` device parameters related to the Controller Memory Buffer support: |
| |
| ``cmb_size_mb=UINT32`` (default: ``0``) |
| This adds a Controller Memory Buffer of the given size at offset zero in BAR |
| 2. |
| |
| ``legacy-cmb`` (default: ``off``) |
| By default, the device uses the "v1.4 scheme" for the Controller Memory |
| Buffer support (i.e., the CMB is initially disabled and must be explicitly |
| enabled by the host). Set this to ``on`` to behave as a v1.3 device with |
| respect to the CMB. |
| |
| Simple Copy |
| ----------- |
| |
| The device includes support for TP 4065 ("Simple Copy Command"). A number of |
| additional ``nvme-ns`` device parameters may be used to control the Copy |
| command limits: |
| |
| ``mssrl=UINT16`` (default: ``128``) |
| Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum |
| number of logical blocks that may be specified in each source range. |
| |
| ``mcl=UINT32`` (default: ``128``) |
| Set the Maximum Copy Length (``MCL``). This is the maximum number of logical |
| blocks that may be specified in a Copy command (the total for all source |
| ranges). |
| |
| ``msrc=UINT8`` (default: ``127``) |
| Set the Maximum Source Range Count (``MSRC``). This is the maximum number of |
| source ranges that may be used in a Copy command. This is a 0's based value. |
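| |
| How the three limits interact can be sketched as a small shell check. The |
| helper ``copy_fits`` is hypothetical and only mirrors the definitions above; |
| it is not how the device validates commands internally. |
| |
```shell
# copy_fits MSSRL MCL MSRC LEN...
# Succeeds if source ranges with the given lengths (in logical blocks)
# respect all three Copy command limits.
copy_fits() {
    mssrl=$1 mcl=$2 msrc=$3
    shift 3
    # MSRC is a 0's based value: N allows N + 1 source ranges.
    [ "$#" -le $((msrc + 1)) ] || return 1
    total=0
    for len in "$@"; do
        [ "$len" -le "$mssrl" ] || return 1   # per-range limit (MSSRL)
        total=$((total + len))
    done
    [ "$total" -le "$mcl" ]                   # whole-command limit (MCL)
}

# With the defaults (mssrl=128, mcl=128, msrc=127):
copy_fits 128 128 127 64 64 && echo "2 x 64 blocks: within limits"
copy_fits 128 128 127 128 1 || echo "128 + 1 blocks: exceeds mcl"
```
| |
| Note that with the defaults a single full-length source range already |
| exhausts ``mcl``, so raising ``mcl`` is needed for multi-range copies of |
| maximum-length ranges. |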
| |
| Zoned Namespaces |
| ---------------- |
| |
| A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set |
| ``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace. |
| |
| The namespace may be configured with additional parameters: |
| |
| ``zoned.zone_size=SIZE`` (default: ``128MiB``) |
| Define the zone size (``ZSZE``). |
| |
| ``zoned.zone_capacity=SIZE`` (default: ``0``) |
| Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone |
| capacity will equal the zone size. |
| |
| ``zoned.descr_ext_size=UINT32`` (default: ``0``) |
| Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64 |
| bytes. |
| |
| ``zoned.cross_read=BOOL`` (default: ``off``) |
| Set to ``on`` to allow reads to cross zone boundaries. |
| |
| ``zoned.max_active=UINT32`` (default: ``0``) |
| Set the maximum number of active resources (``MAR``). The default (``0``) |
| allows all zones to be active. |
| |
| ``zoned.max_open=UINT32`` (default: ``0``) |
| Set the maximum number of open resources (``MOR``). The default (``0``) |
| allows all zones to be open. If ``zoned.max_active`` is specified, this value |
| must be less than or equal to that. |
| |
| ``zoned.zasl=UINT8`` (default: ``0``) |
| Set the maximum data transfer size for the Zone Append command. Like |
| ``mdts``, the value is specified as a power of two (2^n) and is in units of |
| the minimum memory page size (CAP.MPSMIN). The default value (``0``) |
| has this property inherit the ``mdts`` value. |
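| |
| The power-of-two encoding shared by ``mdts`` and ``zasl`` can be worked |
| through in a few lines of shell. This is a sketch only: the helper names are |
| hypothetical, and the 4096-byte default stands in for CAP.MPSMIN, which is an |
| assumption about the minimum memory page size. |
| |
```shell
# xfer_limit_bytes N [MPSMIN_BYTES]
# Prints 2^N * MPSMIN in bytes. The 4096-byte default is an assumption.
# Only meaningful for N > 0; in the NVMe specification an MDTS of 0
# means that there is no limit.
xfer_limit_bytes() {
    echo $(( (1 << $1) * ${2:-4096} ))
}

# effective_zasl ZASL MDTS
# A zasl of 0 inherits the mdts value.
effective_zasl() {
    if [ "$1" -eq 0 ]; then echo "$2"; else echo "$1"; fi
}

mdts=7 zasl=0
echo "mdts limit: $(xfer_limit_bytes "$mdts") bytes"
echo "zasl limit: $(xfer_limit_bytes "$(effective_zasl "$zasl" "$mdts")") bytes"
```
| |
| Under these assumptions the default ``mdts=7`` corresponds to 128 pages of |
| 4 KiB, that is, 512 KiB per transfer. |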
| |
| Flexible Data Placement |
| ----------------------- |
| |
| The device may be configured to support TP 4146 ("Flexible Data Placement") by |
| configuring it (``fdp=on``) on the subsystem:: |
| |
| -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16 |
| |
| The subsystem emulates a single Endurance Group, on which Flexible Data |
| Placement will be supported. Also note that the device emulation deviates |
| slightly from the specification, by always enabling the "FDP Mode" feature on |
| the controller if the subsystem is configured for Flexible Data Placement. |
| |
| Enabling Flexible Data Placement on the subsystem enables the following |
| parameters: |
| |
| ``fdp.nrg`` (default: ``1``) |
| Set the number of Reclaim Groups. |
| |
| ``fdp.nruh`` (default: ``0``) |
| Set the number of Reclaim Unit Handles. This is a mandatory parameter and |
| must be non-zero. |
| |
| ``fdp.runs`` (default: ``96M``) |
| Set the Reclaim Unit Nominal Size. Defaults to 96 MiB. |
| |
| Namespaces within this subsystem may request Reclaim Unit Handles:: |
| |
| -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST |
| |
| The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may |
| include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified, |
| the controller will assign the controller-specified reclaim unit handle to |
| placement handle identifier 0. |
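| |
| To make the list notation concrete, the following shell sketch expands a |
| ``RUHLIST`` into individual reclaim unit handle indices. The helper |
| ``expand_ruhlist`` is hypothetical and only illustrates how the notation |
| reads; it is not the parser used by the device. |
| |
```shell
# expand_ruhlist RUHLIST
# Prints one reclaim unit handle index per line.
expand_ruhlist() {
    old_ifs=$IFS
    IFS=';'
    set -- $1          # split the list on semicolons
    IFS=$old_ifs
    for item in "$@"; do
        case $item in
            *-*)
                # A range such as 8-15 covers both end points.
                first=${item%-*}
                last=${item#*-}
                i=$first
                while [ "$i" -le "$last" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$item"
                ;;
        esac
    done
}

expand_ruhlist '0;8-15'
```
| |
| Since the list contains semicolons, quote the ``-device`` argument when |
| passing it through a shell. |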
| |
| Metadata |
| -------- |
| |
| The virtual namespace device supports LBA metadata in the form of separate |
| metadata (``MPTR``-based) and extended LBAs. |
| |
| ``ms=UINT16`` (default: ``0``) |
| Defines the number of metadata bytes per LBA. |
| |
| ``mset=UINT8`` (default: ``0``) |
| Set to ``1`` to enable extended LBAs. |
| |
| End-to-End Data Protection |
| -------------------------- |
| |
| The virtual namespace device supports DIF- and DIX-based protection information |
| (depending on ``mset``). |
| |
| ``pi=UINT8`` (default: ``0``) |
| Enable protection information of the specified type (type ``1``, ``2`` or |
| ``3``). |
| |
| ``pil=UINT8`` (default: ``0``) |
| Controls the location of the protection information within the metadata. Set |
| to ``1`` to transfer protection information as the first bytes of metadata. |
| Otherwise, the protection information is transferred as the last bytes of |
| metadata. |
| |
| ``pif=UINT8`` (default: ``0``) |
| By default, the namespace device uses 16 bit guard protection information |
| format (``pif=0``). Set to ``2`` to enable 64 bit guard protection |
| information format. This requires at least 16 bytes of metadata. Note that |
| ``pif=1`` (32 bit guards) is currently not supported. |
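| |
| The constraints stated above can be collected in a small pre-flight check. |
| This is a sketch under the assumption that only the rules listed in this |
| section are checked; the helper ``check_pi`` is hypothetical and the device |
| may enforce further requirements. |
| |
```shell
# check_pi MS PI PIF
# Succeeds if the combination satisfies the rules documented above.
check_pi() {
    ms=$1 pi=$2 pif=$3
    [ "$pi" -eq 0 ] && return 0      # protection information disabled
    case $pi in
        1|2|3) ;;
        *) echo "pi must be 1, 2 or 3" >&2; return 1 ;;
    esac
    case $pif in
        0) ;;                        # 16 bit guard (default)
        2) [ "$ms" -ge 16 ] || { echo "pif=2 needs ms >= 16" >&2; return 1; } ;;
        *) echo "pif=$pif is not supported" >&2; return 1 ;;
    esac
}

# Extended LBAs (mset=1) with 16 bytes of metadata and 64 bit guards.
check_pi 16 1 2 && echo "-device nvme-ns,drive=nvm-1,ms=16,mset=1,pi=1,pif=2"
```
| |
| The printed option reuses the ``nvm-1`` drive id from the earlier examples. |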
| |
| Virtualization Enhancements and SR-IOV (Experimental Support) |
| ------------------------------------------------------------- |
| |
| The ``nvme`` device supports Single Root I/O Virtualization and Sharing |
| along with Virtualization Enhancements. The controller has to be linked to |
| an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV. |
| |
| A number of parameters are present (**please note that they may be |
| subject to change**): |
| |
| ``sriov_max_vfs`` (default: ``0``) |
| Indicates the maximum number of PCIe virtual functions supported |
| by the controller. Specifying a non-zero value enables reporting of both |
| SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities |
| by the NVMe device. Virtual function controllers will not report SR-IOV. |
| |
| ``sriov_vq_flexible`` |
| Indicates the total number of flexible queue resources assignable to all |
| the secondary controllers. Implicitly sets the number of primary |
| controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``. |
| |
| ``sriov_vi_flexible`` |
| Indicates the total number of flexible interrupt resources assignable to |
| all the secondary controllers. Implicitly sets the number of primary |
| controller's private resources to ``(msix_qsize - sriov_vi_flexible)``. |
| |
| ``sriov_max_vi_per_vf`` (default: ``0``) |
| Indicates the maximum number of virtual interrupt resources assignable |
| to a secondary controller. The default ``0`` resolves to |
| ``(sriov_vi_flexible / sriov_max_vfs)``. |
| |
| ``sriov_max_vq_per_vf`` (default: ``0``) |
| Indicates the maximum number of virtual queue resources assignable to |
| a secondary controller. The default ``0`` resolves to |
| ``(sriov_vq_flexible / sriov_max_vfs)``. |
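| |
| The resource arithmetic described by these parameters can be worked through |
| in shell. The sketch below uses the documented defaults for ``max_ioqpairs`` |
| and ``msix_qsize`` together with small illustrative SR-IOV values; the |
| variable names for the results are hypothetical, and integer division is |
| assumed for the per-VF defaults. |
| |
```shell
# Documented nvme defaults plus illustrative SR-IOV settings.
max_ioqpairs=64 msix_qsize=65
sriov_max_vfs=1 sriov_vq_flexible=2 sriov_vi_flexible=1

# Private resources that remain with the primary controller.
private_vq=$((max_ioqpairs - sriov_vq_flexible))
private_vi=$((msix_qsize - sriov_vi_flexible))

# Per-VF maximums when sriov_max_vq_per_vf and sriov_max_vi_per_vf
# are left at their default of 0 (integer division assumed).
max_vq_per_vf=$((sriov_vq_flexible / sriov_max_vfs))
max_vi_per_vf=$((sriov_vi_flexible / sriov_max_vfs))

echo "primary private queue resources:     $private_vq"
echo "primary private interrupt resources: $private_vi"
echo "max queue resources per VF:          $max_vq_per_vf"
echo "max interrupt resources per VF:      $max_vi_per_vf"
```
| |
| With these values the primary controller keeps 62 queue and 64 interrupt |
| resources, and the single VF may be assigned up to 2 queue resources and 1 |
| interrupt resource. |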
| |
| The simplest possible invocation enables the capability to set up one VF |
| controller and assign an admin queue, an IO queue, and an MSI-X interrupt. |
| |
| .. code-block:: console |
| |
| -device nvme-subsys,id=subsys0 |
| -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1 |
| |
| The minimum steps required to configure a functional NVMe secondary |
| controller are: |
| |
| * unbind flexible resources from the primary controller |
| |
| .. code-block:: console |
| |
| nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0 |
| nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0 |
| |
| * perform a Function Level Reset on the primary controller to actually |
| release the resources |
| |
| .. code-block:: console |
| |
| echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset |
| |
| * enable VF |
| |
| .. code-block:: console |
| |
| echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs |
| |
| * assign the flexible resources to the VF and set it ONLINE |
| |
| .. code-block:: console |
| |
| nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1 |
| nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2 |
| nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0 |
| |
| * bind the NVMe driver to the VF |
| |
| .. code-block:: console |
| |
| echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind |