| QEMU Virtual NVDIMM |
| =================== |
| |
| This document explains the usage of virtual NVDIMM (vNVDIMM) feature |
| which is available since QEMU v2.6.0. |
| |
| The current QEMU only implements the persistent memory mode of vNVDIMM |
| device and not the block window mode. |
| |
| Basic Usage |
| ----------- |
| |
| The storage of a vNVDIMM device in QEMU is provided by the memory |
| backend (i.e. memory-backend-file and memory-backend-ram). A simple |
| way to create a vNVDIMM device at startup time is done via the |
| following command line options: |
| |
| -machine pc,nvdimm |
| -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE |
| -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE |
| -device nvdimm,id=nvdimm1,memdev=mem1 |
| |
| Where, |
| |
| - the "nvdimm" machine option enables vNVDIMM feature. |
| |
| - "slots=$N" should be equal to or larger than the total amount of |
| normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here. |
| |
| - "maxmem=$MAX_SIZE" should be equal to or larger than the total size |
| of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be |
| >= $RAM_SIZE + $NVDIMM_SIZE here. |
| |
| - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE" |
| creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All |
| accesses to the virtual NVDIMM device go to the file $PATH. |
| |
| "share=on/off" controls the visibility of guest writes. If |
| "share=on", then guest writes will be applied to the backend |
| file. If another guest uses the same backend file with option |
| "share=on", then above writes will be visible to it as well. If |
| "share=off", then guest writes won't be applied to the backend |
| file and thus will be invisible to other guests. |
| |
| - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM |
| device whose storage is provided by above memory backend device. |
| |
| Multiple vNVDIMM devices can be created if multiple pairs of "-object" |
| and "-device" are provided. |
| |
| For above command line options, if the guest OS has the proper NVDIMM |
| driver, it should be able to detect a NVDIMM device which is in the |
| persistent memory mode and whose size is $NVDIMM_SIZE. |
| |
| Note: |
| |
| 1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual |
| backend file size is not equal to the size given by "size" option, |
| QEMU will truncate the backend file by ftruncate(2), which will |
| corrupt the existing data in the backend file, especially for the |
| shrink case. |
| |
| QEMU v2.8.0 and later check the backend file size and the "size" |
| option. If they do not match, QEMU will report errors and abort in |
| order to avoid the data corruption. |
| |
| 2. QEMU v2.6.0 only puts a basic alignment requirement on the "size" |
| option of memory-backend-file, e.g. 4KB alignment on x86. However, |
| QEMU v.2.7.0 puts an additional alignment requirement, which may |
| require a larger value than the basic one, e.g. 2MB on x86. This |
| change breaks the usage of memory-backend-file that only satisfies |
| the basic alignment. |
| |
| QEMU v2.8.0 and later remove the additional alignment on non-s390x |
| architectures, so the broken memory-backend-file can work again. |
| |
| Label |
| ----- |
| |
| QEMU v2.7.0 and later implement the label support for vNVDIMM devices. |
| To enable label on vNVDIMM devices, users can simply add |
| "label-size=$SZ" option to "-device nvdimm", e.g. |
| |
| -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K |
| |
| Note: |
| |
| 1. The minimal label size is 128KB. |
| |
| 2. QEMU v2.7.0 and later store labels at the end of backend storage. |
| If a memory backend file, which was previously used as the backend |
| of a vNVDIMM device without labels, is now used for a vNVDIMM |
| device with label, the data in the label area at the end of file |
| will be inaccessible to the guest. If any useful data (e.g. the |
| meta-data of the file system) was stored there, the latter usage |
| may result guest data corruption (e.g. breakage of guest file |
| system). |
| |
| Hotplug |
| ------- |
| |
| QEMU v2.8.0 and later implement the hotplug support for vNVDIMM |
| devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is |
| accomplished by two monitor commands "object_add" and "device_add". |
| |
| For example, the following commands add another 4GB vNVDIMM device to |
| the guest: |
| |
| (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G |
| (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2 |
| |
| Note: |
| |
| 1. Each hotplugged vNVDIMM device consumes one memory slot. Users |
| should always ensure the memory option "-m ...,slots=N" specifies |
| enough number of slots, i.e. |
| N >= number of RAM devices + |
| number of statically plugged vNVDIMM devices + |
| number of hotplugged vNVDIMM devices |
| |
| 2. The similar is required for the memory option "-m ...,maxmem=M", i.e. |
| M >= size of RAM devices + |
| size of statically plugged vNVDIMM devices + |
| size of hotplugged vNVDIMM devices |
| |
| Alignment |
| --------- |
| |
| QEMU uses mmap(2) to maps vNVDIMM backends and aligns the mapping |
| address to the page size (getpagesize(2)) by default. However, some |
| types of backends may require an alignment different than the page |
| size. In that case, QEMU v2.12.0 and later provide 'align' option to |
| memory-backend-file to allow users to specify the proper alignment. |
| |
| For example, device dax require the 2 MB alignment, so we can use |
| following QEMU command line options to use it (/dev/dax0.0) as the |
| backend of vNVDIMM: |
| |
| -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M |
| -device nvdimm,id=nvdimm1,memdev=mem1 |
| |
| Guest Data Persistence |
| ---------------------- |
| |
| Though QEMU supports multiple types of vNVDIMM backends on Linux, |
| currently the only one that can guarantee the guest write persistence |
| is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to |
| which all guest access do not involve any host-side kernel cache. |
| |
| When using other types of backends, it's suggested to set 'unarmed' |
| option of '-device nvdimm' to 'on', which sets the unarmed flag of the |
| guest NVDIMM region mapping structure. This unarmed flag indicates |
| guest software that this vNVDIMM device contains a region that cannot |
| accept persistent writes. In result, for example, the guest Linux |
| NVDIMM driver, marks such vNVDIMM device as read-only. |
| |
| NVDIMM Persistence |
| ------------------ |
| |
| ACPI 6.2 Errata A added support for a new Platform Capabilities Structure |
| which allows the platform to communicate what features it supports related to |
| NVDIMM data persistence. Users can provide a persistence value to a guest via |
| the optional "nvdimm-persistence" machine command line option: |
| |
| -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu |
| |
| There are currently two valid values for this option: |
| |
| "mem-ctrl" - The platform supports flushing dirty data from the memory |
| controller to the NVDIMMs in the event of power loss. |
| |
| "cpu" - The platform supports flushing dirty data from the CPU cache to |
| the NVDIMMs in the event of power loss. This implies that the |
| platform also supports flushing dirty data through the memory |
| controller on power loss. |