| qcow2 L2/refcount cache configuration |
| ===================================== |
| Copyright (C) 2015, 2018-2020 Igalia, S.L. |
| Author: Alberto Garcia <berto@igalia.com> |
| |
| This work is licensed under the terms of the GNU GPL, version 2 or |
| later. See the COPYING file in the top-level directory. |
| |
| Introduction |
| ------------ |
| The QEMU qcow2 driver has two caches that can improve the I/O |
| performance significantly. However, setting the right cache sizes is |
| not a straightforward operation. |
| |
| This document attempts to give an overview of the L2 and refcount |
| caches, and how to configure them. |
| |
| Please refer to the docs/interop/qcow2.txt file for an in-depth |
| technical description of the qcow2 file format. |
| |
| |
| Clusters |
| -------- |
| A qcow2 file is organized in units of constant size called clusters. |
| |
| The cluster size is configurable, but it must be a power of two and |
| its value 512 bytes or higher. QEMU currently defaults to 64 KB |
| clusters, and it does not support sizes larger than 2MB. |
| |
| The 'qemu-img create' command supports specifying the size using the |
| cluster_size option: |
| |
| qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G |
| |
| |
| The L2 tables |
| ------------- |
| The qcow2 format uses a two-level structure to map the virtual disk as |
| seen by the guest to the disk image in the host. These structures are |
| called the L1 and L2 tables. |
| |
| There is one single L1 table per disk image. The table is small and is |
| always kept in memory. |
| |
| There can be many L2 tables, depending on how much space has been |
| allocated in the image. Each table is one cluster in size. In order to |
| read or write data from the virtual disk, QEMU needs to read its |
| corresponding L2 table to find out where that data is located. Since |
| reading the table for each I/O operation can be expensive, QEMU keeps |
| an L2 cache in memory to speed up disk access. |
| |
| The size of the L2 cache can be configured, and setting the right |
| value can improve the I/O performance significantly. |
| |
| |
| The refcount blocks |
| ------------------- |
| The qcow2 format also maintains a reference count for each cluster. |
| Reference counts are used for cluster allocation and internal |
| snapshots. The data is stored in a two-level structure similar to the |
| L1/L2 tables described above. |
| |
| The second level structures are called refcount blocks, are also one |
| cluster in size and the number is also variable and dependent on the |
| amount of allocated space. |
| |
| Each block contains a number of refcount entries. Their size (in bits) |
| is a power of two and must not be higher than 64. It defaults to 16 |
| bits, but a different value can be set using the refcount_bits option: |
| |
| qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G |
| |
| QEMU keeps a refcount cache to speed up I/O much like the |
| aforementioned L2 cache, and its size can also be configured. |
| |
| |
| Choosing the right cache sizes |
| ------------------------------ |
| In order to choose the cache sizes we need to know how they relate to |
| the amount of allocated space. |
| |
| The part of the virtual disk that can be mapped by the L2 and refcount |
| caches (in bytes) is: |
| |
| disk_size = l2_cache_size * cluster_size / 8 |
| disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits |
| |
| With the default values for cluster_size (64KB) and refcount_bits |
| (16), this becomes: |
| |
| disk_size = l2_cache_size * 8192 |
| disk_size = refcount_cache_size * 32768 |
| |
| So in order to cover n GB of disk space with the default values we |
| need: |
| |
| l2_cache_size = disk_size_GB * 131072 |
| refcount_cache_size = disk_size_GB * 32768 |
| |
| For example, 1MB of L2 cache is needed to cover every 8 GB of the virtual |
| image size (given that the default cluster size is used): |
| |
| 8 GB / 8192 = 1 MB |
| |
| The refcount cache is 4 times the cluster size by default. With the default |
| cluster size of 64 KB, it is 256 KB (262144 bytes). This is sufficient for |
| 8 GB of image size: |
| |
| 262144 * 32768 = 8 GB |
| |
| |
| How to configure the cache sizes |
| -------------------------------- |
| Cache sizes can be configured using the -drive option in the |
| command-line, or the 'blockdev-add' QMP command. |
| |
| There are three options available, and all of them take bytes: |
| |
| "l2-cache-size": maximum size of the L2 table cache |
| "refcount-cache-size": maximum size of the refcount block cache |
| "cache-size": maximum size of both caches combined |
| |
| There are a few things that need to be taken into account: |
| |
| - Both caches must have a size that is a multiple of the cluster size |
| (or the cache entry size: see "Using smaller cache sizes" below). |
| |
| - The maximum L2 cache size is 32 MB by default on Linux platforms (enough |
| for full coverage of 256 GB images, with the default cluster size). This |
| value can be modified using the "l2-cache-size" option. QEMU will not use |
| more memory than needed to hold all of the image's L2 tables, regardless |
| of this max. value. |
| On non-Linux platforms the maximal value is smaller by default (8 MB) and |
| this difference stems from the fact that on Linux the cache can be cleared |
| periodically if needed, using the "cache-clean-interval" option (see below). |
| The minimal L2 cache size is 2 clusters (or 2 cache entries, see below). |
| |
| - The default (and minimum) refcount cache size is 4 clusters. |
| |
| - If only "cache-size" is specified then QEMU will assign as much |
| memory as possible to the L2 cache before increasing the refcount |
| cache size. |
| |
| - At most two of "l2-cache-size", "refcount-cache-size", and "cache-size" |
| can be set simultaneously. |
| |
| Unlike L2 tables, refcount blocks are not used during normal I/O but |
| only during allocations and internal snapshots. In most cases they are |
| accessed sequentially (even during random guest I/O) so increasing the |
| refcount cache size won't have any measurable effect in performance |
| (this can change if you are using internal snapshots, so you may want |
| to think about increasing the cache size if you use them heavily). |
| |
| Before QEMU 2.12 the refcount cache had a default size of 1/4 of the |
| L2 cache size. This resulted in unnecessarily large caches, so now the |
| refcount cache is as small as possible unless overridden by the user. |
| |
| |
| Using smaller cache entries |
| --------------------------- |
| The qcow2 L2 cache can store complete tables. This means that if QEMU |
| needs an entry from an L2 table then the whole table is read from disk |
| and is kept in the cache. If the cache is full then a complete table |
| needs to be evicted first. |
| |
| This can be inefficient with large cluster sizes since it results in |
| more disk I/O and wastes more cache memory. |
| |
| Since QEMU 2.12 you can change the size of the L2 cache entry and make |
| it smaller than the cluster size. This can be configured using the |
| "l2-cache-entry-size" parameter: |
| |
| -drive file=hd.qcow2,l2-cache-size=2097152,l2-cache-entry-size=4096 |
| |
| Since QEMU 4.0 the value of l2-cache-entry-size defaults to 4KB (or |
| the cluster size if it's smaller). |
| |
| Some things to take into account: |
| |
| - The L2 cache entry size has the same restrictions as the cluster |
| size (power of two, at least 512 bytes). |
| |
| - Smaller entry sizes generally improve the cache efficiency and make |
| disk I/O faster. This is particularly true with solid state drives |
| so it's a good idea to reduce the entry size in those cases. With |
| rotating hard drives the situation is a bit more complicated so you |
| should test it first and stay with the default size if unsure. |
| |
| - Try different entry sizes to see which one gives faster performance |
| in your case. The block size of the host filesystem is generally a |
| good default (usually 4096 bytes in the case of ext4, hence the |
| default). |
| |
| - Only the L2 cache can be configured this way. The refcount cache |
| always uses the cluster size as the entry size. |
| |
| - If the L2 cache is big enough to hold all of the image's L2 tables |
| (as explained in the "Choosing the right cache sizes" and "How to |
| configure the cache sizes" sections in this document) then none of |
| this is necessary and you can omit the "l2-cache-entry-size" |
| parameter altogether. In this case QEMU makes the entry size |
| equal to the cluster size by default. |
| |
| |
| Reducing the memory usage |
| ------------------------- |
| It is possible to clean unused cache entries in order to reduce the |
| memory usage during periods of low I/O activity. |
| |
| The parameter "cache-clean-interval" defines an interval (in seconds), |
| after which all the cache entries that haven't been accessed during the |
| interval are removed from memory. Setting this parameter to 0 disables this |
| feature. |
| |
| The following example removes all unused cache entries every 15 minutes: |
| |
| -drive file=hd.qcow2,cache-clean-interval=900 |
| |
| If unset, the default value for this parameter is 600 on platforms which |
| support this functionality, and is 0 (disabled) on other platforms. |
| |
| This functionality currently relies on the MADV_DONTNEED argument for |
| madvise() to actually free the memory. This is a Linux-specific feature, |
| so cache-clean-interval is not supported on other systems. |
| |
| |
| Extended L2 Entries |
| ------------------- |
| All numbers shown in this document are valid for qcow2 images with normal |
| 64-bit L2 entries. |
| |
| Images with extended L2 entries need twice as much L2 metadata, so the L2 |
| cache size must be twice as large for the same disk space. |
| |
| disk_size = l2_cache_size * cluster_size / 16 |
| |
| i.e. |
| |
| l2_cache_size = disk_size * 16 / cluster_size |
| |
| Refcount blocks are not affected by this. |