qcow2 L2/refcount cache configuration
=====================================
Copyright (C) 2015, 2018-2020 Igalia, S.L.
Author: Alberto Garcia <berto@igalia.com>

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
------------
The QEMU qcow2 driver has two caches that can improve the I/O
performance significantly. However, setting the right cache sizes is
not a straightforward operation.

This document attempts to give an overview of the L2 and refcount
caches, and how to configure them.

Please refer to the docs/interop/qcow2.txt file for an in-depth
technical description of the qcow2 file format.


Clusters
--------
A qcow2 file is organized in units of constant size called clusters.

The cluster size is configurable, but it must be a power of two and
at least 512 bytes. QEMU currently defaults to 64 KB clusters, and it
does not support sizes larger than 2 MB.

The 'qemu-img create' command supports specifying the size using the
cluster_size option:

   qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G

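The cluster size constraints above can be checked with a few lines of
Python. This is only an illustrative sketch: the function and constant
names are hypothetical and are not part of QEMU or qemu-img.

```python
# Limits taken from the text above; the names are illustrative only.
MIN_CLUSTER_SIZE = 512               # bytes
MAX_CLUSTER_SIZE = 2 * 1024 * 1024   # 2 MB


def is_valid_cluster_size(size: int) -> bool:
    """Check the documented constraints: power of two, within range."""
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and MIN_CLUSTER_SIZE <= size <= MAX_CLUSTER_SIZE


if __name__ == "__main__":
    # 96 KB is not a power of two and 4 MB is above the supported maximum.
    for candidate in (512, 64 * 1024, 96 * 1024, 4 * 1024 * 1024):
        print(candidate, is_valid_cluster_size(candidate))
```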

The L2 tables
-------------
The qcow2 format uses a two-level structure to map the virtual disk as
seen by the guest to the disk image in the host. These structures are
called the L1 and L2 tables.

There is one single L1 table per disk image. The table is small and is
always kept in memory.

There can be many L2 tables, depending on how much space has been
allocated in the image. Each table is one cluster in size. In order to
read or write data from the virtual disk, QEMU needs to read its
corresponding L2 table to find out where that data is located. Since
reading the table for each I/O operation can be expensive, QEMU keeps
an L2 cache in memory to speed up disk access.

The size of the L2 cache can be configured, and setting the right
value can improve the I/O performance significantly.

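To get a feeling for how many L2 tables an image can have, the sketch
below estimates how much of the virtual disk a single L2 table maps. It
assumes standard 64-bit (8-byte) L2 entries, each mapping one cluster,
and a fully allocated image. The function names are hypothetical.

```python
L2_ENTRY_SIZE = 8  # bytes; assumes standard 64-bit L2 entries


def bytes_mapped_per_l2_table(cluster_size: int) -> int:
    """One L2 table is one cluster in size; each entry maps one cluster."""
    entries_per_table = cluster_size // L2_ENTRY_SIZE
    return entries_per_table * cluster_size


def l2_tables_needed(disk_size: int, cluster_size: int) -> int:
    """Number of L2 tables for a fully allocated image (rounded up)."""
    per_table = bytes_mapped_per_l2_table(cluster_size)
    return -(-disk_size // per_table)  # ceiling division


if __name__ == "__main__":
    cluster = 64 * 1024
    print(bytes_mapped_per_l2_table(cluster))         # bytes per L2 table
    print(l2_tables_needed(4 * 1024**3, cluster))     # tables for 4 GB
```

With 64 KB clusters each table maps 512 MB, so a fully allocated 4 GB
image needs 8 L2 tables.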

The refcount blocks
-------------------
The qcow2 format also maintains a reference count for each cluster.
Reference counts are used for cluster allocation and internal
snapshots. The data is stored in a two-level structure similar to the
L1/L2 tables described above.

The second-level structures are called refcount blocks. Each one is
also one cluster in size, and their number also depends on the amount
of allocated space.

Each block contains a number of refcount entries. Their size (in bits)
is a power of two and must not be higher than 64. It defaults to 16
bits, but a different value can be set using the refcount_bits option:

   qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G

QEMU keeps a refcount cache to speed up I/O much like the
aforementioned L2 cache, and its size can also be configured.

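As a rough illustration of the numbers involved, the following sketch
computes how many refcount entries fit in one refcount block. The
helper names are hypothetical, and the only rules encoded are the ones
stated above (entry width is a power of two, at most 64 bits).

```python
def is_valid_refcount_bits(bits: int) -> bool:
    """Entry width must be a power of two and not higher than 64."""
    return 1 <= bits <= 64 and (bits & (bits - 1)) == 0


def entries_per_refcount_block(cluster_size: int, refcount_bits: int = 16) -> int:
    """A refcount block is one cluster; each entry takes refcount_bits bits."""
    if not is_valid_refcount_bits(refcount_bits):
        raise ValueError("refcount_bits must be a power of two up to 64")
    return cluster_size * 8 // refcount_bits


if __name__ == "__main__":
    print(entries_per_refcount_block(64 * 1024))      # default 16-bit entries
    print(entries_per_refcount_block(64 * 1024, 8))   # refcount_bits=8
```

Halving refcount_bits doubles the number of clusters a single block can
describe.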

Choosing the right cache sizes
------------------------------
In order to choose the cache sizes we need to know how they relate to
the amount of allocated space.

The part of the virtual disk that can be mapped by the L2 and refcount
caches (in bytes) is:

   disk_size = l2_cache_size * cluster_size / 8
   disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits

With the default values for cluster_size (64 KB) and refcount_bits
(16), this becomes:

   disk_size = l2_cache_size * 8192
   disk_size = refcount_cache_size * 32768

So in order to cover n GB of disk space with the default values we
need (in bytes, with disk_size_GB = n):

   l2_cache_size = disk_size_GB * 131072
   refcount_cache_size = disk_size_GB * 32768

For example, 1 MB of L2 cache is needed to cover every 8 GB of the virtual
image size (given that the default cluster size is used):

   8 GB / 8192 = 1 MB

The refcount cache is 4 times the cluster size by default. With the default
cluster size of 64 KB, it is 256 KB (262144 bytes). This is sufficient for
8 GB of image size:

   262144 * 32768 = 8 GB

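The formulas above can be turned around to compute the cache sizes
needed for a given disk size. The sketch below does that in Python. The
function names are hypothetical, and it assumes standard 64-bit L2
entries and sizes that divide evenly.

```python
GB = 1024**3


def l2_cache_size_for(disk_size: int, cluster_size: int = 64 * 1024) -> int:
    """Invert disk_size = l2_cache_size * cluster_size / 8."""
    return disk_size * 8 // cluster_size


def refcount_cache_size_for(disk_size: int,
                            cluster_size: int = 64 * 1024,
                            refcount_bits: int = 16) -> int:
    """Invert disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits."""
    return disk_size * refcount_bits // (8 * cluster_size)


if __name__ == "__main__":
    for size_gb in (1, 8, 256):
        disk = size_gb * GB
        print(size_gb, l2_cache_size_for(disk), refcount_cache_size_for(disk))
```

With the defaults, 8 GB needs 1 MB of L2 cache and 256 KB of refcount
cache, matching the examples above.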

How to configure the cache sizes
--------------------------------
Cache sizes can be configured using the -drive option in the
command-line, or the 'blockdev-add' QMP command.

There are three options available, and all of them take a size in bytes:

"l2-cache-size":         maximum size of the L2 table cache
"refcount-cache-size":   maximum size of the refcount block cache
"cache-size":            maximum size of both caches combined

There are a few things that need to be taken into account:

 - Both caches must have a size that is a multiple of the cluster size
   (or the cache entry size: see "Using smaller cache entries" below).

 - The maximum L2 cache size is 32 MB by default on Linux platforms (enough
   for full coverage of 256 GB images, with the default cluster size). This
   value can be modified using the "l2-cache-size" option. QEMU will not use
   more memory than needed to hold all of the image's L2 tables, regardless
   of this maximum.
   On non-Linux platforms the maximum is smaller by default (8 MB). The
   difference stems from the fact that on Linux the cache can be cleared
   periodically if needed, using the "cache-clean-interval" option (see below).
   The minimum L2 cache size is 2 clusters (or 2 cache entries, see below).

 - The default (and minimum) refcount cache size is 4 clusters.

 - If only "cache-size" is specified then QEMU will assign as much
   memory as possible to the L2 cache before increasing the refcount
   cache size.

 - At most two of "l2-cache-size", "refcount-cache-size", and "cache-size"
   can be set simultaneously.

Unlike L2 tables, refcount blocks are not used during normal I/O but
only during allocations and internal snapshots. In most cases they are
accessed sequentially (even during random guest I/O) so increasing the
refcount cache size won't have any measurable effect on performance
(this can change if you are using internal snapshots, so you may want
to think about increasing the cache size if you use them heavily).

Before QEMU 2.12 the refcount cache had a default size of 1/4 of the
L2 cache size. This resulted in unnecessarily large caches, so now the
refcount cache is as small as possible unless overridden by the user.

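The rules listed in this section can be checked before passing values
to QEMU. The helper below is a hypothetical pre-flight check written
for this document, not QEMU's own validation code. It only encodes the
size constraints stated above, and treats the documented minimums as
hard limits, which is an assumption.

```python
from typing import List, Optional

MIN_L2_CACHE_UNITS = 2            # clusters, or cache entries if smaller
MIN_REFCOUNT_CACHE_CLUSTERS = 4   # default and minimum


def check_cache_options(cluster_size: int,
                        l2_cache_size: Optional[int] = None,
                        refcount_cache_size: Optional[int] = None,
                        cache_size: Optional[int] = None,
                        l2_cache_entry_size: Optional[int] = None) -> List[str]:
    """Return a list of rule violations (empty if none were found)."""
    problems: List[str] = []

    if None not in (l2_cache_size, refcount_cache_size, cache_size):
        problems.append("at most two of the three size options may be set")

    # The L2 cache is sized in units of the entry size, or clusters if unset.
    l2_unit = l2_cache_entry_size or cluster_size
    if l2_cache_size is not None:
        if l2_cache_size % l2_unit:
            problems.append("l2-cache-size is not a multiple of %d" % l2_unit)
        if l2_cache_size < MIN_L2_CACHE_UNITS * l2_unit:
            problems.append("l2-cache-size is below the minimum")

    if refcount_cache_size is not None:
        if refcount_cache_size % cluster_size:
            problems.append("refcount-cache-size is not a multiple of the cluster size")
        if refcount_cache_size < MIN_REFCOUNT_CACHE_CLUSTERS * cluster_size:
            problems.append("refcount-cache-size is below the minimum")

    return problems


if __name__ == "__main__":
    print(check_cache_options(64 * 1024, l2_cache_size=1024 * 1024))
    print(check_cache_options(64 * 1024, l2_cache_size=100000))
```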

Using smaller cache entries
---------------------------
The qcow2 L2 cache can store complete tables. This means that if QEMU
needs an entry from an L2 table then the whole table is read from disk
and is kept in the cache. If the cache is full then a complete table
needs to be evicted first.

This can be inefficient with large cluster sizes since it results in
more disk I/O and wastes more cache memory.

Since QEMU 2.12 you can change the size of the L2 cache entry and make
it smaller than the cluster size. This can be configured using the
"l2-cache-entry-size" parameter:

   -drive file=hd.qcow2,l2-cache-size=2097152,l2-cache-entry-size=4096

Since QEMU 4.0 the value of l2-cache-entry-size defaults to 4KB (or
the cluster size if it's smaller).

Some things to take into account:

 - The L2 cache entry size has the same restrictions as the cluster
   size (power of two, at least 512 bytes).

 - Smaller entry sizes generally improve the cache efficiency and make
   disk I/O faster. This is particularly true with solid state drives
   so it's a good idea to reduce the entry size in those cases. With
   rotating hard drives the situation is a bit more complicated so you
   should test it first and stay with the default size if unsure.

 - Try different entry sizes to see which one gives faster performance
   in your case. The block size of the host filesystem is generally a
   good default (usually 4096 bytes in the case of ext4, hence the
   default).

 - Only the L2 cache can be configured this way. The refcount cache
   always uses the cluster size as the entry size.

 - If the L2 cache is big enough to hold all of the image's L2 tables
   (as explained in the "Choosing the right cache sizes" and "How to
   configure the cache sizes" sections in this document) then none of
   this is necessary and you can omit the "l2-cache-entry-size"
   parameter altogether. In this case QEMU makes the entry size
   equal to the cluster size by default.

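To make the entry size arithmetic concrete, the sketch below works out
the default entry size described above and the number of entries that
fit in a cache of a given size. The function names are hypothetical.
The default shown is the general QEMU 4.0 rule and deliberately ignores
the special case where the cache covers the whole image.

```python
def default_l2_entry_size(cluster_size: int) -> int:
    """4 KB, or the cluster size if that is smaller."""
    return min(4096, cluster_size)


def l2_cache_entry_count(l2_cache_size: int, entry_size: int) -> int:
    """Number of cache entries; the cache must be a multiple of the entry size."""
    if l2_cache_size % entry_size:
        raise ValueError("l2-cache-size must be a multiple of the entry size")
    return l2_cache_size // entry_size


if __name__ == "__main__":
    # Values from the -drive example above.
    print(l2_cache_entry_count(2097152, 4096))
    print(default_l2_entry_size(64 * 1024), default_l2_entry_size(512))
```

With the values from the -drive example above, the 2 MB cache holds 512
entries of 4096 bytes each.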

Reducing the memory usage
-------------------------
It is possible to clean unused cache entries in order to reduce the
memory usage during periods of low I/O activity.

The parameter "cache-clean-interval" defines an interval (in seconds),
after which all the cache entries that haven't been accessed during the
interval are removed from memory. Setting this parameter to 0 disables this
feature.

The following example removes all unused cache entries every 15 minutes:

   -drive file=hd.qcow2,cache-clean-interval=900

If unset, the default value for this parameter is 600 on platforms which
support this functionality, and is 0 (disabled) on other platforms.

This functionality currently relies on the MADV_DONTNEED argument for
madvise() to actually free the memory. This is a Linux-specific feature,
so cache-clean-interval is not supported on other systems.


Extended L2 Entries
-------------------
All numbers shown in this document are valid for qcow2 images with normal
64-bit L2 entries.

Images with extended L2 entries need twice as much L2 metadata, so the L2
cache size must be twice as large for the same disk space.

   disk_size = l2_cache_size * cluster_size / 16

i.e.

   l2_cache_size = disk_size * 16 / cluster_size

Refcount blocks are not affected by this.
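
The effect of extended L2 entries on the cache size can be sketched as
follows. The function name and the extended_l2 flag are hypothetical
names chosen for this example. The entry sizes of 8 and 16 bytes come
from the formulas in this document.

```python
GB = 1024**3


def l2_cache_size_for(disk_size: int,
                      cluster_size: int = 64 * 1024,
                      extended_l2: bool = False) -> int:
    """L2 cache size (bytes) needed to cover disk_size bytes of virtual disk."""
    entry_size = 16 if extended_l2 else 8  # bytes per L2 entry
    return disk_size * entry_size // cluster_size


if __name__ == "__main__":
    print(l2_cache_size_for(8 * GB))                     # standard entries
    print(l2_cache_size_for(8 * GB, extended_l2=True))   # extended entries
```

For an 8 GB image with the default cluster size this gives 1 MB with
standard entries and 2 MB with extended entries.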