docs/interop/qed_spec.rst - qemu - Git at Google

 ===================================
 QED Image File Format Specification
 ===================================

 The file format looks like this::

  +----------+----------+----------+-----+
  | cluster0 | cluster1 | cluster2 | ... |
  +----------+----------+----------+-----+

 The first cluster begins with the ``header``. The header contains information
 about where regular clusters start; this allows the header to be extensible and
 store extra information about the image file. A regular cluster may be
 a ``data cluster``, an ``L2``, or an ``L1 table``. L1 and L2 tables are composed
 of one or more contiguous clusters.

 Normally the file size will be a multiple of the cluster size.  If the file size
 is not a multiple, extra information after the last cluster may not be preserved
 if data is written. Legitimate extra information should use space between the header
 and the first regular cluster.

 All fields are little-endian.

 Header
 ------

 ::

   Header {
      uint32_t magic;               /* QED\0 */

      uint32_t cluster_size;        /* in bytes */
      uint32_t table_size;          /* for L1 and L2 tables, in clusters */
      uint32_t header_size;         /* in clusters */

      uint64_t features;            /* format feature bits */
      uint64_t compat_features;     /* compat feature bits */
      uint64_t autoclear_features;  /* self-resetting feature bits */

      uint64_t l1_table_offset;     /* in bytes */
      uint64_t image_size;          /* total logical image size, in bytes */

      /* if (features & QED_F_BACKING_FILE) */
      uint32_t backing_filename_offset; /* in bytes from start of header */
      uint32_t backing_filename_size;   /* in bytes */
   }

 Field descriptions:
 ~~~~~~~~~~~~~~~~~~~

 - ``cluster_size`` must be a power of 2 in range [2^12, 2^26].
 - ``table_size`` must be a power of 2 in range [1, 16].
 - ``header_size`` is the number of clusters used by the header and any additional
   information stored before regular clusters.
 - ``features``, ``compat_features``, and ``autoclear_features`` are file format
   extension bitmaps. They work as follows:

   - An image with unknown ``features`` bits enabled must not be opened. File format
     changes that are not backwards-compatible must use ``features`` bits.
   - An image with unknown ``compat_features`` bits enabled can be opened safely.
     The unknown features are simply ignored and represent backwards-compatible
     changes to the file format.
   - An image with unknown ``autoclear_features`` bits enable can be opened safely
     after clearing the unknown bits. This allows for backwards-compatible changes
     to the file format which degrade gracefully and can be re-enabled again by a
     new program later.
 - ``l1_table_offset`` is the offset of the first byte of the L1 table in the image
   file and must be a multiple of ``cluster_size``.
 - ``image_size`` is the block device size seen by the guest and must be a multiple
   of 512 bytes.
 - ``backing_filename_offset`` and ``backing_filename_size`` describe a string in
   (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.
   The string must be stored within the first ``header_size`` clusters. The backing filename
   may be an absolute path or relative to the image file.

 Feature bits:
 ~~~~~~~~~~~~~

 - ``QED_F_BACKING_FILE = 0x01``. The image uses a backing file.
 - ``QED_F_NEED_CHECK = 0x02``. The image needs a consistency check before use.
 - ``QED_F_BACKING_FORMAT_NO_PROBE = 0x04``. The backing file is a raw disk image
   and no file format autodetection should be attempted.  This should be used to
   ensure that raw backing files are never detected as an image format if they happen
   to contain magic constants.

 There are currently no defined ``compat_features`` or ``autoclear_features`` bits.

 Fields predicated on a feature bit are only used when that feature is set.
 The fields always take up header space, regardless of whether or not the feature
 bit is set.

 Tables
 ------

 Tables provide the translation from logical offsets in the block device to cluster
 offsets in the file.

 ::

  #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))

  Table {
      uint64_t offsets[TABLE_NOFFSETS];
  }

 The tables are organized as follows::

                     +----------+
                     | L1 table |
                     +----------+
                ,------'  |  '------.
           +----------+   |    +----------+
           | L2 table |  ...   | L2 table |
           +----------+        +----------+
       ,------'  |  '------.
  +----------+   |    +----------+
  |   Data   |  ...   |   Data   |
  +----------+        +----------+

 A table is made up of one or more contiguous clusters.  The ``table_size`` header
 field determines table size for an image file. For example, ``cluster_size=64 KB``
 and ``table_size=4`` results in 256 KB tables.

 The logical image size must be less than or equal to the maximum possible size of
 clusters rooted by the L1 table:

 .. code::

  header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size

 L1, L2, and data cluster offsets must be aligned to ``header.cluster_size``.
 The following offsets have special meanings:

 L2 table offsets
 ~~~~~~~~~~~~~~~~

 - 0 - unallocated. The L2 table is not yet allocated.

 Data cluster offsets
 ~~~~~~~~~~~~~~~~~~~~

 - 0 - unallocated.  The data cluster is not yet allocated.
 - 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.

 Future format extensions may wish to store per-offset information. The least
 significant 12 bits of an offset are reserved for this purpose and must be set
 to zero. Image files with ``cluster_size`` > 2^12 will have more unused bits
 which should also be zeroed.

 Unallocated L2 tables and data clusters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Reads to an unallocated area of the image file access the backing file. If there
 is no backing file, then zeroes are produced. The backing file may be smaller
 than the image file and reads of unallocated areas beyond the end of the backing
 file produce zeroes.

 Writes to an unallocated area cause a new data clusters to be allocated, and a new
 L2 table if that is also unallocated. The new data cluster is populated with data
 from the backing file (or zeroes if no backing file) and the data being written.

 Zero data clusters
 ~~~~~~~~~~~~~~~~~~

 Zero data clusters are a space-efficient way of storing zeroed regions of the image.

 Reads to a zero data cluster produce zeroes.

 .. note::
     The difference between an unallocated and a zero data cluster is that zero data
     clusters stop the reading of contents from the backing file.

 Writes to a zero data cluster cause a new data cluster to be allocated.  The new
 data cluster is populated with zeroes and the data being written.

 Logical offset translation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

 Logical offsets are translated into cluster offsets as follows::

   table_bits table_bits    cluster_bits
   <--------> <--------> <--------------->
  +----------+----------+-----------------+
  | L1 index | L2 index |     byte offset |
  +----------+----------+-----------------+

        Structure of a logical offset

  offset_mask = ~(cluster_size - 1) # mask for the image file byte offset

  def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
    l2_offset = l1_table[l1_index]
    l2_table = load_table(l2_offset)
    cluster_offset = l2_table[l2_index] & offset_mask
    return cluster_offset + byte_offset

 Consistency checking
 --------------------

 This section is informational and included to provide background on the use
 of the ``QED_F_NEED_CHECK features`` bit.

 The ``QED_F_NEED_CHECK`` bit is used to mark an image as dirty before starting
 an operation that could leave the image in an inconsistent state if interrupted
 by a crash or power failure.  A dirty image must be checked on open because its
 metadata may not be consistent.

 Consistency check includes the following invariants:

 - Each cluster is referenced once and only once. It is an inconsistency to have
   a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked
   if it has no references.
 - Offsets must be within the image file size and must be ``cluster_size`` aligned.
 - Table offsets must at least ``table_size`` * ``cluster_size`` bytes from the end
   of the image file so that there is space for the entire table.

 The consistency check process starts from ``l1_table_offset`` and scans all L2 tables.
 After the check completes with no other errors besides leaks, the ``QED_F_NEED_CHECK``
 bit can be cleared and the image can be accessed.
	===================================
	QED Image File Format Specification
	===================================

	The file format looks like this::

	+----------+----------+----------+-----+
	\| cluster0 \| cluster1 \| cluster2 \| ... \|
	+----------+----------+----------+-----+

	The first cluster begins with the ``header``. The header contains information
	about where regular clusters start; this allows the header to be extensible and
	store extra information about the image file. A regular cluster may be
	a ``data cluster``, an ``L2``, or an ``L1 table``. L1 and L2 tables are composed
	of one or more contiguous clusters.

	Normally the file size will be a multiple of the cluster size. If the file size
	is not a multiple, extra information after the last cluster may not be preserved
	if data is written. Legitimate extra information should use space between the header
	and the first regular cluster.

	All fields are little-endian.

	Header
	------

	::

	Header {
	uint32_t magic; /* QED\0 */

	uint32_t cluster_size; /* in bytes */
	uint32_t table_size; /* for L1 and L2 tables, in clusters */
	uint32_t header_size; /* in clusters */

	uint64_t features; /* format feature bits */
	uint64_t compat_features; /* compat feature bits */
	uint64_t autoclear_features; /* self-resetting feature bits */

	uint64_t l1_table_offset; /* in bytes */
	uint64_t image_size; /* total logical image size, in bytes */

	/* if (features & QED_F_BACKING_FILE) */
	uint32_t backing_filename_offset; /* in bytes from start of header */
	uint32_t backing_filename_size; /* in bytes */
	}

	Field descriptions:
	~~~~~~~~~~~~~~~~~~~

	- ``cluster_size`` must be a power of 2 in range [2^12, 2^26].
	- ``table_size`` must be a power of 2 in range [1, 16].
	- ``header_size`` is the number of clusters used by the header and any additional
	information stored before regular clusters.
	- ``features``, ``compat_features``, and ``autoclear_features`` are file format
	extension bitmaps. They work as follows:

	- An image with unknown ``features`` bits enabled must not be opened. File format
	changes that are not backwards-compatible must use ``features`` bits.
	- An image with unknown ``compat_features`` bits enabled can be opened safely.
	The unknown features are simply ignored and represent backwards-compatible
	changes to the file format.
	- An image with unknown ``autoclear_features`` bits enable can be opened safely
	after clearing the unknown bits. This allows for backwards-compatible changes
	to the file format which degrade gracefully and can be re-enabled again by a
	new program later.
	- ``l1_table_offset`` is the offset of the first byte of the L1 table in the image
	file and must be a multiple of ``cluster_size``.
	- ``image_size`` is the block device size seen by the guest and must be a multiple
	of 512 bytes.
	- ``backing_filename_offset`` and ``backing_filename_size`` describe a string in
	(byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.
	The string must be stored within the first ``header_size`` clusters. The backing filename
	may be an absolute path or relative to the image file.

	Feature bits:
	~~~~~~~~~~~~~

	- ``QED_F_BACKING_FILE = 0x01``. The image uses a backing file.
	- ``QED_F_NEED_CHECK = 0x02``. The image needs a consistency check before use.
	- ``QED_F_BACKING_FORMAT_NO_PROBE = 0x04``. The backing file is a raw disk image
	and no file format autodetection should be attempted. This should be used to
	ensure that raw backing files are never detected as an image format if they happen
	to contain magic constants.

	There are currently no defined ``compat_features`` or ``autoclear_features`` bits.

	Fields predicated on a feature bit are only used when that feature is set.
	The fields always take up header space, regardless of whether or not the feature
	bit is set.

	Tables
	------

	Tables provide the translation from logical offsets in the block device to cluster
	offsets in the file.

	::

	#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))

	Table {
	uint64_t offsets[TABLE_NOFFSETS];
	}

	The tables are organized as follows::

	+----------+
	\| L1 table \|
	+----------+
	,------' \| '------.
	+----------+ \| +----------+
	\| L2 table \| ... \| L2 table \|
	+----------+ +----------+
	,------' \| '------.
	+----------+ \| +----------+
	\| Data \| ... \| Data \|
	+----------+ +----------+

	A table is made up of one or more contiguous clusters. The ``table_size`` header
	field determines table size for an image file. For example, ``cluster_size=64 KB``
	and ``table_size=4`` results in 256 KB tables.

	The logical image size must be less than or equal to the maximum possible size of
	clusters rooted by the L1 table:

	.. code::

	header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size

	L1, L2, and data cluster offsets must be aligned to ``header.cluster_size``.
	The following offsets have special meanings:

	L2 table offsets
	~~~~~~~~~~~~~~~~

	- 0 - unallocated. The L2 table is not yet allocated.

	Data cluster offsets
	~~~~~~~~~~~~~~~~~~~~

	- 0 - unallocated. The data cluster is not yet allocated.
	- 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.

	Future format extensions may wish to store per-offset information. The least
	significant 12 bits of an offset are reserved for this purpose and must be set
	to zero. Image files with ``cluster_size`` > 2^12 will have more unused bits
	which should also be zeroed.

	Unallocated L2 tables and data clusters
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Reads to an unallocated area of the image file access the backing file. If there
	is no backing file, then zeroes are produced. The backing file may be smaller
	than the image file and reads of unallocated areas beyond the end of the backing
	file produce zeroes.

	Writes to an unallocated area cause a new data clusters to be allocated, and a new
	L2 table if that is also unallocated. The new data cluster is populated with data
	from the backing file (or zeroes if no backing file) and the data being written.

	Zero data clusters
	~~~~~~~~~~~~~~~~~~

	Zero data clusters are a space-efficient way of storing zeroed regions of the image.

	Reads to a zero data cluster produce zeroes.

	.. note::
	The difference between an unallocated and a zero data cluster is that zero data
	clusters stop the reading of contents from the backing file.

	Writes to a zero data cluster cause a new data cluster to be allocated. The new
	data cluster is populated with zeroes and the data being written.

	Logical offset translation
	~~~~~~~~~~~~~~~~~~~~~~~~~~

	Logical offsets are translated into cluster offsets as follows::

	table_bits table_bits cluster_bits
	<--------> <--------> <--------------->
	+----------+----------+-----------------+
	\| L1 index \| L2 index \| byte offset \|
	+----------+----------+-----------------+

	Structure of a logical offset

	offset_mask = ~(cluster_size - 1) # mask for the image file byte offset

	def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
	l2_offset = l1_table[l1_index]
	l2_table = load_table(l2_offset)
	cluster_offset = l2_table[l2_index] & offset_mask
	return cluster_offset + byte_offset

	Consistency checking
	--------------------

	This section is informational and included to provide background on the use
	of the ``QED_F_NEED_CHECK features`` bit.

	The ``QED_F_NEED_CHECK`` bit is used to mark an image as dirty before starting
	an operation that could leave the image in an inconsistent state if interrupted
	by a crash or power failure. A dirty image must be checked on open because its
	metadata may not be consistent.

	Consistency check includes the following invariants:

	- Each cluster is referenced once and only once. It is an inconsistency to have
	a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked
	if it has no references.
	- Offsets must be within the image file size and must be ``cluster_size`` aligned.
	- Table offsets must at least ``table_size`` * ``cluster_size`` bytes from the end
	of the image file so that there is space for the entire table.

	The consistency check process starts from ``l1_table_offset`` and scans all L2 tables.
	After the check completes with no other errors besides leaks, the ``QED_F_NEED_CHECK``
	bit can be cleared and the image can be accessed.