docs/devel/migration/compatibility.rst - qemu - Git at Google

 Backwards compatibility
 =======================

 How backwards compatibility works
 ---------------------------------

 When we do migration, we have two QEMU processes: the source and the
 target.  There are two cases, they are the same version or they are
 different versions.  The easy case is when they are the same version.
 The difficult one is when they are different versions.

 There are two things that are different, but they have very similar
 names and sometimes get confused:

 - QEMU version
 - machine type version

 Let's start with a practical example, we start with:

 - qemu-system-x86_64 (v5.2), from now on qemu-5.2.
 - qemu-system-x86_64 (v5.1), from now on qemu-5.1.

 Related to this are the "latest" machine types defined on each of
 them:

 - pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
 - pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1

 First of all, migration is only supposed to work if you use the same
 machine type in both source and destination. The QEMU hardware
 configuration needs to be the same also on source and destination.
 Most aspects of the backend configuration can be changed at will,
 except for a few cases where the backend features influence frontend
 device feature exposure.  But that is not relevant for this section.

 I am going to list the number of combinations that we can have.  Let's
 start with the trivial ones, QEMU is the same on source and
 destination:

 1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2

   This is the latest QEMU with the latest machine type.
   This have to work, and if it doesn't work it is a bug.

 2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

   Exactly the same case than the previous one, but for 5.1.
   Nothing to see here either.

 This are the easiest ones, we will not talk more about them in this
 section.

 Now we start with the more interesting cases.  Consider the case where
 we have the same QEMU version in both sides (qemu-5.2) but we are using
 the latest machine type for that version (pc-5.2) but one of an older
 QEMU version, in this case pc-5.1.

 3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

   It needs to use the definition of pc-5.1 and the devices as they
   were configured on 5.1, but this should be easy in the sense that
   both sides are the same QEMU and both sides have exactly the same
   idea of what the pc-5.1 machine is.

 4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2

   This combination is not possible as the qemu-5.1 doesn't understand
   pc-5.2 machine type.  So nothing to worry here.

 Now it comes the interesting ones, when both QEMU processes are
 different.  Notice also that the machine type needs to be pc-5.1,
 because we have the limitation than qemu-5.1 doesn't know pc-5.2.  So
 the possible cases are:

 5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

   This migration is known as newer to older.  We need to make sure
   when we are developing 5.2 we need to take care about not to break
   migration to qemu-5.1.  Notice that we can't make updates to
   qemu-5.1 to understand whatever qemu-5.2 decides to change, so it is
   in qemu-5.2 side to make the relevant changes.

 6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

   This migration is known as older to newer.  We need to make sure
   than we are able to receive migrations from qemu-5.1. The problem is
   similar to the previous one.

 If qemu-5.1 and qemu-5.2 were the same, there will not be any
 compatibility problems.  But the reason that we create qemu-5.2 is to
 get new features, devices, defaults, etc.

 If we get a device that has a new feature, or change a default value,
 we have a problem when we try to migrate between different QEMU
 versions.

 So we need a way to tell qemu-5.2 that when we are using machine type
 pc-5.1, it needs to **not** use the feature, to be able to migrate to
 real qemu-5.1.

 And the equivalent part when migrating from qemu-5.1 to qemu-5.2.
 qemu-5.2 has to expect that it is not going to get data for the new
 feature, because qemu-5.1 doesn't know about it.

 How do we tell QEMU about these device feature changes?  In
 hw/core/machine.c:hw_compat_X_Y arrays.

 If we change a default value, we need to put back the old value on
 that array.  And the device, during initialization needs to look at
 that array to see what value it needs to get for that feature.  And
 what are we going to put in that array, the value of a property.

 To create a property for a device, we need to use one of the
 DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
 macros that exist.  With it, we set the default value for that
 property, and that is what it is going to get in the latest released
 version.  But if we want a different value for a previous version, we
 can change that in the hw_compat_X_Y arrays.

 hw_compat_X_Y is an array of registers that have the format:

 - name_device
 - name_property
 - value

 Let's see a practical example.

 In qemu-5.2 virtio-blk-device got multi queue support.  This is a
 change that is not backward compatible.  In qemu-5.1 it has one
 queue. In qemu-5.2 it has the same number of queues as the number of
 cpus in the system.

 When we are doing migration, if we migrate from a device that has 4
 queues to a device that have only one queue, we don't know where to
 put the extra information for the other 3 queues, and we fail
 migration.

 Similar problem when we migrate from qemu-5.1 that has only one queue
 to qemu-5.2, we only sent information for one queue, but destination
 has 4, and we have 3 queues that are not properly initialized and
 anything can happen.

 So, how can we address this problem.  Easy, just convince qemu-5.2
 that when it is running pc-5.1, it needs to set the number of queues
 for virtio-blk-devices to 1.

 That way we fix the cases 5 and 6.

 5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

     qemu-5.2 -M pc-5.1 sets number of queues to be 1.
     qemu-5.1 -M pc-5.1 expects number of queues to be 1.

     correct.  migration works.

 6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

     qemu-5.1 -M pc-5.1 sets number of queues to be 1.
     qemu-5.2 -M pc-5.1 expects number of queues to be 1.

     correct.  migration works.

 And now the other interesting case, case 3.  In this case we have:

 3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

     Here we have the same QEMU in both sides.  So it doesn't matter a
     lot if we have set the number of queues to 1 or not, because
     they are the same.

     WRONG!

     Think what happens if we do one of this double migrations:

     A -> migrates -> B -> migrates -> C

     where:

     A: qemu-5.1 -M pc-5.1
     B: qemu-5.2 -M pc-5.1
     C: qemu-5.2 -M pc-5.1

     migration A -> B is case 6, so number of queues needs to be 1.

     migration B -> C is case 3, so we don't care.  But actually we
     care because we haven't started the guest in qemu-5.2, it came
     migrated from qemu-5.1.  So to be in the safe place, we need to
     always use number of queues 1 when we are using pc-5.1.

 Now, how was this done in reality?  The following commit shows how it
 was done::

   commit 9445e1e15e66c19e42bea942ba810db28052cd05
   Author: Stefan Hajnoczi <stefanha@redhat.com>
   Date:   Tue Aug 18 15:33:47 2020 +0100

   virtio-blk-pci: default num_queues to -smp N

 The relevant parts for migration are::

     @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
      #endif
          DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                          true),
     -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
     +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
     +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
          DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

 It changes the default value of num_queues.  But it fishes it for old
 machine types to have the right value::

     @@ -31,6 +31,7 @@
      GlobalProperty hw_compat_5_1[] = {
          ...
     +    { "virtio-blk-device", "num-queues", "1"},
          ...
      };

 A device with different features on both sides
 ----------------------------------------------

 Let's assume that we are using the same QEMU binary on both sides,
 just to make the things easier.  But we have a device that has
 different features on both sides of the migration.  That can be
 because the devices are different, because the kernel driver of both
 devices have different features, whatever.

 How can we get this to work with migration.  The way to do that is
 "theoretically" easy.  You have to get the features that the device
 has in the source of the migration.  The features that the device has
 on the target of the migration, you get the intersection of the
 features of both sides, and that is the way that you should launch
 QEMU.

 Notice that this is not completely related to QEMU.  The most
 important thing here is that this should be handled by the managing
 application that launches QEMU.  If QEMU is configured correctly, the
 migration will succeed.

 That said, actually doing it is complicated.  Almost all devices are
 bad at being able to be launched with only some features enabled.
 With one big exception: cpus.

 You can read the documentation for QEMU x86 cpu models here:

 https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

 See when they talk about migration they recommend that one chooses the
 newest cpu model that is supported for all cpus.

 Let's say that we have:

 Host A:

 Device X has the feature Y

 Host B:

 Device X has not the feature Y

 If we try to migrate without any care from host A to host B, it will
 fail because when migration tries to load the feature Y on
 destination, it will find that the hardware is not there.

 Doing this would be the equivalent of doing with cpus:

 Host A:

 $ qemu-system-x86_64 -cpu host

 Host B:

 $ qemu-system-x86_64 -cpu host

 When both hosts have different cpu features this is guaranteed to
 fail.  Especially if Host B has less features than host A.  If host A
 has less features than host B, sometimes it works.  Important word of
 last sentence is "sometimes".

 So, forgetting about cpu models and continuing with the -cpu host
 example, let's see that the differences of the cpus is that Host A and
 B have the following features:

 Features:   'pcid'  'stibp' 'taa-no'
 Host A:        X       X
 Host B:                        X

 And we want to migrate between them, the way configure both QEMU cpu
 will be:

 Host A:

 $ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

 Host B:

 $ qemu-system-x86_64 -cpu host,taa-no=off

 And you would be able to migrate between them.  It is responsibility
 of the management application or of the user to make sure that the
 configuration is correct.  QEMU doesn't know how to look at this kind
 of features in general.

 Notice that we don't recommend to use -cpu host for migration.  It is
 used in this example because it makes the example simpler.

 Other devices have worse control about individual features.  If they
 want to be able to migrate between hosts that show different features,
 the device needs a way to configure which ones it is going to use.

 In this section we have considered that we are using the same QEMU
 binary in both sides of the migration.  If we use different QEMU
 versions process, then we need to have into account all other
 differences and the examples become even more complicated.

 How to mitigate when we have a backward compatibility error
 -----------------------------------------------------------

 We broke migration for old machine types continuously during
 development.  But as soon as we find that there is a problem, we fix
 it.  The problem is what happens when we detect after we have done a
 release that something has gone wrong.

 Let see how it worked with one example.

 After the release of qemu-8.0 we found a problem when doing migration
 of the machine type pc-7.2.

 - $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

   This migration works

 - $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

   This migration works

 - $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

   This migration fails

 - $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

   This migration fails

 So clearly something fails when migration between qemu-7.2 and
 qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
 pointed to this commit.

 In qemu-8.0 we got this commit::

     commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
     Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
     Date:   Thu Mar 2 13:37:02 2023 +0000

     hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register


 The relevant bits of the commit for our example are this ones::

     --- a/hw/pci/pcie_aer.c
     +++ b/hw/pci/pcie_aer.c
     @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

          pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                       PCI_ERR_UNC_SUPPORTED);
     +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
     +                 PCI_ERR_UNC_MASK_DEFAULT);
     +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
     +                 PCI_ERR_UNC_SUPPORTED);

          pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

 The patch changes how we configure PCI space for AER.  But QEMU fails
 when the PCI space configuration is different between source and
 destination.

 The following commit shows how this got fixed::

     commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
     Author: Leonardo Bras <leobras@redhat.com>
     Date:   Tue May 2 21:27:02 2023 -0300

     hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

     [...]

 The relevant parts of the fix in QEMU are as follow:

 First, we create a new property for the device to be able to configure
 the old behaviour or the new behaviour::

     diff --git a/hw/pci/pci.c b/hw/pci/pci.c
     index 8a87ccc8b0..5153ad63d6 100644
     --- a/hw/pci/pci.c
     +++ b/hw/pci/pci.c
     @@ -79,6 +79,8 @@ static Property pci_props[] = {
          DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                             failover_pair_id),
          DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
     +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
     +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
          DEFINE_PROP_END_OF_LIST()
      };

 Notice that we enable the feature for new machine types.

 Now we see how the fix is done.  This is going to depend on what kind
 of breakage happens, but in this case it is quite simple::

     diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
     index 103667c368..374d593ead 100644
     --- a/hw/pci/pcie_aer.c
     +++ b/hw/pci/pcie_aer.c
     @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
     uint16_t offset,

          pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                       PCI_ERR_UNC_SUPPORTED);
     -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
     -                 PCI_ERR_UNC_MASK_DEFAULT);
     -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
     -                 PCI_ERR_UNC_SUPPORTED);
     +
     +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
     +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
     +                     PCI_ERR_UNC_MASK_DEFAULT);
     +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
     +                     PCI_ERR_UNC_SUPPORTED);
     +    }

          pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                       PCI_ERR_UNC_SEVERITY_DEFAULT);

 I.e. If the property bit is enabled, we configure it as we did for
 qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.

 And now, everything that is missing is disabling the feature for old
 machine types::

     diff --git a/hw/core/machine.c b/hw/core/machine.c
     index 47a34841a5..07f763eb2e 100644
     --- a/hw/core/machine.c
     +++ b/hw/core/machine.c
     @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
          { "e1000e", "migrate-timadj", "off" },
          { "virtio-mem", "x-early-migration", "false" },
          { "migration", "x-preempt-pre-7-2", "true" },
     +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
      };
      const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

 And now, when qemu-8.0.1 is released with this fix, all combinations
 are going to work as supposed.

 - $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
 - $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
 - $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
 - $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)

 So the normality has been restored and everything is ok, no?

 Not really, now our matrix is much bigger.  We started with the easy
 cases, migration from the same version to the same version always
 works:

 - $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
 - $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
 - $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

 Now the interesting ones.  When the QEMU processes versions are
 different.  For the 1st set, their fail and we can do nothing, both
 versions are released and we can't change anything.

 - $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
 - $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

 This two are the ones that work. The whole point of making the
 change in qemu-8.0.1 release was to fix this issue:

 - $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
 - $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

 But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
 qemu-8.0.1.

 - $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
 - $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

 So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
 anything except to qemu-8.0.

 Can we do better?

 Yeap.  If we know that we are going to do this migration:

 - $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

 We can launch the appropriate devices with::

   --device...,x-pci-e-err-unc-mask=on

 And now we can receive a migration from 8.0.  And from now on, we can
 do that migration to new machine types if we remember to enable that
 property for pc-7.2.  Notice that we need to remember, it is not
 enough to know that the source of the migration is qemu-8.0.  Think of
 this example:

 $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

 In the second migration, the source is not qemu-8.0, but we still have
 that "problem" and have that property enabled.  Notice that we need to
 continue having this mark/property until we have this machine
 rebooted.  But it is not a normal reboot (that don't reload QEMU) we
 need the machine to poweroff/poweron on a fixed QEMU.  And from now
 on we can use the proper real machine.
	Backwards compatibility
	=======================

	How backwards compatibility works
	---------------------------------

	When we do migration, we have two QEMU processes: the source and the
	target. There are two cases, they are the same version or they are
	different versions. The easy case is when they are the same version.
	The difficult one is when they are different versions.

	There are two things that are different, but they have very similar
	names and sometimes get confused:

	- QEMU version
	- machine type version

	Let's start with a practical example, we start with:

	- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
	- qemu-system-x86_64 (v5.1), from now on qemu-5.1.

	Related to this are the "latest" machine types defined on each of
	them:

	- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
	- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1

	First of all, migration is only supposed to work if you use the same
	machine type in both source and destination. The QEMU hardware
	configuration needs to be the same also on source and destination.
	Most aspects of the backend configuration can be changed at will,
	except for a few cases where the backend features influence frontend
	device feature exposure. But that is not relevant for this section.

	I am going to list the number of combinations that we can have. Let's
	start with the trivial ones, QEMU is the same on source and
	destination:

	1 - qemu-5.2 -M pc-5.2 -> migrates to -> qemu-5.2 -M pc-5.2

	This is the latest QEMU with the latest machine type.
	This have to work, and if it doesn't work it is a bug.

	2 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

	Exactly the same case than the previous one, but for 5.1.
	Nothing to see here either.

	This are the easiest ones, we will not talk more about them in this
	section.

	Now we start with the more interesting cases. Consider the case where
	we have the same QEMU version in both sides (qemu-5.2) but we are using
	the latest machine type for that version (pc-5.2) but one of an older
	QEMU version, in this case pc-5.1.

	3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

	It needs to use the definition of pc-5.1 and the devices as they
	were configured on 5.1, but this should be easy in the sense that
	both sides are the same QEMU and both sides have exactly the same
	idea of what the pc-5.1 machine is.

	4 - qemu-5.1 -M pc-5.2 -> migrates to -> qemu-5.1 -M pc-5.2

	This combination is not possible as the qemu-5.1 doesn't understand
	pc-5.2 machine type. So nothing to worry here.

	Now it comes the interesting ones, when both QEMU processes are
	different. Notice also that the machine type needs to be pc-5.1,
	because we have the limitation than qemu-5.1 doesn't know pc-5.2. So
	the possible cases are:

	5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

	This migration is known as newer to older. We need to make sure
	when we are developing 5.2 we need to take care about not to break
	migration to qemu-5.1. Notice that we can't make updates to
	qemu-5.1 to understand whatever qemu-5.2 decides to change, so it is
	in qemu-5.2 side to make the relevant changes.

	6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

	This migration is known as older to newer. We need to make sure
	than we are able to receive migrations from qemu-5.1. The problem is
	similar to the previous one.

	If qemu-5.1 and qemu-5.2 were the same, there will not be any
	compatibility problems. But the reason that we create qemu-5.2 is to
	get new features, devices, defaults, etc.

	If we get a device that has a new feature, or change a default value,
	we have a problem when we try to migrate between different QEMU
	versions.

	So we need a way to tell qemu-5.2 that when we are using machine type
	pc-5.1, it needs to not use the feature, to be able to migrate to
	real qemu-5.1.

	And the equivalent part when migrating from qemu-5.1 to qemu-5.2.
	qemu-5.2 has to expect that it is not going to get data for the new
	feature, because qemu-5.1 doesn't know about it.

	How do we tell QEMU about these device feature changes? In
	hw/core/machine.c:hw_compat_X_Y arrays.

	If we change a default value, we need to put back the old value on
	that array. And the device, during initialization needs to look at
	that array to see what value it needs to get for that feature. And
	what are we going to put in that array, the value of a property.

	To create a property for a device, we need to use one of the
	DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
	macros that exist. With it, we set the default value for that
	property, and that is what it is going to get in the latest released
	version. But if we want a different value for a previous version, we
	can change that in the hw_compat_X_Y arrays.

	hw_compat_X_Y is an array of registers that have the format:

	- name_device
	- name_property
	- value

	Let's see a practical example.

	In qemu-5.2 virtio-blk-device got multi queue support. This is a
	change that is not backward compatible. In qemu-5.1 it has one
	queue. In qemu-5.2 it has the same number of queues as the number of
	cpus in the system.

	When we are doing migration, if we migrate from a device that has 4
	queues to a device that have only one queue, we don't know where to
	put the extra information for the other 3 queues, and we fail
	migration.

	Similar problem when we migrate from qemu-5.1 that has only one queue
	to qemu-5.2, we only sent information for one queue, but destination
	has 4, and we have 3 queues that are not properly initialized and
	anything can happen.

	So, how can we address this problem. Easy, just convince qemu-5.2
	that when it is running pc-5.1, it needs to set the number of queues
	for virtio-blk-devices to 1.

	That way we fix the cases 5 and 6.

	5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

	qemu-5.2 -M pc-5.1 sets number of queues to be 1.
	qemu-5.1 -M pc-5.1 expects number of queues to be 1.

	correct. migration works.

	6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

	qemu-5.1 -M pc-5.1 sets number of queues to be 1.
	qemu-5.2 -M pc-5.1 expects number of queues to be 1.

	correct. migration works.

	And now the other interesting case, case 3. In this case we have:

	3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

	Here we have the same QEMU in both sides. So it doesn't matter a
	lot if we have set the number of queues to 1 or not, because
	they are the same.

	WRONG!

	Think what happens if we do one of this double migrations:

	A -> migrates -> B -> migrates -> C

	where:

	A: qemu-5.1 -M pc-5.1
	B: qemu-5.2 -M pc-5.1
	C: qemu-5.2 -M pc-5.1

	migration A -> B is case 6, so number of queues needs to be 1.

	migration B -> C is case 3, so we don't care. But actually we
	care because we haven't started the guest in qemu-5.2, it came
	migrated from qemu-5.1. So to be in the safe place, we need to
	always use number of queues 1 when we are using pc-5.1.

	Now, how was this done in reality? The following commit shows how it
	was done::

	commit 9445e1e15e66c19e42bea942ba810db28052cd05
	Author: Stefan Hajnoczi <stefanha@redhat.com>
	Date: Tue Aug 18 15:33:47 2020 +0100

	virtio-blk-pci: default num_queues to -smp N

	The relevant parts for migration are::

	@@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
	#endif
	DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
	true),
	- DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
	+ DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
	+ VIRTIO_BLK_AUTO_NUM_QUEUES),
	DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

	It changes the default value of num_queues. But it fishes it for old
	machine types to have the right value::

	@@ -31,6 +31,7 @@
	GlobalProperty hw_compat_5_1[] = {
	...
	+ { "virtio-blk-device", "num-queues", "1"},
	...
	};

	A device with different features on both sides
	----------------------------------------------

	Let's assume that we are using the same QEMU binary on both sides,
	just to make the things easier. But we have a device that has
	different features on both sides of the migration. That can be
	because the devices are different, because the kernel driver of both
	devices have different features, whatever.

	How can we get this to work with migration. The way to do that is
	"theoretically" easy. You have to get the features that the device
	has in the source of the migration. The features that the device has
	on the target of the migration, you get the intersection of the
	features of both sides, and that is the way that you should launch
	QEMU.

	Notice that this is not completely related to QEMU. The most
	important thing here is that this should be handled by the managing
	application that launches QEMU. If QEMU is configured correctly, the
	migration will succeed.

	That said, actually doing it is complicated. Almost all devices are
	bad at being able to be launched with only some features enabled.
	With one big exception: cpus.

	You can read the documentation for QEMU x86 cpu models here:

	https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

	See when they talk about migration they recommend that one chooses the
	newest cpu model that is supported for all cpus.

	Let's say that we have:

	Host A:

	Device X has the feature Y

	Host B:

	Device X has not the feature Y

	If we try to migrate without any care from host A to host B, it will
	fail because when migration tries to load the feature Y on
	destination, it will find that the hardware is not there.

	Doing this would be the equivalent of doing with cpus:

	Host A:

	$ qemu-system-x86_64 -cpu host

	Host B:

	$ qemu-system-x86_64 -cpu host

	When both hosts have different cpu features this is guaranteed to
	fail. Especially if Host B has less features than host A. If host A
	has less features than host B, sometimes it works. Important word of
	last sentence is "sometimes".

	So, forgetting about cpu models and continuing with the -cpu host
	example, let's see that the differences of the cpus is that Host A and
	B have the following features:

	Features: 'pcid' 'stibp' 'taa-no'
	Host A: X X
	Host B: X

	And we want to migrate between them, the way configure both QEMU cpu
	will be:

	Host A:

	$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

	Host B:

	$ qemu-system-x86_64 -cpu host,taa-no=off

	And you would be able to migrate between them. It is responsibility
	of the management application or of the user to make sure that the
	configuration is correct. QEMU doesn't know how to look at this kind
	of features in general.

	Notice that we don't recommend to use -cpu host for migration. It is
	used in this example because it makes the example simpler.

	Other devices have worse control about individual features. If they
	want to be able to migrate between hosts that show different features,
	the device needs a way to configure which ones it is going to use.

	In this section we have considered that we are using the same QEMU
	binary in both sides of the migration. If we use different QEMU
	versions process, then we need to have into account all other
	differences and the examples become even more complicated.

	How to mitigate when we have a backward compatibility error
	-----------------------------------------------------------

	We broke migration for old machine types continuously during
	development. But as soon as we find that there is a problem, we fix
	it. The problem is what happens when we detect after we have done a
	release that something has gone wrong.

	Let see how it worked with one example.

	After the release of qemu-8.0 we found a problem when doing migration
	of the machine type pc-7.2.

	- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2

	This migration works

	- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2

	This migration works

	- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2

	This migration fails

	- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2

	This migration fails

	So clearly something fails when migration between qemu-7.2 and
	qemu-8.0 with machine type pc-7.2. The error messages, and git bisect
	pointed to this commit.

	In qemu-8.0 we got this commit::

	commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
	Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
	Date: Thu Mar 2 13:37:02 2023 +0000

	hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register


	The relevant bits of the commit for our example are this ones::

	--- a/hw/pci/pcie_aer.c
	+++ b/hw/pci/pcie_aer.c
	@@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

	pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
	PCI_ERR_UNC_SUPPORTED);
	+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
	+ PCI_ERR_UNC_MASK_DEFAULT);
	+ pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
	+ PCI_ERR_UNC_SUPPORTED);

	pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
	PCI_ERR_UNC_SEVERITY_DEFAULT);

	The patch changes how we configure PCI space for AER. But QEMU fails
	when the PCI space configuration is different between source and
	destination.

	The following commit shows how this got fixed::

	commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
	Author: Leonardo Bras <leobras@redhat.com>
	Date: Tue May 2 21:27:02 2023 -0300

	hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

	[...]

	The relevant parts of the fix in QEMU are as follow:

	First, we create a new property for the device to be able to configure
	the old behaviour or the new behaviour::

	diff --git a/hw/pci/pci.c b/hw/pci/pci.c
	index 8a87ccc8b0..5153ad63d6 100644
	--- a/hw/pci/pci.c
	+++ b/hw/pci/pci.c
	@@ -79,6 +79,8 @@ static Property pci_props[] = {
	DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
	failover_pair_id),
	DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0),
	+ DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
	+ QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
	DEFINE_PROP_END_OF_LIST()
	};

	Notice that we enable the feature for new machine types.

	Now we see how the fix is done. This is going to depend on what kind
	of breakage happens, but in this case it is quite simple::

	diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
	index 103667c368..374d593ead 100644
	--- a/hw/pci/pcie_aer.c
	+++ b/hw/pci/pcie_aer.c
	@@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
	uint16_t offset,

	pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
	PCI_ERR_UNC_SUPPORTED);
	- pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
	- PCI_ERR_UNC_MASK_DEFAULT);
	- pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
	- PCI_ERR_UNC_SUPPORTED);
	+
	+ if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
	+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
	+ PCI_ERR_UNC_MASK_DEFAULT);
	+ pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
	+ PCI_ERR_UNC_SUPPORTED);
	+ }

	pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
	PCI_ERR_UNC_SEVERITY_DEFAULT);

	I.e. If the property bit is enabled, we configure it as we did for
	qemu-8.0. If the property bit is not set, we configure it as it was in 7.2.

	And now, everything that is missing is disabling the feature for old
	machine types::

	diff --git a/hw/core/machine.c b/hw/core/machine.c
	index 47a34841a5..07f763eb2e 100644
	--- a/hw/core/machine.c
	+++ b/hw/core/machine.c
	@@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
	{ "e1000e", "migrate-timadj", "off" },
	{ "virtio-mem", "x-early-migration", "false" },
	{ "migration", "x-preempt-pre-7-2", "true" },
	+ { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
	};
	const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

	And now, when qemu-8.0.1 is released with this fix, all combinations
	are going to work as supposed.

	- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
	- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
	- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
	- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)

	So the normality has been restored and everything is ok, no?

	Not really, now our matrix is much bigger. We started with the easy
	cases, migration from the same version to the same version always
	works:

	- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
	- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
	- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2

	Now the interesting ones. When the QEMU processes versions are
	different. For the 1st set, their fail and we can do nothing, both
	versions are released and we can't change anything.

	- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
	- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2

	This two are the ones that work. The whole point of making the
	change in qemu-8.0.1 release was to fix this issue:

	- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
	- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2

	But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
	qemu-8.0.1.

	- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
	- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2

	So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
	anything except to qemu-8.0.

	Can we do better?

	Yeap. If we know that we are going to do this migration:

	- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2

	We can launch the appropriate devices with::

	--device...,x-pci-e-err-unc-mask=on

	And now we can receive a migration from 8.0. And from now on, we can
	do that migration to new machine types if we remember to enable that
	property for pc-7.2. Notice that we need to remember, it is not
	enough to know that the source of the migration is qemu-8.0. Think of
	this example:

	$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

	In the second migration, the source is not qemu-8.0, but we still have
	that "problem" and have that property enabled. Notice that we need to
	continue having this mark/property until we have this machine
	rebooted. But it is not a normal reboot (that don't reload QEMU) we
	need the machine to poweroff/poweron on a fixed QEMU. And from now
	on we can use the proper real machine.