|  | ============== | 
|  | The memory API | 
|  | ============== | 
|  |  | 
|  | The memory API models the memory and I/O buses and controllers of a QEMU | 
|  | machine.  It attempts to allow modelling of: | 
|  |  | 
|  | - ordinary RAM | 
|  | - memory-mapped I/O (MMIO) | 
|  | - memory controllers that can dynamically reroute physical memory regions | 
|  | to different destinations | 
|  |  | 
|  | The memory model provides support for | 
|  |  | 
|  | - tracking RAM changes by the guest | 
|  | - setting up coalesced memory for KVM | 
|  | - setting up ioeventfd regions for KVM | 
|  |  | 
|  | Memory is modelled as an acyclic graph of MemoryRegion objects.  Sinks | 
|  | (leaves) are RAM and MMIO regions, while other nodes represent | 
|  | buses, memory controllers, and memory regions that have been rerouted. | 
|  |  | 
|  | In addition to MemoryRegion objects, the memory API provides AddressSpace | 
|  | objects for every root and possibly for intermediate MemoryRegions too. | 
|  | These represent memory as seen from the CPU or a device's viewpoint. | 
|  |  | 
|  | Types of regions | 
|  | ---------------- | 
|  |  | 
|  | There are multiple types of memory regions (all represented by a single C type | 
|  | MemoryRegion): | 
|  |  | 
|  | - RAM: a RAM region is simply a range of host memory that can be made available | 
|  | to the guest. | 
|  | You typically initialize these with memory_region_init_ram().  Some special | 
|  | purposes require the variants memory_region_init_resizeable_ram(), | 
|  | memory_region_init_ram_from_file(), or memory_region_init_ram_ptr(). | 
|  |  | 
|  | - MMIO: a range of guest memory that is implemented by host callbacks; | 
|  | each read or write causes a callback to be called on the host. | 
|  | You initialize these with memory_region_init_io(), passing it a | 
|  | MemoryRegionOps structure describing the callbacks. | 
|  |  | 
|  | - ROM: a ROM memory region works like RAM for reads (directly accessing | 
|  | a region of host memory), and forbids writes. You initialize these with | 
|  | memory_region_init_rom(). | 
|  |  | 
|  | - ROM device: a ROM device memory region works like RAM for reads | 
|  | (directly accessing a region of host memory), but like MMIO for | 
|  | writes (invoking a callback).  You initialize these with | 
|  | memory_region_init_rom_device(). | 
|  |  | 
|  | - IOMMU region: an IOMMU region translates addresses of accesses made to it | 
|  | and forwards them to some other target memory region.  As the name suggests, | 
|  | these are only needed for modelling an IOMMU, not for simple devices. | 
|  | You initialize these with memory_region_init_iommu(). | 
|  |  | 
|  | - container: a container simply includes other memory regions, each at | 
|  | a different offset.  Containers are useful for grouping several regions | 
|  | into one unit.  For example, a PCI BAR may be composed of a RAM region | 
|  | and an MMIO region. | 
|  |  | 
|  | A container's subregions are usually non-overlapping.  In some cases it is | 
|  | useful to have overlapping regions; for example a memory controller that | 
|  | can overlay a subregion of RAM with MMIO or ROM, or a PCI controller | 
|  | that does not prevent cards from claiming overlapping BARs. | 
|  |  | 
|  | You initialize a pure container with memory_region_init(). | 
|  |  | 
|  | - alias: a subsection of another region. Aliases allow a region to be | 
|  | split apart into discontiguous regions. Examples of uses are memory | 
|  | banks used when the guest address space is smaller than the amount | 
|  | of RAM addressed, or a memory controller that splits main memory to | 
|  | expose a "PCI hole". You can also create aliases to avoid trying to | 
|  | add the original region to multiple parents via | 
|  | memory_region_add_subregion(). | 
|  |  | 
|  | Aliases may point to any type of region, including other aliases, | 
|  | but an alias may not point back to itself, directly or indirectly. | 
|  | You initialize these with memory_region_init_alias(). | 
|  |  | 
|  | - reservation region: a reservation region is primarily for debugging. | 
|  | It claims I/O space that is not supposed to be handled by QEMU itself. | 
|  | The typical use is to track parts of the address space which will be | 
|  | handled by the host kernel when KVM is enabled.  You initialize these | 
|  | by passing a NULL callback parameter to memory_region_init_io(). | 
|  |  | 
|  | It is valid to add subregions to a region which is not a pure container | 
|  | (that is, to an MMIO, RAM or ROM region). This means that the region | 
|  | will act like a container, except that any addresses within the container's | 
|  | region which are not claimed by any subregion are handled by the | 
|  | container itself (i.e. by its MMIO callbacks or RAM backing). However | 
|  | it is generally possible to achieve the same effect with a pure container | 
|  | one of whose subregions is a low priority "background" region covering | 
|  | the whole address range; this is often clearer and is preferred. | 
|  | Subregions cannot be added to an alias region. | 
|  |  | 
|  | Migration | 
|  | --------- | 
|  |  | 
|  | Where the memory region is backed by host memory (RAM, ROM and | 
|  | ROM device memory region types), this host memory needs to be | 
|  | copied to the destination on migration. The following APIs, which | 
|  | allocate the host memory for you, also register the memory so that | 
|  | it is migrated: | 
|  |  | 
|  | - memory_region_init_ram() | 
|  | - memory_region_init_rom() | 
|  | - memory_region_init_rom_device() | 
|  |  | 
|  | For most devices and boards this is the correct thing. If you | 
|  | have a special case where you need to manage the migration of | 
|  | the backing memory yourself, you can call the functions: | 
|  |  | 
|  | - memory_region_init_ram_nomigrate() | 
|  | - memory_region_init_rom_nomigrate() | 
|  | - memory_region_init_rom_device_nomigrate() | 
|  |  | 
|  | which only initialize the MemoryRegion and leave handling | 
|  | migration to the caller. | 
|  |  | 
|  | The functions: | 
|  |  | 
|  | - memory_region_init_resizeable_ram() | 
|  | - memory_region_init_ram_from_file() | 
|  | - memory_region_init_ram_from_fd() | 
|  | - memory_region_init_ram_ptr() | 
|  | - memory_region_init_ram_device_ptr() | 
|  |  | 
|  | are for special cases only, and so they do not automatically | 
|  | register the backing memory for migration; the caller must | 
|  | manage migration if necessary. | 
|  |  | 
|  | Region names | 
|  | ------------ | 
|  |  | 
|  | Regions are assigned names by the constructor.  For most regions these are | 
|  | only used for debugging purposes, but RAM regions also use the name to identify | 
|  | live migration sections.  This means that RAM region names need to have ABI | 
|  | stability. | 
|  |  | 
|  | Region lifecycle | 
|  | ---------------- | 
|  |  | 
|  | A region is created by one of the memory_region_init*() functions and | 
|  | attached to an object, which acts as its owner or parent.  QEMU ensures | 
|  | that the owner object remains alive as long as the region is visible to | 
|  | the guest, or as long as the region is in use by a virtual CPU or another | 
|  | device.  For example, the owner object will not die between an | 
|  | address_space_map operation and the corresponding address_space_unmap. | 
|  |  | 
|  | After creation, a region can be added to an address space or a | 
|  | container with memory_region_add_subregion(), and removed using | 
|  | memory_region_del_subregion(). | 
|  |  | 
|  | Various region attributes (read-only, dirty logging, coalesced mmio, | 
|  | ioeventfd) can be changed during the region lifecycle.  They take effect | 
|  | as soon as the region is made visible.  This can be immediately, later, | 
|  | or never. | 
|  |  | 
|  | Destruction of a memory region happens automatically when the owner | 
|  | object dies. | 
|  |  | 
|  | If however the memory region is part of a dynamically allocated data | 
|  | structure, you should call object_unparent() to destroy the memory region | 
|  | before the data structure is freed.  For an example see VFIOMSIXInfo | 
|  | and VFIOQuirk in hw/vfio/pci.c. | 
|  |  | 
|  | You must not destroy a memory region as long as it may be in use by a | 
|  | device or CPU.  To ensure this, as a general rule do not create or | 
|  | destroy memory regions dynamically during a device's lifetime, and only | 
|  | call object_unparent() in the memory region owner's instance_finalize | 
|  | callback.  The dynamically allocated data structure that contains the | 
|  | memory region should then be freed in the instance_finalize callback | 
|  | as well. | 
|  |  | 
|  | If you break this rule, the following situation can happen: | 
|  |  | 
|  | - the memory region's owner had a reference taken via memory_region_ref | 
|  | (for example by address_space_map) | 
|  |  | 
|  | - the region is unparented, and has no owner anymore | 
|  |  | 
|  | - when address_space_unmap is called, the reference to the memory region's | 
|  | owner is leaked. | 
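
The three steps above can be replayed with a toy reference-counting model.
This is only a sketch: ``Owner``, ``ToyRegion`` and the ``toy_*`` helpers are
hypothetical stand-ins invented for illustration, not QEMU types or
functions, and the model assumes that taking or dropping a reference on a
region simply forwards to its owner when one is set.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the owning QOM object (hypothetical, not a QEMU type). */
typedef struct {
    int refcount;
} Owner;

/* Toy stand-in for a memory region (hypothetical, not MemoryRegion). */
typedef struct {
    Owner *owner;
} ToyRegion;

/* Taking a reference on the region forwards to its owner, if any. */
static void toy_region_ref(ToyRegion *mr)
{
    if (mr->owner) {
        mr->owner->refcount++;
    }
}

/* Dropping a reference also forwards to the owner, if any. */
static void toy_region_unref(ToyRegion *mr)
{
    if (mr->owner) {
        mr->owner->refcount--;
    }
}

/* Unparenting is modelled as the region forgetting its owner. */
static void toy_unparent(ToyRegion *mr)
{
    mr->owner = NULL;
}

/* Replays the three steps and returns the owner's leftover refcount. */
static int replay_leak(void)
{
    Owner dev = { .refcount = 1 };      /* the device's own reference */
    ToyRegion mr = { .owner = &dev };

    toy_region_ref(&mr);    /* step 1: e.g. taken while mapping */
    toy_unparent(&mr);      /* step 2: region loses its owner while in use */
    toy_region_unref(&mr);  /* step 3: the unmap finds no owner to unref */

    return dev.refcount;
}
```

In this model ``replay_leak()`` returns 2 rather than 1: the reference taken
in step 1 is never dropped, so the owner can never be finalized.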
|  |  | 
|  |  | 
|  | There is an exception to the above rule: it is okay to call | 
|  | object_unparent at any time for an alias or a container region.  It is | 
|  | therefore also okay to create or destroy alias and container regions | 
|  | dynamically during a device's lifetime. | 
|  |  | 
|  | This exceptional usage is valid because aliases and containers only help | 
|  | QEMU build the guest's memory map; they are never accessed directly. | 
|  | memory_region_ref and memory_region_unref are never called on aliases | 
|  | or containers, and the above situation then cannot happen.  Exploiting | 
|  | this exception is rarely necessary, and therefore it is discouraged, | 
|  | but nevertheless it is used in a few places. | 
|  |  | 
|  | For regions that "have no owner" (NULL is passed at creation time), the | 
|  | machine object is actually used as the owner.  Since instance_finalize is | 
|  | never called for the machine object, you must never call object_unparent | 
|  | on regions that have no owner, unless they are aliases or containers. | 
|  |  | 
|  |  | 
|  | Overlapping regions and priority | 
|  | -------------------------------- | 
|  | Usually, regions may not overlap each other; a memory address decodes into | 
|  | exactly one target.  In some cases it is useful to allow regions to overlap, | 
|  | and sometimes to control which of the overlapping regions is visible to the | 
|  | guest.  This is done with memory_region_add_subregion_overlap(), which | 
|  | allows the region to overlap any other region in the same container, and | 
|  | specifies a priority that allows the core to decide which of two regions at | 
|  | the same address is visible (highest wins). | 
|  | Priority values are signed, and the default value is zero. This means that | 
|  | you can use memory_region_add_subregion_overlap() both to specify a region | 
|  | that must sit 'above' any others (with a positive priority) and also a | 
|  | background region that sits 'below' others (with a negative priority). | 
|  |  | 
|  | If the higher priority region in an overlap is a container or alias, then | 
|  | the lower priority region will appear in any "holes" that the higher priority | 
|  | region has left by not mapping subregions to that area of its address range. | 
|  | (This applies recursively -- if the subregions are themselves containers or | 
|  | aliases that leave holes then the lower priority region will appear in these | 
|  | holes too.) | 
|  |  | 
|  | For example, suppose we have a container A of size 0x8000 with two subregions | 
|  | B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is | 
|  | an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two | 
|  | of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at | 
|  | offset 0x2000. As a diagram:: | 
|  |  | 
|  |       0      1000   2000   3000   4000   5000   6000   7000   8000 | 
|  |       |------|------|------|------|------|------|------|------| | 
|  | A:    [                                                      ] | 
|  | C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC] | 
|  | B:                  [                          ] | 
|  | D:                  [DDDDD] | 
|  | E:                                [EEEEE] | 
|  |  | 
|  | The regions that will be seen within this address range then are:: | 
|  |  | 
|  | [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC] | 
|  |  | 
|  | Since B has higher priority than C, its subregions appear in the flat map | 
|  | even where they overlap with C. In ranges where B has not mapped anything | 
|  | C's region appears. | 
|  |  | 
|  | If B had provided its own MMIO operations (i.e. it was not a pure container) | 
|  | then these would be used for any addresses in its range not handled by | 
|  | D or E, and the result would be:: | 
|  |  | 
|  | [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB] | 
|  |  | 
|  | Priority values are local to a container, because the priorities of two | 
|  | regions are only compared when they are both children of the same container. | 
|  | This means that the device in charge of the container (typically modelling | 
|  | a bus or a memory controller) can use them to manage the interaction of | 
|  | its child regions without any side effects on other parts of the system. | 
|  | In the example above, the priorities of D and E are unimportant because | 
|  | they do not overlap each other. It is the relative priority of B and C | 
|  | that causes D and E to appear on top of C: D and E's priorities are never | 
|  | compared against the priority of C. | 
|  |  | 
|  | Visibility | 
|  | ---------- | 
|  | The memory core uses the following rules to select a memory region when the | 
|  | guest accesses an address: | 
|  |  | 
|  | - all direct subregions of the root region are matched against the address, in | 
|  |   descending priority order | 
|  |  | 
|  |   - if the address lies outside the region offset/size, the subregion is | 
|  |     discarded | 
|  |   - if the subregion is a leaf (RAM or MMIO), the search terminates, returning | 
|  |     this leaf region | 
|  |   - if the subregion is a container, the same algorithm is used within the | 
|  |     subregion (after the address is adjusted by the subregion offset) | 
|  |   - if the subregion is an alias, the search is continued at the alias target | 
|  |     (after the address is adjusted by the subregion offset and alias offset) | 
|  |   - if a recursive search within a container or alias subregion does not | 
|  |     find a match (because of a "hole" in the container's coverage of its | 
|  |     address range), then if this is a container with its own MMIO or RAM | 
|  |     backing the search terminates, returning the container itself. Otherwise | 
|  |     we continue with the next subregion in priority order | 
|  |  | 
|  | - if none of the subregions match the address then the search terminates | 
|  |   with no match found | 
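
The selection rules can be sketched as a small recursive lookup. This is a
toy model for illustration only: the ``Region`` struct, ``resolve()`` and
``example_lookup()`` are hypothetical names, not part of the QEMU API, and the
sketch assumes at most 64 children per container and does not clip an alias
to the size of its target. ``example_lookup()`` rebuilds the A/B/C/D/E layout
from the priority example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a memory region (hypothetical, not MemoryRegion). */
typedef struct Region Region;
struct Region {
    const char *name;
    uint64_t offset;        /* offset within the parent container */
    uint64_t size;
    int priority;           /* only compared against siblings */
    int has_backing;        /* nonzero: own MMIO callbacks or RAM */
    const Region *alias;    /* non-NULL: this region is an alias */
    uint64_t alias_offset;  /* start of the window inside the alias target */
    const Region **subs;    /* child regions, in any order */
    size_t nsubs;           /* assumed to be at most 64 in this sketch */
};

/* Resolve addr (relative to the start of mr) to the region handling it. */
static const Region *resolve(const Region *mr, uint64_t addr)
{
    if (mr->alias) {
        /* Continue at the alias target, adjusted by the alias offset. */
        return resolve(mr->alias, addr + mr->alias_offset);
    }

    uint64_t tried = 0;     /* bitmask of children already examined */
    for (size_t n = 0; n < mr->nsubs; n++) {
        /* Pick the untried child with the highest priority. */
        size_t best = mr->nsubs;
        for (size_t i = 0; i < mr->nsubs; i++) {
            if (!(tried & (UINT64_C(1) << i)) &&
                (best == mr->nsubs ||
                 mr->subs[i]->priority > mr->subs[best]->priority)) {
                best = i;
            }
        }
        tried |= UINT64_C(1) << best;

        const Region *sub = mr->subs[best];
        if (addr < sub->offset || addr - sub->offset >= sub->size) {
            continue;       /* outside this subregion: discard it */
        }
        const Region *hit = resolve(sub, addr - sub->offset);
        if (hit) {
            return hit;
        }
        /* A hole in sub: continue with the next sibling. */
    }

    /* Leaves and backed containers handle the access themselves. */
    return mr->has_backing ? mr : NULL;
}

/* Layout from the priority example; returns the visible region's name. */
static const char *example_lookup(uint64_t addr, int b_has_backing)
{
    Region d = { .name = "D", .offset = 0x0000, .size = 0x1000,
                 .has_backing = 1 };
    Region e = { .name = "E", .offset = 0x2000, .size = 0x1000,
                 .has_backing = 1 };
    const Region *b_subs[] = { &d, &e };
    Region b = { .name = "B", .offset = 0x2000, .size = 0x4000,
                 .priority = 2, .has_backing = b_has_backing,
                 .subs = b_subs, .nsubs = 2 };
    Region c = { .name = "C", .offset = 0x0000, .size = 0x6000,
                 .priority = 1, .has_backing = 1 };
    const Region *a_subs[] = { &c, &b };
    Region a = { .name = "A", .size = 0x8000, .subs = a_subs, .nsubs = 2 };

    const Region *hit = resolve(&a, addr);
    return hit ? hit->name : "(none)";
}
```

With B as a pure container, ``example_lookup(0x3000, 0)`` falls through the
hole in B and yields C; with ``b_has_backing`` set, the same address yields B.
Addresses at or above 0x6000 match nothing in either case.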
|  |  | 
|  | Example memory map | 
|  | ------------------ | 
|  |  | 
|  | :: | 
|  |  | 
|  | system_memory: container@0-2^48-1 | 
|  | | | 
|  | +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff) | 
|  | | | 
|  | +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff) | 
|  | | | 
|  | +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff) | 
|  | |      (prio 1) | 
|  | | | 
|  | +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff) | 
|  |  | 
|  | pci (0-2^32-1) | 
|  | | | 
|  | +--- vga-area: container@0xa0000-0xbffff | 
|  | |      | | 
|  | |      +--- alias@0x00000-0x7fff  ---> #vram (0x010000-0x017fff) | 
|  | |      | | 
|  | |      +--- alias@0x08000-0xffff  ---> #vram (0x020000-0x027fff) | 
|  | | | 
|  | +---- vram: ram@0xe1000000-0xe1ffffff | 
|  | | | 
|  | +---- vga-mmio: mmio@0xe2000000-0xe200ffff | 
|  |  | 
|  | ram: ram@0x00000000-0xffffffff | 
|  |  | 
|  | This is a (simplified) PC memory map. The 4GB RAM block is mapped into the | 
|  | system address space via two aliases: "lomem" is a 1:1 mapping of the first | 
|  | 3.5GB; "himem" maps the last 0.5GB at address 4GB.  This leaves 0.5GB for the | 
|  | so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with | 
|  | 4GB of memory. | 
|  |  | 
|  | The memory controller diverts addresses in the range 640K-768K to the PCI | 
|  | address space.  This is modelled using the "vga-window" alias, mapped at a | 
|  | higher priority so it obscures the RAM at the same addresses.  The VGA window | 
|  | can be removed by programming the memory controller; this is modelled by | 
|  | removing the alias and exposing the RAM underneath. | 
|  |  | 
|  | The PCI address space is not a direct child of the system address space, since | 
|  | we only want parts of it to be visible (we accomplish this using aliases). | 
|  | Among its subregions, vga-area models the legacy VGA window and is occupied | 
|  | by two 32K memory banks pointing at two sections of the framebuffer. | 
|  | In addition the vram is mapped as a BAR at address 0xe1000000, and an additional | 
|  | BAR containing MMIO registers is mapped after it. | 
|  |  | 
|  | Note that if the guest maps a BAR outside the PCI hole, it would not be | 
|  | visible as the pci-hole alias clips it to a 0.5GB range. | 
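
The address arithmetic behind the "lomem" and "himem" aliases can be checked
with a short sketch. The ``Alias`` struct and ``alias_translate()`` are
hypothetical helpers written for this illustration, not QEMU functions; the
base, size and target offset values are taken from the map above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy description of an alias (hypothetical, not MemoryRegion). */
typedef struct {
    uint64_t base;           /* where the alias is mapped in its container */
    uint64_t size;           /* length of the aliased window */
    uint64_t target_offset;  /* where the window starts inside the target */
} Alias;

/* lomem: first 3.5GB of RAM, mapped 1:1 at address 0. */
static const Alias lomem = { 0x0, 0xe0000000, 0x0 };

/* himem: last 0.5GB of RAM, mapped at 4GB. */
static const Alias himem = { 0x100000000ULL, 0x20000000, 0xe0000000 };

/*
 * If addr falls inside the alias, store the matching offset into the
 * target region in *out and return 1; otherwise return 0.
 */
static int alias_translate(const Alias *a, uint64_t addr, uint64_t *out)
{
    if (addr < a->base || addr - a->base >= a->size) {
        return 0;
    }
    *out = addr - a->base + a->target_offset;
    return 1;
}
```

For example, system address 0x100000000 lands at RAM offset 0xe0000000
through himem, while 0xe0000000 is outside lomem and is therefore left to the
pci-hole alias.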
|  |  | 
|  | MMIO Operations | 
|  | --------------- | 
|  |  | 
|  | MMIO regions are provided with ->read() and ->write() callbacks, | 
|  | which are sufficient for most devices. Some devices change behaviour | 
|  | based on the attributes used for the memory transaction, or need | 
|  | to be able to respond that the access should provoke a bus error | 
|  | rather than completing successfully; those devices can use the | 
|  | ->read_with_attrs() and ->write_with_attrs() callbacks instead. | 
|  |  | 
|  | In addition various constraints can be supplied to control how these | 
|  | callbacks are called: | 
|  |  | 
|  | - .valid.min_access_size, .valid.max_access_size define the access sizes | 
|  | (in bytes) which the device accepts; accesses outside this range will | 
|  | have device and bus specific behaviour (ignored, or machine check) | 
|  | - .valid.unaligned specifies that the *device being modelled* supports | 
|  | unaligned accesses; if false, unaligned accesses will invoke the | 
|  | appropriate bus or CPU specific behaviour. | 
|  | - .impl.min_access_size, .impl.max_access_size define the access sizes | 
|  | (in bytes) supported by the *implementation*; other access sizes will be | 
|  | emulated using the ones available.  For example a 4-byte write will be | 
|  | emulated using four 1-byte writes, if .impl.max_access_size = 1. | 
|  | - .impl.unaligned specifies that the *implementation* supports unaligned | 
|  | accesses; if false, unaligned accesses will be emulated by two aligned | 
|  | accesses. | 
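
The .impl.max_access_size example can be illustrated with a standalone
sketch of how a 4-byte write might be split into 1-byte writes. The device,
``dev_write_byte()`` and ``emulate_write()`` are hypothetical and are not the
memory core's actual code; the sketch assumes a little-endian device and does
not model the .valid constraints or unaligned accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device: four byte-wide registers. */
static uint8_t regs[4];
static unsigned write_calls;

/* The "implementation" only accepts 1-byte writes (max access size 1). */
static void dev_write_byte(uint64_t addr, uint8_t val)
{
    regs[addr] = val;
    write_calls++;
}

/*
 * Emulate a guest write of `size` bytes using 1-byte writes.
 * Little-endian assumption: the lowest address gets the least
 * significant byte of the value.
 */
static void emulate_write(uint64_t addr, uint64_t val, unsigned size)
{
    for (unsigned i = 0; i < size; i++) {
        dev_write_byte(addr + i, (uint8_t)(val >> (8 * i)));
    }
}
```

Calling ``emulate_write(0, 0x11223344, 4)`` invokes the byte callback four
times and leaves 0x44, 0x33, 0x22, 0x11 in registers 0 to 3.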
|  |  | 
|  | API Reference | 
|  | ------------- | 
|  |  | 
|  | .. kernel-doc:: include/system/memory.h |