| |
| NUMA mechanics for sPAPR (pseries machines) |
| ============================================ |
| |
| NUMA in sPAPR works differently from the System Locality Distance |
| Information Table (SLIT) in ACPI. The logic is explained in LOPAPR |
| 1.1, chapter 15, "Non Uniform Memory Access (NUMA) Option". This |
| document aims to complement that specification by providing details |
| of the elements that impact how QEMU views NUMA in pseries. |
| |
| Associativity and ibm,associativity property |
| -------------------------------------------- |
| |
| Associativity is defined as a group of platform resources that have |
| similar mean performance (or, in our context here, distance) relative to |
| everyone else outside of the group. |
| |
| The format of the ibm,associativity property varies with the value of |
| bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with |
| bit 0 equal to zero is deprecated. In the current format, with bit 0 |
| equal to one, the ibm,associativity property represents the physical |
| hierarchy of the platform as one or more lists, each starting with the |
| highest level grouping and going down to the smallest. Consider the |
| following topology: |
| |
| :: |
| |
| Mem M1 ---- Proc P1    | |
| -----------------      | Socket S1  ---| |
|       chip C1          |               | |
|                                        | HW module 1 (MOD1) |
| Mem M2 ---- Proc P2    |               | |
| -----------------      | Socket S2  ---| |
|       chip C2          | |
| |
| The ibm,associativity property for the processors would be: |
| |
| * P1: {MOD1, S1, C1, P1} |
| * P2: {MOD1, S2, C2, P2} |
| |
| Each allocable resource has an ibm,associativity property. The LOPAPR |
| specification allows multiple lists to be present in this property, |
| considering that the same resource can have multiple connections to the |
| platform. |
| |
| Relative Performance Distance and ibm,associativity-reference-points |
| -------------------------------------------------------------------- |
| |
| The ibm,associativity-reference-points property is an array that is used |
| to define the relevant performance/distance related boundaries, defining |
| the NUMA levels for the platform. |
| |
| The definition of its elements also varies with the value of bit 0 of byte 5 |
| of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero |
| is also deprecated. With the current format, each integer of the |
| ibm,associativity-reference-points represents a 1-based ordinal index (i.e. |
| the first element is 1) into the ibm,associativity array. The first |
| boundary is the most significant to application performance, followed by |
| less significant boundaries. Allocated resources that belong to the |
| same performance boundaries are expected to have a relative NUMA distance |
| that matches the relevancy of the boundary itself. Resources that belong |
| to the same first boundary will have the shortest distance from each |
| other. Subsequent boundaries represent greater distances and degraded |
| performance. |
| |
| Using the previous example, the following reference points setting defines |
| three NUMA levels: |
| |
| * ibm,associativity-reference-points = {0x3, 0x2, 0x1} |
| |
| The first NUMA level (0x3) is interpreted as the third element of each |
| ibm,associativity array, the second level is the second element and |
| the third level is the first element. Let's also consider that elements |
| belonging to the first NUMA level have distance equal to 10 from each |
| other, and that each NUMA level doubles the distance of the previous one. |
| This means that the second level would be 20 and the third level 40. For |
| the P1 and P2 processors, we would have the following NUMA levels: |
| |
| :: |
| |
| * ibm,associativity-reference-points = {0x3, 0x2, 0x1} |
| |
| * P1: associativity{MOD1, S1, C1, P1} |
| |
| First NUMA level (0x3) => associativity[2] = C1 |
| Second NUMA level (0x2) => associativity[1] = S1 |
| Third NUMA level (0x1) => associativity[0] = MOD1 |
| |
| * P2: associativity{MOD1, S2, C2, P2} |
| |
| First NUMA level (0x3) => associativity[2] = C2 |
| Second NUMA level (0x2) => associativity[1] = S2 |
| Third NUMA level (0x1) => associativity[0] = MOD1 |
| |
| P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40 |
| |
| Changing the ibm,associativity-reference-points array changes the performance |
| distance attributes for the same associativity arrays, as the following |
| example illustrates: |
| |
| :: |
| |
| * ibm,associativity-reference-points = {0x2} |
| |
| * P1: associativity{MOD1, S1, C1, P1} |
| |
| First NUMA level (0x2) => associativity[1] = S1 |
| |
| * P2: associativity{MOD1, S2, C2, P2} |
| |
| First NUMA level (0x2) => associativity[1] = S2 |
| |
| P1 and P2 do not have a common performance boundary. Since this is a one level |
| NUMA configuration, the distance between them is one boundary above the first |
| level, 20. |
| |
| |
| In a hypothetical platform where all resources inside the same hardware module |
| are considered to be on the same performance boundary: |
| |
| :: |
| |
| * ibm,associativity-reference-points = {0x1} |
| |
| * P1: associativity{MOD1, S1, C1, P1} |
| |
| First NUMA level (0x1) => associativity[0] = MOD1 |
| |
| * P2: associativity{MOD1, S2, C2, P2} |
| |
| First NUMA level (0x1) => associativity[0] = MOD1 |
| |
| P1 and P2 belong to the same first order boundary. The distance between them |
| is 10. |
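| |
| The three examples above can be reproduced with a short Python sketch. It is |
| only an illustration of the rule assumed in this section (start at 10 and |
| double for every reference point that does not match); the function name |
| ``numa_distance`` and the list representation are assumptions made for the |
| example, not part of LOPAPR or QEMU. |
| |

```python
def numa_distance(assoc_a, assoc_b, ref_points, local_distance=10):
    """Distance between two resources under the doubling rule.

    ref_points holds 1-based indexes into the associativity arrays,
    ordered from the most to the least significant boundary.
    """
    distance = local_distance
    for point in ref_points:
        # A match at this boundary stops the search.
        if assoc_a[point - 1] == assoc_b[point - 1]:
            return distance
        # No match: the resources are one more NUMA level apart.
        distance *= 2
    return distance


P1 = ["MOD1", "S1", "C1", "P1"]
P2 = ["MOD1", "S2", "C2", "P2"]

for ref_points in ([3, 2, 1], [2], [1]):
    print(ref_points, numa_distance(P1, P2, ref_points))
# [3, 2, 1] 40
# [2] 20
# [1] 10
```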
| |
| |
| How the pseries Linux guest calculates NUMA distances |
| ===================================================== |
| |
| Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is |
| how the distances are expressed. The SLIT table provides the NUMA distance |
| value between the relevant resources. LOPAPR does not provide a standard |
| way to calculate it. We have the ibm,associativity for each resource, which |
| provides a common-performance hierarchy, and the ibm,associativity-reference-points |
| array that tells which level of associativity is considered to be relevant |
| or not. |
| |
| The result is that each OS is free to implement and to interpret the distance |
| as it sees fit. For the pseries Linux guest, each level of NUMA doubles |
| the distance of the previous level, and the maximum number of levels is |
| limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the |
| kernel tree). This results in the following distances: |
| |
| * both resources in the first NUMA level: 10 |
| * resources one NUMA level apart: 20 |
| * resources two NUMA levels apart: 40 |
| * resources three NUMA levels apart: 80 |
| * resources four NUMA levels apart: 160 |
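| |
| As a quick check of the list above, the same values follow from doubling the |
| local distance once per level. This is a minimal sketch; ``LOCAL_DISTANCE`` is |
| a name chosen here for the value 10 used throughout this document. |
| |

```python
MAX_DISTANCE_REF_POINTS = 4  # limit quoted above from the kernel tree
LOCAL_DISTANCE = 10          # assumed name for the local distance value

# Index n of the list is the distance for resources n NUMA levels apart.
distances = [LOCAL_DISTANCE << levels_apart
             for levels_apart in range(MAX_DISTANCE_REF_POINTS + 1)]
print(distances)  # [10, 20, 40, 80, 160]
```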
| |
| |
| pseries NUMA mechanics |
| ====================== |
| |
| Starting in QEMU 5.2, the pseries machine considers user input when setting the |
| NUMA topology of the guest. The overall design is: |
| |
| * ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing |
| for 4 distinct NUMA distance values based on the NUMA levels |
| |
| * ibm,max-associativity-domains supports multiple associativity domains in all |
| NUMA levels, granting user flexibility |
| |
| * ibm,associativity for all resources varies with user input |
| |
| These changes are only effective for pseries-5.2 and newer machines that are |
| created with more than one NUMA node (not counting NUMA nodes created by |
| the machine itself, e.g. NVLink2 GPUs). The now legacy support has been |
| around for a long time, with users seeing NUMA distances 10 and 40 |
| (and 80 if using NVLink2 GPUs), and there is no need to disrupt the |
| existing experience of those guests. |
| |
| To give pseries users the experience x86 users have when tuning NUMA, we had |
| to operate under the current pseries Linux kernel logic described in |
| `How the pseries Linux guest calculates NUMA distances`_. The result |
| is that we needed to translate the user's NUMA distance input into the |
| values the pseries Linux kernel works with. |
| |
| Translating user distance to kernel distance |
| -------------------------------------------- |
| |
| User input for NUMA distance can vary from 10 to 254. We need to translate |
| that to the values that the Linux kernel operates on (10, 20, 40, 80, 160). |
| This is how it is being done: |
| |
| * user distance 11 to 30 will be interpreted as 20 |
| * user distance 31 to 60 will be interpreted as 40 |
| * user distance 61 to 120 will be interpreted as 80 |
| * user distance 121 and beyond will be interpreted as 160 |
| * user distance 10 stays 10 |
| |
| The reasoning behind this approximation is to avoid any rounding to the local |
| distance (10), keeping it exclusive to the 4th NUMA level (which is still |
| exclusive to the node_id). All other ranges were chosen at the developers' |
| discretion as (somewhat) sensible given the user input. |
| Any other strategy could be used here, but in the end we have to accept |
| that a large range of values will be translated to the same |
| NUMA topology in the guest, e.g. this user input: |
| |
| :: |
| |
|      0    1    2 |
| 0   10   31  120 |
| 1   31   10   30 |
| 2  120   30   10 |
| |
| And this other user input: |
| |
| :: |
| |
|      0    1    2 |
| 0   10   60   61 |
| 1   60   10   11 |
| 2   61   11   10 |
| |
| Will both be translated to the same values internally: |
| |
| :: |
| |
|      0    1    2 |
| 0   10   40   80 |
| 1   40   10   20 |
| 2   80   20   10 |
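| |
| The translation ranges listed earlier can be written as a small Python sketch |
| that confirms both inputs collapse to the same matrix. The function names |
| ``translate_distance`` and ``translate_matrix`` are illustrative and are not |
| the names used in the QEMU source. |
| |

```python
def translate_distance(user_distance):
    """Map a user NUMA distance (10..254) to a value the kernel operates on."""
    if not 10 <= user_distance <= 254:
        raise ValueError("NUMA distance must be between 10 and 254")
    if user_distance == 10:
        return 10  # local distance is never the result of rounding
    for upper_bound, kernel_value in ((30, 20), (60, 40), (120, 80)):
        if user_distance <= upper_bound:
            return kernel_value
    return 160  # 121 and beyond


def translate_matrix(matrix):
    return [[translate_distance(d) for d in row] for row in matrix]


first = [[10, 31, 120], [31, 10, 30], [120, 30, 10]]
second = [[10, 60, 61], [60, 10, 11], [61, 11, 10]]

print(translate_matrix(first))
# [[10, 40, 80], [40, 10, 20], [80, 20, 10]]
print(translate_matrix(first) == translate_matrix(second))
# True
```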
| |
| Users are encouraged to use only the kernel values in the NUMA definition to |
| avoid being taken by surprise by what the guest is actually seeing in the |
| topology. There are enough potential surprises that are inherent to the |
| associativity domain assignment process, discussed below. |
| |
| |
| How associativity domains are assigned |
| -------------------------------------- |
| |
| LOPAPR allows more than one associativity array (or 'string') per allocated |
| resource. This would be used to represent that the resource has multiple |
| connections with the board, and then the operating system, when deciding |
| NUMA distancing, should consider the associativity information that provides |
| the shortest distance. |
| |
| The spapr implementation does not support multiple associativity arrays per |
| resource, and neither does the pseries Linux kernel. We'll have to represent |
| the NUMA topology using one associativity array per resource, which means that |
| choices and compromises are going to be made. |
| |
| Consider the following NUMA topology entered by user input: |
| |
| :: |
| |
|      0    1    2    3 |
| 0   10   40   20   40 |
| 1   40   10   80   40 |
| 2   20   80   10   20 |
| 3   40   40   20   10 |
| |
| All the associativity arrays are initialized with the NUMA node id in all |
| associativity domains: |
| |
| * node 0: 0 0 0 0 |
| * node 1: 1 1 1 1 |
| * node 2: 2 2 2 2 |
| * node 3: 3 3 3 3 |
| |
| |
| Honoring just the relative distances of node 0 to every other node, we find the |
| NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for |
| each distance: |
| |
| * distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match |
| at 0x2) |
| * distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3) |
| * distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match |
| at 0x2) |
| |
| We'll copy the associativity domains of node 0 to all other nodes, based on |
| the NUMA level matches. Between 0 and 1 there is a match at 0x2, so we copy |
| the domains 0x2 and 0x1 from node 0 to node 1. This will give us: |
| |
| * node 0: 0 0 0 0 |
| * node 1: 0 0 1 1 |
| |
| Doing the same to node 2 and node 3, these are the associativity arrays |
| after considering all matches with node 0: |
| |
| * node 0: 0 0 0 0 |
| * node 1: 0 0 1 1 |
| * node 2: 0 0 0 2 |
| * node 3: 0 0 3 3 |
| |
| The distances related to node 0 are accounted for. For node 1, and keeping |
| in mind that we don't need to revisit node 0, the distance from |
| node 1 to 2 is 80, matching at 0x1, and the distance from 1 to 3 is 40, |
| matching at 0x2. Repeating the same logic of copying all domains up to |
| the NUMA level match: |
| |
| * node 0: 0 0 0 0 |
| * node 1: 1 0 1 1 |
| * node 2: 1 0 0 2 |
| * node 3: 1 0 3 3 |
| |
| In the last step we will analyze just nodes 2 and 3. The desired distance |
| between 2 and 3 is 20, i.e. a match in 0x3: |
| |
| * node 0: 0 0 0 0 |
| * node 1: 1 0 1 1 |
| * node 2: 1 0 0 2 |
| * node 3: 1 0 0 3 |
| |
| |
| The kernel will read these arrays and will calculate the following NUMA topology for |
| the guest: |
| |
| :: |
| |
|      0    1    2    3 |
| 0   10   40   20   20 |
| 1   40   10   40   40 |
| 2   20   40   10   20 |
| 3   20   40   20   10 |
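| |
| The topology above can be recomputed from the final associativity arrays with |
| a short Python sketch of the kernel logic described in this document (start |
| at 10, walk the reference points {0x4, 0x3, 0x2, 0x1}, double on every |
| mismatch). The names used are illustrative assumptions, not kernel or QEMU |
| identifiers. |
| |

```python
REF_POINTS = [4, 3, 2, 1]  # 1-based indexes into each associativity array


def numa_distance(assoc_a, assoc_b, ref_points=REF_POINTS, local_distance=10):
    distance = local_distance
    for point in ref_points:
        if assoc_a[point - 1] == assoc_b[point - 1]:
            return distance  # first (closest) matching boundary wins
        distance *= 2
    return distance


# Final associativity arrays from the walkthrough above.
assoc = {
    0: [0, 0, 0, 0],
    1: [1, 0, 1, 1],
    2: [1, 0, 0, 2],
    3: [1, 0, 0, 3],
}

nodes = sorted(assoc)
matrix = [[numa_distance(assoc[a], assoc[b]) for b in nodes] for a in nodes]
for row in matrix:
    print(row)
# [10, 40, 20, 20]
# [40, 10, 40, 40]
# [20, 40, 10, 20]
# [20, 40, 20, 10]
```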
| |
| Note that this is not what the user wanted - the desired distance between |
| 0 and 3 is 40, but we calculated it as 20, and the desired distance between |
| 1 and 2 is 80, but we calculated it as 40. This is what the current logic and |
| implementation constraints of the kernel and QEMU will provide inside the |
| LOPAPR specification. |
| |
| Users are welcome to use this knowledge and experiment with the input to get |
| the NUMA topology they want, or as close to it as possible. The important thing |
| is to keep expectations in line with what we are capable of providing at this |
| moment: an approximation. |
| |
| Limitations of the implementation |
| --------------------------------- |
| |
| As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate |
| user choice. The Linux kernel, and PAPR itself, do not provide QEMU with the means |
| to fully map user input to the actual NUMA distances the guest will use. These |
| constraints create two notable limitations in our support: |
| |
| * Asymmetrical topologies aren't supported. We only support NUMA topologies where |
| the distance from node A to B is always the same as B to A. We do not support |
| any A-B pair where the distance back and forth is asymmetric. For example, the |
| following topology isn't supported and the pSeries guest will not boot with this |
| user input: |
| |
| :: |
| |
|      0    1 |
| 0   10   40 |
| 1   20   10 |
| |
| |
| * 'Non-transitive' topologies will be poorly translated to the guest. This is the |
| kind of topology where the distance from a node A to B is X, B to C is X, but |
| the distance from A to C is not X. E.g.: |
| |
| :: |
| |
|      0    1    2    3 |
| 0   10   20   20   40 |
| 1   20   10   80   40 |
| 2   20   80   10   20 |
| 3   40   40   20   10 |
| |
| In the example above, the distance from 0 to 2 is 20, from 2 to 3 is 20, but |
| from 0 to 3 is 40. The kernel will always match with the shortest associativity |
| domain possible, and we're attempting to retain the previously established |
| relations between the nodes. This means that a distance equal to 20 between |
| nodes 0 and 2 and the same distance 20 between nodes 2 and 3 will cause the |
| distance between 0 and 3 to also be 20. |
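| |
| Both limitations can be checked on a distance matrix before handing it to |
| QEMU. The sketch below is a hypothetical helper, not something QEMU provides. |
| It reads the non-transitive rule narrowly, reporting only triples where the |
| A to C distance is larger than X, since that is the case the explanation |
| above describes as being shortened in the guest. |
| |

```python
def asymmetric_pairs(matrix):
    """Pairs (a, b) whose distance differs depending on the direction."""
    n = len(matrix)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if matrix[a][b] != matrix[b][a]]


def non_transitive_triples(matrix):
    """Triples (a, b, c): a-b and b-c are both X, but a-c is larger than X."""
    n = len(matrix)
    found = []
    for b in range(n):
        for a in range(n):
            for c in range(a + 1, n):
                if b in (a, c):
                    continue
                x = matrix[a][b]
                if matrix[b][c] == x and matrix[a][c] > x:
                    found.append((a, b, c))
    return found


asymmetric = [[10, 40], [20, 10]]
non_transitive = [
    [10, 20, 20, 40],
    [20, 10, 80, 40],
    [20, 80, 10, 20],
    [40, 40, 20, 10],
]

print(asymmetric_pairs(asymmetric))                    # [(0, 1)]
print((0, 2, 3) in non_transitive_triples(non_transitive))  # True
```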
| |
| |
| Legacy (5.1 and older) pseries NUMA mechanics |
| ============================================= |
| |
| In short, we can summarize the NUMA distances seen in pseries Linux guests, using |
| QEMU up to 5.1, as follows: |
| |
| * local distance, i.e. the distance of the resource to its own NUMA node: 10 |
| * if it's an NVLink GPU device, distance: 80 |
| * every other resource, distance: 40 |
| |
| The way the pseries Linux guest calculates NUMA distances has a direct effect |
| on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is |
| the default ibm,associativity-reference-points being used in the pseries |
| machine: |
| |
| ibm,associativity-reference-points = {0x4, 0x4, 0x2} |
| |
| The first and second levels are equal, 0x4, and a third one was added in |
| commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that |
| regardless of how the ibm,associativity properties are being created in |
| the device tree, the pseries Linux guest will only recognize three scenarios |
| as far as NUMA distance goes: |
| |
| * resources that belong to the same first NUMA level have distance = 10 |
| * the second level is skipped since it's equal to the first |
| * all resources that aren't an NVLink GPU are guaranteed to belong |
| to the same third NUMA level, having distance = 40 |
| * for NVLink GPUs, distance = 80 from everything else |
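| |
| The three scenarios can be illustrated with a Python sketch of the doubling |
| logic applied to the legacy reference points {0x4, 0x4, 0x2}. The |
| associativity arrays below are hypothetical values chosen only to exercise |
| each case; they are not the arrays QEMU 5.1 actually generates. Note how the |
| repeated 0x4 entry still doubles the distance, which is why 20 never shows up. |
| |

```python
LEGACY_REF_POINTS = [4, 4, 2]  # 1-based indexes, as in QEMU up to 5.1


def numa_distance(assoc_a, assoc_b, ref_points=LEGACY_REF_POINTS,
                  local_distance=10):
    distance = local_distance
    for point in ref_points:
        if assoc_a[point - 1] == assoc_b[point - 1]:
            return distance
        distance *= 2
    return distance


# Hypothetical arrays: regular nodes share every domain except the last,
# the GPU shares no domain with them.
node0 = [0, 0, 0, 0]
node1 = [0, 0, 0, 1]
gpu = [1, 1, 1, 2]

print(numa_distance(node0, node0))  # 10
print(numa_distance(node0, node1))  # 40
print(numa_distance(node0, gpu))    # 80
```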
| |
| This also means that user input in the QEMU command line does not change the |
| NUMA distancing inside the guest for the pseries machine. |