================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model-specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in microjoules.

Thanks to KVM's `MSR filtering <msr-filter-patch_>`__ functionality,
not all MSRs are handled by KVM. Some of them can now be handled by
userspace (QEMU): a list of MSRs is given to KVM at VM creation time, and
a userspace exit occurs when they are accessed.

.. _msr-filter-patch: https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide the range
of its parameters (min power, max power, ...) and give the multiplier to apply
to the energy counter. These MSRs are populated once at start-up by reading
the host CPU MSRs and are returned to the guest 1:1 when requested.

MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
energy consumed since the last time the register was cleared. Multiplying
it by the energy unit provided above gives the energy in microjoules. The
counter only increases, faster or slower depending on the consumption of
the package, and it is expected to overflow (wrap around) at some point.
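
As an illustration, the conversion and the wrap-around handling could be
sketched as follows. This is a minimal sketch, not QEMU's implementation: it
assumes the layout described in the Intel SDM (energy status unit in bits
12:8 of MSR_RAPL_POWER_UNIT, energy counted in the low 32 bits of
MSR_PKG_ENERGY_STATUS), and all function names are hypothetical.

.. code:: C

    #include <stdint.h>

    /*
     * Assumption (Intel SDM layout, not stated in this document): bits 12:8
     * of MSR_RAPL_POWER_UNIT hold the energy status unit (ESU), and one
     * counter increment is worth 1 / 2^ESU joules.
     */
    static inline double rapl_energy_unit_uj(uint64_t power_unit_msr)
    {
        unsigned int esu = (power_unit_msr >> 8) & 0x1f;

        return 1000000.0 / (double)(1ULL << esu);
    }

    /*
     * Assumption: only the low 32 bits of MSR_PKG_ENERGY_STATUS count
     * energy, so the delta is taken modulo 2^32. This survives a single
     * wrap-around between the two reads.
     */
    static inline uint64_t rapl_counter_delta(uint64_t before, uint64_t after)
    {
        return (uint32_t)((uint32_t)after - (uint32_t)before);
    }

    /* Energy in microjoules consumed between two reads of the counter. */
    static inline double rapl_delta_uj(uint64_t before, uint64_t after,
                                       uint64_t power_unit_msr)
    {
        return (double)rapl_counter_delta(before, after) *
               rapl_energy_unit_uj(power_unit_msr);
    }

With an ESU of 14, for example, one counter increment corresponds to
1/16384 J, i.e. roughly 61 microjoules.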

Every core belonging to the same package that reads MSR_PKG_ENERGY_STATUS
(i.e. "rdmsr 0x611") retrieves the same value, which represents the energy
for the whole package. A core that belongs to PKG-0 cannot read the value of
PKG-1 and vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot the time metrics of all QEMU threads (time spent scheduled in
   userspace and in the kernel).

2. Snapshot the actual MSR_PKG_ENERGY_STATUS counter of all packages on
   which the QEMU threads are running.

3. Sleep for 1 second. During this pause the vcpu and other non-vcpu threads
   do what they have to do, and so the energy counter increases.

4. Repeat 1. and 2., then calculate the delta of every metric: the time spent
   scheduled for each QEMU thread *and* the energy spent by the packages
   during the pause.

5. Filter the vcpu threads and the non-vcpu threads.

6. Retrieve the topology of the Virtual Machine. This helps identify which
   vCPU is running on which virtual package.

7. The total energy spent by the non-vcpu threads is divided by the number
   of vcpu threads so that each vcpu thread will get an equal part of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vcpu thread.

9. Calculate the energy for each virtual package.

10. The virtual MSRs are updated for each virtual package. Each vCPU that
    belongs to the same package will return the same value when accessing
    the MSR.

11. Loop back to 1.
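
Steps 7 and 9 can be illustrated with the short sketch below. It is only an
illustration under stated assumptions: ``struct vcpu_sample`` and
``vpkg_energy_accumulate()`` are hypothetical names that do not come from the
QEMU sources, and the per-vCPU energy is assumed to have already been
computed in step 8.

.. code:: C

    #include <stddef.h>

    /* Hypothetical per-vCPU record; the names are illustrative only. */
    struct vcpu_sample {
        unsigned int vpkg_id;  /* virtual package the vCPU belongs to */
        double energy_uj;      /* energy attributed to the vCPU thread */
    };

    /*
     * Give every vCPU an equal share of the energy spent by the non-vCPU
     * threads (step 7), then sum the energy of the vCPUs of each virtual
     * package (step 9). vpkg_energy_uj[] has nr_vpkgs entries and is
     * overwritten.
     */
    static void vpkg_energy_accumulate(const struct vcpu_sample *vcpus,
                                       size_t nr_vcpus,
                                       double non_vcpu_energy_uj,
                                       double *vpkg_energy_uj,
                                       size_t nr_vpkgs)
    {
        size_t i;
        double share;

        for (i = 0; i < nr_vpkgs; i++) {
            vpkg_energy_uj[i] = 0.0;
        }
        if (nr_vcpus == 0) {
            return;
        }

        share = non_vcpu_energy_uj / (double)nr_vcpus;
        for (i = 0; i < nr_vcpus; i++) {
            /* Ignore vCPUs whose package index is out of range. */
            if (vcpus[i].vpkg_id < nr_vpkgs) {
                vpkg_energy_uj[vcpus[i].vpkg_id] += vcpus[i].energy_uj + share;
            }
        }
    }

The resulting per-package values are what step 10 would add to the virtual
MSR of each package.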

Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks. The number of clock ticks per second can be
obtained with the sysconf() library function. A typical value is 100 clock
ticks per second, so a core can run a process for at most 100 ticks per
second. If a package has 4 cores, at most 400 ticks can be scheduled on all
the cores of the package over a period of 1 second.
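
The tick rate can be queried with ``sysconf(_SC_CLK_TCK)``. The helper below
is a minimal sketch; ``max_ticks_per_second()`` is a hypothetical name used
only for illustration.

.. code:: C

    #include <unistd.h>

    /*
     * Maximum number of ticks that can be scheduled per second on a package
     * with nr_cores cores. Returns -1 if the tick rate cannot be queried.
     */
    static long max_ticks_per_second(unsigned int nr_cores)
    {
        long ticks = sysconf(_SC_CLK_TCK);

        if (ticks <= 0) {
            return -1;
        }
        return ticks * (long)nr_cores;
    }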

`/proc/[pid]/stat <stat_>`__ is a procfs file that gives the execution
time of the process whose process ID is [pid]. It reports the number
of ticks the process has been scheduled in userspace (utime) and in kernel
space (stime).

.. _stat: https://man7.org/linux/man-pages/man5/proc.5.html
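
A line of this file could be parsed as sketched below. The field positions
(utime is field 14, stime is field 15) come from proc(5);
``parse_stat_ticks()`` is a hypothetical helper and not necessarily how QEMU
reads the file.

.. code:: C

    #include <stdio.h>
    #include <string.h>

    /*
     * Extract utime and stime (fields 14 and 15 in proc(5)) from one line
     * of /proc/[pid]/stat. Returns 0 on success, -1 on malformed input.
     */
    static int parse_stat_ticks(const char *line,
                                unsigned long *utime, unsigned long *stime)
    {
        /*
         * Field 2 (comm) is enclosed in parentheses and may itself contain
         * spaces or ')', so resume parsing after the last ')'.
         */
        const char *p = strrchr(line, ')');
        int n;

        if (!p) {
            return -1;
        }

        /* Skip fields 3 to 13, then read utime and stime. */
        n = sscanf(p + 1,
                   " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
                   utime, stime);
        return n == 2 ? 0 : -1;
    }

Both values are expressed in clock ticks, so the delta between two reads can
be compared directly with the tick budget of the package.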

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.

Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100 ticks
per second per core. If a thread was scheduled for 100 ticks during one second
on this package, the thread has used 1/4 of the scheduling capacity of the
whole package. The energy attributed to the thread on this package for that
second is therefore 1/4 of the total energy spent by the package.
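
The same arithmetic, written as a small function. This is a sketch that
assumes a sampling window of exactly one second; ``thread_energy_uj()`` is a
hypothetical name.

.. code:: C

    #include <stdint.h>

    /*
     * Energy attributed to a thread over a one-second window: the share of
     * the package tick budget used by the thread, times the energy spent
     * by the package during that window.
     */
    static double thread_energy_uj(uint64_t thread_ticks,
                                   unsigned int nr_cores,
                                   unsigned int ticks_per_sec,
                                   double pkg_energy_uj)
    {
        uint64_t max_ticks = (uint64_t)nr_cores * ticks_per_sec;

        if (max_ticks == 0) {
            return 0.0;
        }
        return pkg_energy_uj * ((double)thread_ticks / (double)max_ticks);
    }

For the example above, ``thread_energy_uj(100, 4, 100, e)`` yields ``e / 4``.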

Usage
-----

Currently this feature only works on an Intel CPU that has the RAPL driver
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with ``-accel
kvm,rapl=true,rapl-helper-socket=/path/sock.sock``.

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening on.

qemu-vmsr-helper
----------------

qemu-vmsr-helper works very much like qemu-pr-helper. Instead of
making persistent reservations, qemu-vmsr-helper is there to work around the
fix for CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket communication is established between the QEMU processes that have
RAPL MSR support activated and qemu-vmsr-helper. A systemd service and socket
activation are provided in contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could
also start a separate helper if needed. All in all, the policy is left to the
user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.