..
   Copyright (c) 2015-2020 Linaro Ltd.

   This work is licensed under the terms of the GNU GPL, version 2 or
   later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes made for MTTCG system emulation have also improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all frontend/backend
(FE/BE) combinations where the host memory model is able to
accommodate the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO
is zero) and the guest has had the required work done to support this
safely (TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* it is forced by --accel tcg,thread=single
* --icount mode is enabled
* a 64 bit guest runs on a 32 bit host (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
They are used to look up the next translation block to execute and
include:

  tb_jmp_cache (per-vCPU, cache of recent jumps)
  tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fallback QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
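
As an illustration, the lockless fast path boils down to paired atomic
loads and stores on the jump cache. The following is a minimal sketch
in the spirit of the real code; the hash function, cache layout and
names are simplifications rather than QEMU's exact internals::

    /* Per-vCPU jump cache: a direct-mapped array of TB pointers that
     * the owning thread writes and an invalidating thread may clear. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define JMP_CACHE_BITS 12
    #define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

    typedef struct TranslationBlock TranslationBlock;
    static _Atomic(TranslationBlock *) jmp_cache[JMP_CACHE_SIZE];

    static inline unsigned jc_hash(uint64_t pc)
    {
        return (pc >> 2) & (JMP_CACHE_SIZE - 1);
    }

    /* Hot-path lookup: an acquire load is enough because a miss just
     * falls back to the slower QHT lookup. */
    static TranslationBlock *jc_lookup(uint64_t pc)
    {
        return atomic_load_explicit(&jmp_cache[jc_hash(pc)],
                                    memory_order_acquire);
    }

    /* Update after a successful slow-path lookup. */
    static void jc_update(uint64_t pc, TranslationBlock *tb)
    {
        atomic_store_explicit(&jmp_cache[jc_hash(pc)], tb,
                              memory_order_release);
    }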

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().
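
A minimal sketch of that locking pattern, assuming the real
mmap_lock()/mmap_unlock() pair; the lookup and generation calls are
illustrative stand-ins for the actual tb_lookup/tb_gen_code paths::

    /* Slow path after a failed TB lookup: take the lock that
     * serialises user-mode code generation, re-check, then
     * translate. */
    mmap_lock();
    tb = lookup_tb(cpu, pc);        /* illustrative re-check under lock */
    if (tb == NULL) {
        tb = generate_tb(cpu, pc);  /* stand-in for tb_gen_code() */
    }
    mmap_unlock();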

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations so generation
can start from scratch again. Some operations also force a full flush
of translations including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread
- operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
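
A sketch of how such a flush can be scheduled, assuming the real
async_safe_run_on_cpu() API; do_full_flush here is an illustrative
callback rather than QEMU's actual tb_flush() implementation::

    /* The callback runs once all vCPUs are stopped at a safe point,
     * so the shared translation buffer can be reset without racing
     * any translator thread. */
    static void do_full_flush(CPUState *cpu, run_on_cpu_data data)
    {
        /* ...invalidate all TBs and reset the code buffer... */
    }

    void request_full_flush(CPUState *cpu)
    {
        async_safe_run_on_cpu(cpu, do_full_flush, RUN_ON_CPU_NULL);
    }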

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly updated
including:

- the jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.
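
As a sketch, unlinking the incoming jumps of an invalidated block
follows this shape; qemu_spin_lock() is QEMU's real spinlock
primitive, while the list-walking helpers are simplified placeholders::

    /* Take the destination block's jmp_lock before touching the list
     * of origin blocks that jump into it, then unpatch each origin. */
    qemu_spin_lock(&tb->jmp_lock);
    for_each_incoming_jump(tb, origin) {   /* illustrative iterator */
        revert_direct_jump(origin, tb);    /* stand-in for the unpatch */
    }
    qemu_spin_unlock(&tb->jmp_lock);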

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
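
A sketch of the insertion idiom, using C11 atomics in place of QEMU's
qatomic_cmpxchg() wrapper; the node allocation and sizing are
illustrative::

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Install a new level in the tree iff the slot is still empty; if
     * we lose the race, free our node and use the winner's. */
    static void *level_get_or_alloc(_Atomic(void *) *slot, size_t n)
    {
        void *node = atomic_load_explicit(slot, memory_order_acquire);
        if (node == NULL) {
            void *fresh = calloc(n, sizeof(void *));
            void *expected = NULL;
            if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
                node = fresh;
            } else {
                free(fresh);        /* another thread inserted first */
                node = expected;
            }
        }
        return node;
    }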

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.
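
The insertion race resolves in the spirit of the following sketch,
assuming QHT's real qht_insert() semantics (it fails and returns the
existing entry if an equivalent one is already present); discard_tb()
is an illustrative clean-up helper::

    /* Two threads may translate the same guest block concurrently;
     * only the first insertion wins, the loser drops its duplicate. */
    void *existing = NULL;
    if (!qht_insert(&tb_ctx.htable, tb, hash, &existing)) {
        discard_tb(tb);         /* illustrative clean-up */
        tb = existing;          /* use the block the winner installed */
    }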

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

 - TLB Flush All/Page
   - can be across-vCPUs
   - a cross-vCPU TLB flush may need the other vCPU brought to a halt
   - the change may need to be visible to the calling vCPU immediately
 - TLB Flag Update
   - usually cross-vCPU
   - want the change to be visible as soon as possible
 - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
   - This is a per-vCPU table - by definition it can't race
   - updated by its own thread when the slow-path is forced

(Current solution)

A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by setting the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.
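
For example, a guest helper that must not continue until every vCPU
has dropped a stale mapping might use the synced variant. This is a
sketch assuming the real tlb_flush_page_all_cpus_synced() and
env_cpu() helpers; the helper name itself is hypothetical::

    /* Queue the page flush on all vCPUs as safe work; the calling
     * vCPU exits its run loop, so every flush has completed before it
     * resumes execution. */
    void helper_sync_page_flush(CPUArchState *env, uint64_t addr)
    {
        tlb_flush_page_all_cpus_synced(env_cpu(env), addr);
    }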

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently thanks to KVM work any access to IO memory is automatically
protected by the BQL (Big QEMU Lock). Any IO region that doesn't use
the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL is shared across the system we push the use of the lock as
far down into the TCG code as possible to minimise contention.
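
A sketch of the pattern for such a helper, assuming current QEMU's
bql_lock()/bql_unlock() names (older trees call this
qemu_mutex_lock_iothread()); the register name and the state update
are illustrative::

    /* Writes to a model-specific register that can affect other vCPUs
     * are serialised behind the BQL. */
    void helper_msr_write(CPUArchState *env, uint64_t val)
    {
        bql_lock();
        update_shared_hw_state(env, val);   /* illustrative update */
        bql_unlock();
    }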

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or be restricted to just loads or just
stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required
(see the sketch after this list):

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips
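
As a sketch, a front end translating a guest full fence emits a
barrier op; tcg_gen_mb() and the TCG_MO_ALL/TCG_BAR_SC flags are the
real API, while the surrounding decode context is elided::

    /* Translating e.g. an x86 MFENCE or an Arm DMB SY: order all
     * prior loads/stores against all later ones. Outside a parallel
     * context the op can be elided or relaxed by the optimiser. */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);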

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where this
may be a problem - typically they present a locking ABI which assumes
cmpxchg-like semantics.
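
A sketch of the emulation strategy, broadly in the spirit of the Arm
front end: record the address and value at the exclusive load, then
use a cmpxchg-style atomic at the exclusive store. The TCGv plumbing
and success-flag handling are heavily simplified here::

    /* Exclusive load: remember the address and the value we saw. */
    tcg_gen_qemu_ld_i32(cpu_exclusive_val, addr, mmu_idx, MO_UL);
    tcg_gen_mov_i64(cpu_exclusive_addr, addr);

    /* Exclusive store: succeed only if memory still holds the value
     * the ldrex observed - an ABA-susceptible but practical
     * approximation of the architectural monitor. */
    tcg_gen_atomic_cmpxchg_i32(result, addr, cpu_exclusive_val,
                               new_val, mmu_idx, MO_UL);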

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.
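
The serialised fallback looks roughly like this, assuming the real
start_exclusive()/end_exclusive() pair used on the EXCP_ATOMIC path;
the single-step call is a stand-in for what cpu_exec_step_atomic()
does internally::

    /* Bring every other vCPU to a halt, execute the one instruction
     * that could not be done with host atomics, then release them. */
    start_exclusive();
    emulate_one_insn(cpu);     /* illustrative stand-in */
    end_exclusive();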

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.