..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes done for MTTCG system emulation have improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).
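
As an illustration, a minimal sketch of how such a check might look
(the helper below is hypothetical; the real decision is made during
accelerator initialisation, but it relies on the same configuration
symbols)::

    /* Hypothetical sketch: can this FE/BE combination run MTTCG? */
    static bool mttcg_supported(void)
    {
        /* The guest must not require ordering guarantees that the
         * host memory model cannot provide natively. */
        if (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO) {
            return false;
        }
    #ifdef TARGET_SUPPORTS_MTTCG
        return true;    /* front-end audited for MTTCG safety */
    #else
        return false;
    #endif
    }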

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

- tb_jmp_cache (per-vCPU, cache of recent jumps)
- tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance as looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads while minimising any lock contention.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
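
The shape of that hot path, reduced to a sketch (the helper and field
names ending in _sketch are illustrative simplifications, not the
exact API; qatomic_read()/qatomic_set() are the real primitives)::

    /* Illustrative sketch of the lockless TB lookup on the hot path. */
    static TranslationBlock *tb_lookup_sketch(CPUState *cpu, vaddr pc,
                                              uint32_t hash)
    {
        /* Per-vCPU cache: read the entry atomically so a concurrent
         * writer can never be observed halfway through an update. */
        TranslationBlock **slot = jmp_cache_slot_sketch(cpu, hash);
        TranslationBlock *tb = qatomic_read(slot);

        if (tb && tb_matches_sketch(tb, pc)) {
            return tb;
        }
        /* Miss: fall back to the global QHT hash table, which is
         * designed for concurrent lockless lookups. Only if that also
         * misses do we take a lock and translate new code. */
        tb = tb_htable_lookup_sketch(cpu, pc);
        if (tb) {
            qatomic_set(slot, tb);
        }
        return tb;
    }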

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().
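
In outline, any user-mode path that may need to translate brackets the
work with that lock (mmap_lock()/mmap_unlock() are the real
primitives; the translation entry point named below is a stand-in)::

    /* Sketch: user-mode code generation serialised by mmap_lock().
     * The lock also covers patching of the generated code and any
     * updates to shared lookup structures made during translation. */
    mmap_lock();
    tb = tb_gen_code_sketch(cpu, pc, cs_base, flags, cflags);
    mmap_unlock();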

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations so generation
can start from scratch again. Some operations also force a full flush
of translations, including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread
- operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
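
A minimal sketch of how such a flush is queued (the callback and
counter helpers are hypothetical stand-ins; async_safe_run_on_cpu()
and RUN_ON_CPU_HOST_INT() are the real interfaces)::

    /* Runs only once every vCPU has reached a quiescent point. */
    static void do_tb_flush_sketch(CPUState *cpu, run_on_cpu_data data)
    {
        /* All vCPUs are halted here, so the shared code buffer and
         * lookup structures can be reset without racing. */
        tb_flush_buffer_sketch(data.host_int);
    }

    static void queue_tb_flush_sketch(CPUState *cpu)
    {
        async_safe_run_on_cpu(cpu, do_tb_flush_sketch,
                              RUN_ON_CPU_HOST_INT(flush_count_sketch()));
    }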

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of
other book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to a
now-invalid block need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly
updated, including:

- the jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired
before iterating over a jump list.
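
Schematically the unlink path looks like this (jmp_lock is the real
QemuSpin in the TranslationBlock; the list-walking helpers ending in
_sketch are illustrative)::

    /* Sketch: revert every direct jump that targets 'tb', under the
     * destination block's jmp_lock so concurrent patching or list
     * modification cannot race with the walk. */
    qemu_spin_lock(&tb->jmp_lock);
    for (jmp = jmp_list_first_sketch(tb); jmp;
         jmp = jmp_list_next_sketch(jmp)) {
        tb_reset_jump_sketch(jmp);  /* make the origin exit to C code */
    }
    qemu_spin_unlock(&tb->jmp_lock);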

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
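
The insertion idiom is roughly as follows (qatomic_cmpxchg() is the
real primitive; the surrounding variable names are illustrative)::

    /* Sketch: lockless insertion of a new node into the l1_map radix
     * tree. If the slot is still NULL we try to install our freshly
     * allocated node; if another thread won the race we discard ours
     * and adopt the existing one. */
    void *existing = qatomic_cmpxchg(slot, NULL, new_node);
    if (existing) {
        g_free(new_node);
        new_node = existing;
    }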

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so that no inconsistent
state is seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

- TLB Flush All/Page
  - can be across-vCPUs
  - cross vCPU TLB flush may need other vCPU brought to halt
  - change may need to be visible to the calling vCPU immediately
- TLB Flag Update
  - usually cross-vCPU
  - want change to be visible as soon as possible
- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
  - This is a per-vCPU table - by definition can't race
  - updated by its own thread when the slow-path is forced

(Current solution)

A new set of TLB flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by queuing the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.
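
For example, a target helper emulating a broadcast TLB-invalidate
instruction could use the synced variant (tlb_flush_all_cpus_synced()
and env_cpu() are the real interfaces; the helper itself is a
hypothetical sketch)::

    /* Sketch: broadcast TLB invalidate from a guest helper. The
     * _synced variant queues the flush on every vCPU as safe work and
     * forces the calling vCPU out of its run loop, so when execution
     * resumes every vCPU has completed the flush. */
    void helper_tlbi_all_sketch(CPUArchState *env)
    {
        tlb_flush_all_cpus_synced(env_cpu(env));
    }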

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not strictly a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is
automatically protected by the BQL (Big QEMU Lock). Any IO region that
doesn't use the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model-specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.
Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or be restricted to just loads or just
stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
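
The classic flag/payload pattern, using QEMU's barrier wrappers
(smp_wmb(), smp_rmb(), qatomic_set() and qatomic_read() are the real
primitives; the variables are illustrative)::

    /* Producer: publish the payload before raising the flag. */
    payload = compute_payload();
    smp_wmb();                  /* payload store ordered before flag */
    qatomic_set(&ready, true);

    /* Consumer: only trust the payload after observing the flag. */
    if (qatomic_read(&ready)) {
        smp_rmb();              /* flag load ordered before payload */
        consume_payload(payload);
    }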

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.
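
Such an op exists today as tcg_gen_mb() (see the current solution
below); a front-end lowers a guest fence instruction by emitting it
with the required ordering and strength, roughly as follows
(tcg_gen_mb() and the TCG_MO_ALL and TCG_BAR_SC flags are the real
API; the wrapper is illustrative)::

    /* Sketch: lowering a guest full-fence instruction. Backends that
     * already provide the required ordering can make this a no-op. */
    static void gen_full_fence_sketch(void)
    {
        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
    }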

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge consecutive barriers into
the strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required, and so far
the following front-ends have been updated to do the same:

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While these are susceptible to the ABA
problem, so far common guests have not implemented patterns where this
may be a problem, typically presenting a locking ABI which assumes
cmpxchg-like semantics.
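
As an illustration, a store-exclusive can be lowered onto the atomic
cmpxchg op. This is a heavily simplified sketch of the approach; the
real Arm front-end also tracks the exclusive address and handles the
failure paths, and the variable names here are illustrative::

    /* Sketch: model strex with cmpxchg. 'expected' holds the value
     * read by the earlier ldrex; the store only takes effect if
     * memory still contains that value. Note this is exactly where
     * the ABA window lives: a location changed and changed back
     * between ldrex and strex still compares equal. */
    TCGv_i32 old = tcg_temp_new_i32();

    tcg_gen_atomic_cmpxchg_i32(old, addr, expected, newval,
                               mmu_idx, MO_UL);
    /* result is 0 on success, 1 on failure, as Arm requires */
    tcg_gen_setcond_i32(TCG_COND_NE, result, old, expected);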

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.