..
   Copyright (c) 2015-2020 Linaro Ltd.

   This work is licensed under the terms of the GNU GPL, version 2 or
   later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes made for MTTCG system emulation have also improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all frontend/backend
(FE/BE) combinations where the host memory model is able to
accommodate the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO
is zero) and the guest has had the required work done to support this
safely (TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* it is forced by --accel tcg,thread=single
* --icount mode is enabled
* a 64 bit guest runs on a 32 bit host (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
They are used to look up the next translation block to execute and
include:

  tb_jmp_cache (per-vCPU, cache of recent jumps)
  tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fallback QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
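
As an illustration, the lockless fast path boils down to paired atomic
loads and stores on the jump cache. The following is a minimal sketch
in the spirit of the real code; the hash function, cache layout and
names are simplifications rather than QEMU's exact internals::

    /* Per-vCPU jump cache: a direct-mapped array of TB pointers that
     * the owning thread writes and an invalidating thread may clear. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define JMP_CACHE_BITS 12
    #define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

    typedef struct TranslationBlock TranslationBlock;
    static _Atomic(TranslationBlock *) jmp_cache[JMP_CACHE_SIZE];

    static inline unsigned jc_hash(uint64_t pc)
    {
        return (pc >> 2) & (JMP_CACHE_SIZE - 1);
    }

    /* Hot-path lookup: an acquire load is enough because a miss just
     * falls back to the slower QHT lookup. */
    static TranslationBlock *jc_lookup(uint64_t pc)
    {
        return atomic_load_explicit(&jmp_cache[jc_hash(pc)],
                                    memory_order_acquire);
    }

    /* Update after a successful slow-path lookup. */
    static void jc_update(uint64_t pc, TranslationBlock *tb)
    {
        atomic_store_explicit(&jmp_cache[jc_hash(pc)], tb,
                              memory_order_release);
    }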

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().
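
A minimal sketch of that locking pattern, assuming the real
mmap_lock()/mmap_unlock() pair; the lookup and generation calls are
illustrative stand-ins for the actual tb_lookup/tb_gen_code paths::

    /* Slow path after a failed TB lookup: take the lock that
     * serialises user-mode code generation, re-check, then
     * translate. */
    mmap_lock();
    tb = lookup_tb(cpu, pc);        /* illustrative re-check under lock */
    if (tb == NULL) {
        tb = generate_tb(cpu, pc);  /* stand-in for tb_gen_code() */
    }
    mmap_unlock();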

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations so generation
can start from scratch again. Some operations also force a full flush
of translations including:

- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread
- operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
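
A sketch of how such a flush can be scheduled, assuming the real
async_safe_run_on_cpu() API; do_full_flush here is an illustrative
callback rather than QEMU's actual tb_flush() implementation::

    /* The callback runs once all vCPUs are stopped at a safe point,
     * so the shared translation buffer can be reset without racing
     * any translator thread. */
    static void do_full_flush(CPUState *cpu, run_on_cpu_data data)
    {
        /* ...invalidate all TBs and reset the code buffer... */
    }

    void request_full_flush(CPUState *cpu)
    {
        async_safe_run_on_cpu(cpu, do_full_flush, RUN_ON_CPU_NULL);
    }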

More granular translation invalidation events are typically due
to a change of the state of a physical page:

- code modification (self-modifying code, patching code)
- page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly updated
including:

- the jump lookup cache
- the physical-to-tb lookup hash table
- the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.
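
As a sketch, unlinking the incoming jumps of an invalidated block
follows this shape; qemu_spin_lock() is QEMU's real spinlock
primitive, while the list-walking helpers are simplified placeholders::

    /* Take the destination block's jmp_lock before touching the list
     * of origin blocks that jump into it, then unpatch each origin. */
    qemu_spin_lock(&tb->jmp_lock);
    for_each_incoming_jump(tb, origin) {   /* illustrative iterator */
        revert_direct_jump(origin, tb);    /* stand-in for the unpatch */
    }
    qemu_spin_unlock(&tb->jmp_lock);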

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
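
A sketch of the insertion idiom, using C11 atomics in place of QEMU's
qatomic_cmpxchg() wrapper; the node allocation and sizing are
illustrative::

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Install a new level in the tree iff the slot is still empty; if
     * we lose the race, free our node and use the winner's. */
    static void *level_get_or_alloc(_Atomic(void *) *slot, size_t n)
    {
        void *node = atomic_load_explicit(slot, memory_order_acquire);
        if (node == NULL) {
            void *fresh = calloc(n, sizeof(void *));
            void *expected = NULL;
            if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
                node = fresh;
            } else {
                free(fresh);        /* another thread inserted first */
                node = expected;
            }
        }
        return node;
    }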

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.
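
The insertion race resolves in the spirit of the following sketch,
assuming QHT's real qht_insert() semantics (it fails and returns the
existing entry if an equivalent one is already present); discard_tb()
is an illustrative clean-up helper::

    /* Two threads may translate the same guest block concurrently;
     * only the first insertion wins, the loser drops its duplicate. */
    void *existing = NULL;
    if (!qht_insert(&tb_ctx.htable, tb, hash, &existing)) {
        discard_tb(tb);         /* illustrative clean-up */
        tb = existing;          /* use the block the winner installed */
    }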

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

- Memory regions (dividing up access to PIO, MMIO and RAM)
- Dirty page tracking (for code gen, SMC detection, migration and display)
- Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

 - TLB Flush All/Page
   - can be across-vCPUs
   - a cross-vCPU TLB flush may need the other vCPU brought to a halt
   - the change may need to be visible to the calling vCPU immediately
 - TLB Flag Update
   - usually cross-vCPU
   - want the change to be visible as soon as possible
 - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
   - This is a per-vCPU table - by definition it can't race
   - updated by its own thread when the slow-path is forced

(Current solution)

A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by setting the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.
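
For example, a guest helper that must not continue until every vCPU
has dropped a stale mapping might use the synced variant. This is a
sketch assuming the real tlb_flush_page_all_cpus_synced() and
env_cpu() helpers; the helper name itself is hypothetical::

    /* Queue the page flush on all vCPUs as safe work; the calling
     * vCPU exits its run loop, so every flush has completed before it
     * resumes execution. */
    void helper_sync_page_flush(CPUArchState *env, uint64_t addr)
    {
        tlb_flush_page_all_cpus_synced(env_cpu(env), addr);
    }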

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently thanks to KVM work any access to IO memory is automatically
protected by the BQL (Big QEMU Lock). Any IO region that doesn't use
the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL is shared across the system we push the use of the lock as
far down into the TCG code as possible to minimise contention.
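
A sketch of the pattern for such a helper, assuming current QEMU's
bql_lock()/bql_unlock() names (older trees call this
qemu_mutex_lock_iothread()); the register name and the state update
are illustrative::

    /* Writes to a model-specific register that can affect other vCPUs
     * are serialised behind the BQL. */
    void helper_msr_write(CPUArchState *env, uint64_t val)
    {
        bql_lock();
        update_shared_hw_state(env, val);   /* illustrative update */
        bql_unlock();
    }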

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or be restricted to just loads or just
stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required
(see the sketch after this list):

- target-i386
- target-arm
- target-aarch64
- target-alpha
- target-mips
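
As a sketch, a front end translating a guest full fence emits a
barrier op; tcg_gen_mb() and the TCG_MO_ALL/TCG_BAR_SC flags are the
real API, while the surrounding decode context is elided::

    /* Translating e.g. an x86 MFENCE or an Arm DMB SY: order all
     * prior loads/stores against all later ones. Outside a parallel
     * context the op can be elided or relaxed by the optimiser. */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);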

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where this
may be a problem - typically they present a locking ABI which assumes
cmpxchg-like semantics.
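
A sketch of the emulation strategy, broadly in the spirit of the Arm
front end: record the address and value at the exclusive load, then
use a cmpxchg-style atomic at the exclusive store. The TCGv plumbing
and success-flag handling are heavily simplified here::

    /* Exclusive load: remember the address and the value we saw. */
    tcg_gen_qemu_ld_i32(cpu_exclusive_val, addr, mmu_idx, MO_UL);
    tcg_gen_mov_i64(cpu_exclusive_addr, addr);

    /* Exclusive store: succeed only if memory still holds the value
     * the ldrex observed - an ABA-susceptible but practical
     * approximation of the architectural monitor. */
    tcg_gen_atomic_cmpxchg_i32(result, addr, cpu_exclusive_val,
                               new_val, mmu_idx, MO_UL);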

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.
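
The serialised fallback looks roughly like this, assuming the real
start_exclusive()/end_exclusive() pair used on the EXCP_ATOMIC path;
the single-step call is a stand-in for what cpu_exec_step_atomic()
does internally::

    /* Bring every other vCPU to a halt, execute the one instruction
     * that could not be done with host atomics, then release them. */
    start_exclusive();
    emulate_one_insn(cpu);     /* illustrative stand-in */
    end_exclusive();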

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.