| .. _tcg_internals: | 
 |  | 
 | ==================== | 
 | Translator Internals | 
 | ==================== | 
 |  | 
 | QEMU is a dynamic translator. When it first encounters a piece of code, | 
 | it converts it to the host instruction set. Dynamic translators are |
 | usually very complicated and highly CPU dependent. QEMU uses some tricks |
 | that make it relatively portable and simple while achieving good |
 | performance. |
 |  | 
 | QEMU's dynamic translation backend is called TCG, for "Tiny Code | 
 | Generator". For more information, please take a look at :ref:`tcg-ops-ref`. | 
 |  | 
 | The following sections outline some notable features and implementation | 
 | details of QEMU's dynamic translator. | 
 |  | 
 | CPU state optimisations | 
 | ----------------------- | 
 |  | 
 | The target CPUs have many internal states which change the way they |
 | evaluate instructions. To achieve good speed, the translation phase |
 | assumes that some state information of the virtual CPU cannot change |
 | within a Translation Block (TB), and records that state in the TB. If |
 | the state changes (e.g. privilege level), a new TB will be generated, |
 | and the previous TB won't be used again until the CPU state once more |
 | matches the state recorded in it. The same idea can be applied to other |
 | aspects of the CPU state. For example, on x86, if the SS, DS and ES |
 | segments have a zero base, then the translator does not even generate |
 | an addition for the segment base. |
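
As a minimal sketch of this idea, the toy C model below keys its translation
cache on both the guest PC and a word of assumed-constant state flags. The
``ToyTB`` type and the ``toy_tb_*`` functions are hypothetical names invented
for this illustration; they are not QEMU's data structures or API.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical, simplified TB record (not QEMU's TranslationBlock). */
  typedef struct ToyTB {
      uint64_t pc;     /* guest PC of the first instruction */
      uint32_t flags;  /* CPU state assumed constant during translation */
  } ToyTB;

  #define TOY_CACHE_SIZE 16

  static ToyTB toy_cache[TOY_CACHE_SIZE];
  static size_t toy_cache_used;

  /* Return a TB translated for exactly this (pc, flags) pair, or NULL. */
  static ToyTB *toy_tb_find(uint64_t pc, uint32_t flags)
  {
      for (size_t i = 0; i < toy_cache_used; i++) {
          if (toy_cache[i].pc == pc && toy_cache[i].flags == flags) {
              return &toy_cache[i];
          }
      }
      return NULL;
  }

  /*
   * Look up a TB, "translating" a new one when no entry matches the
   * current state. Returns NULL only when the toy cache is full.
   */
  static ToyTB *toy_tb_get(uint64_t pc, uint32_t flags)
  {
      ToyTB *tb = toy_tb_find(pc, flags);
      if (tb == NULL && toy_cache_used < TOY_CACHE_SIZE) {
          tb = &toy_cache[toy_cache_used++];
          tb->pc = pc;
          tb->flags = flags;
      }
      return tb;
  }

Calling ``toy_tb_get(0x1000, 3)`` and then ``toy_tb_get(0x1000, 0)`` yields two
distinct entries for the same PC. The first entry is found again, rather than
retranslated, once the flags return to 3.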
 |  | 
 | Direct block chaining | 
 | --------------------- | 
 |  | 
 | After each translated basic block is executed, QEMU uses the simulated | 
 | Program Counter (PC) and other CPU state information (such as the CS | 
 | segment base value) to find the next basic block. | 
 |  | 
 | In its simplest, less optimized form, this is done by exiting from the | 
 | current TB, going through the TB epilogue, and then back to the | 
 | main loop. That’s where QEMU looks for the next TB to execute, | 
 | translating it from the guest architecture if it isn’t already available | 
 | in memory. Then QEMU proceeds to execute this next TB, starting at the | 
 | prologue and then moving on to the translated instructions. | 
 |  | 
 | Exiting from the TB this way will cause the ``cpu_exec_interrupt()`` | 
 | callback to be re-evaluated before executing additional instructions. | 
 | It is mandatory to exit this way after any CPU state changes that may | 
 | unmask interrupts. | 
 |  | 
 | In order to accelerate the cases where the TB for the new | 
 | simulated PC is already available, QEMU has mechanisms that allow | 
 | multiple TBs to be chained directly, without having to go back to the | 
 | main loop as described above. These mechanisms are: | 
 |  | 
 | ``lookup_and_goto_ptr`` | 
 | ^^^^^^^^^^^^^^^^^^^^^^^ | 
 |  | 
 | Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to | 
 | ``helper_lookup_tb_ptr``. This helper will look for an existing TB that | 
 | matches the current CPU state. If the destination TB is available, its |
 | code address is returned; otherwise, the address of the JIT epilogue is |
 | returned. The call to the helper is always followed by the TCG ``goto_ptr`` |
 | opcode, which branches to the returned address. In this way, we either |
 | branch to the next TB or return to the main loop. | 
 |  | 
 | ``goto_tb + exit_tb`` | 
 | ^^^^^^^^^^^^^^^^^^^^^ | 
 |  | 
 | The translation code usually implements branching by performing the | 
 | following steps: | 
 |  | 
 | 1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1) | 
 |    as a parameter. | 
 |  | 
 | 2. Emit TCG instructions to update the CPU state with any information | 
 |    that has been assumed constant and is required by the main loop to | 
 |    correctly locate and execute the next TB. For most guests, this is | 
 |    just the PC of the branch destination, but others may store additional | 
 |    data. The information updated in this step must be inferable from both | 
 |    ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``. | 
 |  | 
 | 3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and | 
 |    the jump slot index again. | 
 |  | 
 | Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG | 
 | instruction that later on gets translated to a jump to an address | 
 | associated with the specified jump slot. Initially, this is the address | 
 | of step 2's instructions, which update the CPU state information. Step 3, | 
 | ``tcg_gen_exit_tb()``, exits from the current TB returning a tagged | 
 | pointer composed of the last executed TB’s address and the jump slot | 
 | index. | 
 |  | 
 | The first time this whole sequence is executed, step 1 simply jumps | 
 | to step 2. Then the CPU state information gets updated and we exit from | 
 | the current TB. As a result, the behavior is very similar to the less | 
 | optimized form described earlier in this section. | 
 |  | 
 | Next, the main loop looks for the next TB to execute using the | 
 | current CPU state information (creating the TB if it wasn’t already | 
 | available) and, before starting to execute the new TB’s instructions, | 
 | patches the previously executed TB by associating one of its jump | 
 | slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the | 
 | address of the new TB. | 
 |  | 
 | The next time this previous TB is executed and we get to that same | 
 | ``goto_tb`` step, it will already be patched (assuming the destination TB | 
 | is still in memory) and will jump directly to the first instruction of | 
 | the destination TB, without going back to the main loop. | 
 |  | 
 | For the ``goto_tb + exit_tb`` mechanism to be used, the following | 
 | conditions need to be satisfied: | 
 |  | 
 | * The change in CPU state must be constant, e.g., a direct branch and | 
 |   not an indirect branch. | 
 |  | 
 | * The direct branch cannot cross a page boundary. Memory mappings | 
 |   may change, causing the code at the destination address to change. | 
 |  | 
 | Note that, in step 3 (``tcg_gen_exit_tb()``), in addition to the |
 | jump slot index, the address of the TB just executed is also returned. |
 | This address corresponds to the TB that will be patched; it may be |
 | different from the one that was directly executed from the main loop |
 | if the latter had already been chained to other TBs. |
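
The patching sequence described above can be sketched with a small simulation.
This is a toy model under stated assumptions, not QEMU code: ``ToyTB``,
``toy_tb_exec()``, ``toy_main_loop_step()`` and ``TOY_SLOT_MASK`` are
hypothetical names, the ``successor`` field stands in for the real TB lookup,
and the chain of TBs is assumed to be acyclic.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TOY_SLOT_MASK 3u  /* low pointer bits assumed free for the tag */

  typedef struct ToyTB {
      struct ToyTB *jmp_dest[2];  /* NULL means "slot not patched yet" */
      int next_slot;              /* slot this toy TB leaves through */
      struct ToyTB *successor;    /* stand-in for the main loop's lookup */
  } ToyTB;

  /* Run from tb, following patched slots; return the last TB, tagged. */
  static uintptr_t toy_tb_exec(ToyTB *tb, int *executed)
  {
      for (;;) {
          (*executed)++;
          ToyTB *next = tb->jmp_dest[tb->next_slot];
          if (next == NULL) {
              /* Unpatched goto_tb: fall through to exit_tb. */
              return (uintptr_t)tb | (uintptr_t)tb->next_slot;
          }
          tb = next;  /* patched goto_tb: jump without the main loop */
      }
  }

  /* One main-loop iteration: run, then patch the slot we exited through. */
  static ToyTB *toy_main_loop_step(ToyTB *tb, int *executed)
  {
      uintptr_t ret = toy_tb_exec(tb, executed);
      ToyTB *last = (ToyTB *)(ret & ~(uintptr_t)TOY_SLOT_MASK);
      int slot = (int)(ret & TOY_SLOT_MASK);

      assert(slot == 0 || slot == 1);
      ToyTB *next = last->successor;
      if (next != NULL) {
          last->jmp_dest[slot] = next;  /* chain last -> next */
      }
      return next;
  }

With three TBs ``a -> b -> c``, the first two calls to
``toy_main_loop_step()`` execute one TB each and patch one slot each. A later
call starting at ``a`` runs all three TBs before returning, and the TB decoded
from the tagged pointer is ``c`` rather than ``a``.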
 |  | 
 | Self-modifying code and translated code invalidation | 
 | ---------------------------------------------------- | 
 |  | 
 | Self-modifying code is a special challenge in x86 emulation because no | 
 | instruction cache invalidation is signaled by the application when code | 
 | is modified. | 
 |  | 
 | User-mode emulation marks a host page as write-protected (if it is |
 | not already read-only) every time translated code is generated for a |
 | basic block.  Then, if a write access is made to the page, Linux raises |
 | a SIGSEGV signal. QEMU then invalidates all the translated code in the page |
 | and enables write accesses to the page.  For system emulation, write |
 | protection is achieved through the software MMU. |
 |  | 
 | Correct translated code invalidation is done efficiently by maintaining | 
 | a linked list of every translated block contained in a given page. Other | 
 | linked lists are also maintained to undo direct block chaining. | 
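
A per-page list of this kind might look like the following sketch. It is a
simplified illustration, not QEMU's implementation: the ``ToyTB`` type, the
``toy_*`` functions and the fixed-size page table are hypothetical, and the
model ignores TBs that span two pages as well as the undoing of chained jumps.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TOY_PAGE_BITS 12
  #define TOY_NUM_PAGES 64

  typedef struct ToyTB {
      uint64_t pc;              /* guest address the TB was translated from */
      bool valid;
      struct ToyTB *page_next;  /* next TB translated from the same page */
  } ToyTB;

  /* Head of the TB list for each (toy) guest page. */
  static ToyTB *toy_page_tbs[TOY_NUM_PAGES];

  static size_t toy_page_index(uint64_t addr)
  {
      return (size_t)((addr >> TOY_PAGE_BITS) % TOY_NUM_PAGES);
  }

  /* Register a freshly translated TB on the list of its page. */
  static void toy_tb_link_page(ToyTB *tb)
  {
      size_t idx = toy_page_index(tb->pc);

      tb->valid = true;
      tb->page_next = toy_page_tbs[idx];
      toy_page_tbs[idx] = tb;
  }

  /* Handle a write to addr: drop every TB on that page; return the count. */
  static int toy_invalidate_page(uint64_t addr)
  {
      size_t idx = toy_page_index(addr);
      int n = 0;

      for (ToyTB *tb = toy_page_tbs[idx]; tb != NULL; ) {
          ToyTB *next = tb->page_next;
          tb->valid = false;
          tb->page_next = NULL;
          tb = next;
          n++;
      }
      toy_page_tbs[idx] = NULL;
      return n;
  }

The point of the list is that a write only costs a walk over the TBs of the
written page, not a scan of every TB in the cache.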
 |  | 
 | On RISC targets, correctly written software uses memory barriers and | 
 | cache flushes, so some of the protection above would not be | 
 | necessary. However, QEMU still requires that the generated code always | 
 | matches the target instructions in memory in order to handle | 
 | exceptions correctly. | 
 |  | 
 | Exception support | 
 | ----------------- | 
 |  | 
 | ``longjmp()`` is used when an exception such as division by zero is |
 | encountered. |
 |  |
 | The host SIGSEGV and SIGBUS signal handlers are used to catch invalid |
 | memory accesses.  QEMU keeps a map from host program counter to | 
 | target program counter, and looks up where the exception happened | 
 | based on the host program counter at the exception point. | 
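
To illustrate the lookup, the sketch below maps a faulting host code offset
back to a guest PC using a sorted table and a binary search. The
``ToyInsnStart`` record and ``toy_restore_pc()`` are hypothetical, and the
table layout is an assumption chosen for clarity; it does not describe how QEMU
actually encodes or searches this information.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical record: host code offset where one guest insn starts. */
  typedef struct ToyInsnStart {
      uintptr_t host_off;  /* offset into the TB's host code */
      uint64_t guest_pc;   /* guest PC of the instruction emitted there */
  } ToyInsnStart;

  /*
   * Find the guest PC of the instruction whose host code contains
   * host_off. The table must be sorted by host_off. Returns 0 on
   * success, or -1 if host_off lies before the first entry.
   */
  static int toy_restore_pc(const ToyInsnStart *tab, size_t n,
                            uintptr_t host_off, uint64_t *guest_pc)
  {
      size_t lo = 0, hi = n;  /* search window is [lo, hi) */

      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (tab[mid].host_off <= host_off) {
              lo = mid + 1;
          } else {
              hi = mid;
          }
      }
      if (lo == 0) {
          return -1;
      }
      /* lo is now the number of entries starting at or before host_off. */
      *guest_pc = tab[lo - 1].guest_pc;
      return 0;
  }

A signal handler in this model would subtract the start of the TB's host code
from the faulting host program counter and pass the result as ``host_off``.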
 |  | 
 | On some targets, some bits of the virtual CPU's state are not flushed to |
 | memory until the end of the translation block.  This is done for internal |
 | emulation state that is rarely accessed directly by the program and/or changes |
 | very often throughout the execution of a translation block. Examples include |
 | condition codes on x86, delay slots on SPARC, and conditional execution on |
 | Arm.  This state is stored for each target instruction, and |
 | looked up on exceptions. |
 |  | 
 | MMU emulation | 
 | ------------- | 
 |  | 
 | For system emulation, QEMU uses a software MMU. In that mode, the MMU |
 | virtual-to-physical address translation is done at every memory |
 | access. |
 |  | 
 | QEMU uses an address translation cache (TLB) to speed up the translation. | 
 | In order to avoid flushing the translated code each time the MMU | 
 | mappings change, all caches in QEMU are physically indexed.  This | 
 | means that each basic block is indexed with its physical address. | 
 |  | 
 | In order to avoid invalidating the basic block chain when MMU mappings | 
 | change, chaining is only performed when the destination of the jump | 
 | shares a page with the basic block that is performing the jump. | 
 |  | 
 | The MMU can also distinguish RAM and ROM memory areas from MMIO memory | 
 | areas.  Access is faster for RAM and ROM because the translation cache also | 
 | hosts the offset between guest address and host memory.  Accessing MMIO | 
 | memory areas instead calls out to C code for device emulation. | 
 | Finally, the MMU helps track dirty pages and pages pointed to by |
 | translation blocks. |
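
The RAM fast path can be sketched as a direct-mapped cache whose entries store
the offset between guest and host addresses. This is only an illustration
under simplifying assumptions: the ``ToyTLBEntry`` layout, the ``toy_*``
functions and the table size are hypothetical and do not reflect QEMU's actual
TLB structures.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define TOY_PAGE_BITS 12
  #define TOY_PAGE_MASK (~(((uint64_t)1 << TOY_PAGE_BITS) - 1))
  #define TOY_TLB_SIZE 256

  typedef struct ToyTLBEntry {
      uint64_t vpage;    /* page-aligned guest virtual address */
      uintptr_t addend;  /* host address minus guest address (RAM only) */
      bool valid;
      bool is_ram;
  } ToyTLBEntry;

  static ToyTLBEntry toy_tlb[TOY_TLB_SIZE];

  /* Direct-mapped: the guest page number selects exactly one entry. */
  static ToyTLBEntry *toy_tlb_slot(uint64_t vaddr)
  {
      return &toy_tlb[(vaddr >> TOY_PAGE_BITS) % TOY_TLB_SIZE];
  }

  /* Record that the guest page at vpage is backed by host_page. */
  static void toy_tlb_fill_ram(uint64_t vpage, uint8_t *host_page)
  {
      ToyTLBEntry *e = toy_tlb_slot(vpage);

      e->vpage = vpage & TOY_PAGE_MASK;
      e->addend = (uintptr_t)host_page - (uintptr_t)e->vpage;
      e->valid = true;
      e->is_ram = true;
  }

  /*
   * Load one byte through the fast path. Returns false when the slow
   * path would be needed (TLB miss, or a page that is not RAM).
   */
  static bool toy_ldub_fast(uint64_t vaddr, uint8_t *val)
  {
      ToyTLBEntry *e = toy_tlb_slot(vaddr);

      if (!e->valid || e->vpage != (vaddr & TOY_PAGE_MASK) || !e->is_ram) {
          return false;
      }
      *val = *(uint8_t *)((uintptr_t)vaddr + e->addend);
      return true;
  }

On a hit, the load costs one comparison and one addition. Anything else,
including MMIO, would fall back to a slower path that this sketch leaves out.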
 |  | 
 | Profiling JITted code | 
 | --------------------- | 
 |  | 
 | The Linux ``perf`` tool treats all JITted code as a single block because, |
 | unlike for the main code, it can't use debug information to link individual |
 | program counter samples with larger functions. To overcome this |
 | limitation, you can use the ``-perfmap`` or the ``-jitdump`` option to generate |
 | map files. ``-perfmap`` is lightweight and produces only guest-host mappings. |
 | ``-jitdump`` additionally saves JITted code and guest debug information (if |
 | available); its output needs to be integrated with the ``perf.data`` file |
 | before the final report can be viewed. |
 |  | 
 | .. code:: | 
 |  | 
 |   perf record $QEMU -perfmap $REMAINING_ARGS | 
 |   perf report | 
 |  | 
 |   perf record -k 1 $QEMU -jitdump $REMAINING_ARGS | 
 |   DEBUGINFOD_URLS= perf inject -j -i perf.data -o perf.data.jitted | 
 |   perf report -i perf.data.jitted | 
 |  | 
 | Note that qemu-system generates mappings only for ``-kernel`` files in ELF | 
 | format. |