| .. |
| Copyright (C) 2017, Emilio G. Cota <cota@braap.org> |
| Copyright (c) 2019, Linaro Limited |
| Written by Emilio Cota and Alex Bennée |
| |
| QEMU TCG Plugins |
| ================ |
| |
| QEMU TCG plugins provide a way for users to run experiments taking |
| advantage of the total system control emulation can have over a guest. |
| The plugin interface provides a mechanism for plugins to subscribe to |
| events during translation and execution and optionally call back into |
| the plugin during these events. TCG plugins are unable to change the |
| system state; they can only monitor it passively. However, they can do |
| this down to individual instruction granularity, including potentially |
| subscribing to all load and store operations. |
| |
| Usage |
| ----- |
| |
| Any QEMU binary with TCG support has plugins enabled by default. |
| In earlier releases, plugins needed to be explicitly enabled with:: |
| |
| configure --enable-plugins |
| |
| Once built, a program can be run with multiple plugins loaded, each |
| with its own arguments:: |
| |
| $QEMU $OTHER_QEMU_ARGS \ |
| -plugin contrib/plugins/libhowvec.so,inline=on,count=hint \ |
| -plugin contrib/plugins/libhotblocks.so |
| |
| Arguments are plugin specific and can be used to modify their |
| behaviour. In this case the howvec plugin is being asked to use inline |
| ops to count and break down the hint instructions by type. |
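
Inside a plugin, each comma-separated option typically arrives as its own ``key=value`` string. The following is a minimal, self-contained sketch of splitting such a string. The helper name ``split_plugin_arg`` and the fixed-size buffers are assumptions made for illustration and are not part of the QEMU plugin API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a QEMU API): split "key=value" into two
 * caller-provided buffers. Returns false if there is no '=' or if
 * either part does not fit.
 */
static bool split_plugin_arg(const char *arg,
                             char *key, size_t key_len,
                             char *val, size_t val_len)
{
    assert(arg && key && val);

    const char *eq = strchr(arg, '=');
    if (!eq) {
        return false;
    }

    size_t klen = (size_t)(eq - arg);
    size_t vlen = strlen(eq + 1);
    if (klen >= key_len || vlen >= val_len) {
        return false;
    }

    memcpy(key, arg, klen);
    key[klen] = '\0';
    memcpy(val, eq + 1, vlen + 1);   /* copies the terminating NUL too */
    return true;
}
```

With the howvec example above, ``count=hint`` would split into the key ``count`` and the value ``hint``, while a bare word with no ``=`` is rejected so the caller can decide how to treat it.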
| |
| Linux user-mode emulation also evaluates the environment variable |
| ``QEMU_PLUGIN``:: |
| |
| QEMU_PLUGIN="file=contrib/plugins/libhowvec.so,inline=on,count=hint" $QEMU |
| |
| Writing plugins |
| --------------- |
| |
| API versioning |
| ~~~~~~~~~~~~~~ |
| |
| This is a new feature for QEMU, and it allows people to develop |
| out-of-tree plugins that can be dynamically linked into a running QEMU |
| process. However, the project reserves the right to change or break the |
| API should it need to do so. The best way to avoid breakage is to submit |
| your plugin upstream so it can be updated if/when the API changes. |
| |
| All plugins need to declare a symbol which exports the plugin API |
| version they were built against. This can be done simply by:: |
| |
| QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; |
| |
| The core code will refuse to load a plugin that doesn't export a |
| ``qemu_plugin_version`` symbol or whose version is outside of QEMU's |
| supported range of API versions. |
| |
| Additionally the ``qemu_info_t`` structure which is passed to the |
| ``qemu_plugin_install`` method of a plugin will detail the minimum and |
| current API versions supported by QEMU. The API version will be |
| incremented if new APIs are added. The minimum API version will be |
| incremented if existing APIs are changed or removed. |
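
The range check described above can be pictured with a small sketch. The ``api_range`` structure and ``plugin_version_ok`` function are hypothetical names used only for illustration; they are not QEMU's loader code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the version range QEMU advertises. */
struct api_range {
    int min;  /* oldest plugin API version still supported */
    int cur;  /* newest plugin API version supported */
};

/* A plugin is loadable only if its version lies within [min, cur]. */
static bool plugin_version_ok(int plugin_version, struct api_range r)
{
    assert(r.min <= r.cur);
    return plugin_version >= r.min && plugin_version <= r.cur;
}
```

Under this model, adding an API raises only ``cur``, so older plugins keep loading. Changing or removing an API raises ``min``, which is what rejects plugins built against the old behaviour.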
| |
| Lifetime of the query handle |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Each callback provides an opaque anonymous information handle which |
| can usually be further queried to find out information about a |
| translation, instruction or operation. The handles themselves are only |
| valid during the lifetime of the callback so it is important that any |
| information that is needed is extracted during the callback and saved |
| by the plugin. |
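
As a rough illustration of this rule, the sketch below copies what it needs out of a handle into plugin-owned storage before the callback returns. ``insn_handle``, ``insn_record`` and ``on_insn`` are hypothetical stand-ins, not types or functions from the plugin API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an opaque handle owned by the caller. */
struct insn_handle {
    uint64_t vaddr;
    const char *disas;   /* storage may be reused once the callback returns */
};

/* Plugin-owned record that outlives the callback. */
struct insn_record {
    uint64_t vaddr;
    char disas[64];
};

/* Copy everything that is needed; never keep the handle pointer itself. */
static void on_insn(const struct insn_handle *h, struct insn_record *out)
{
    assert(h && out);
    out->vaddr = h->vaddr;
    snprintf(out->disas, sizeof(out->disas), "%s", h->disas);
}
```

The record stays usable after the handle's backing storage has been reused, which is the property a summary printed at exit depends on.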
| |
| Plugin life cycle |
| ~~~~~~~~~~~~~~~~~ |
| |
| First the plugin is loaded and the public ``qemu_plugin_install`` |
| function is called. The plugin will then register callbacks for various |
| plugin events. Generally plugins will register a handler for the |
| *atexit* event if they want to dump a summary of collected information |
| once the program/system has finished running. |
| |
| When a registered event occurs the plugin callback is invoked. The |
| callbacks may provide additional information. In the case of a |
| translation event the plugin has an option to enumerate the |
| instructions in a block of instructions and optionally register |
| callbacks to some or all instructions when they are executed. |
| |
| There is also a facility to add an inline event where code to |
| increment a counter can be directly inlined with the translation. |
| Currently only a simple increment is supported. This is not atomic so |
| can miss counts. If you want absolute precision you should use a |
| callback which can then ensure atomicity itself. |
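
The trade-off between the two counting styles can be sketched in plain C11. This is only a model of the idea: ``count_inline`` and ``count_in_callback`` are hypothetical names, and the code does not use the plugin API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Plain counter: cheap, but concurrent increments can be lost. */
static uint64_t plain_count;

/* Atomic counter: every increment is observed, at some extra cost. */
static atomic_uint_fast64_t exact_count;

/* Models what an inlined increment does: load, add, store. */
static void count_inline(void)
{
    plain_count++;
}

/* Models a callback that takes care of atomicity itself. */
static void count_in_callback(void)
{
    atomic_fetch_add_explicit(&exact_count, 1, memory_order_relaxed);
}

/* Read back the exact count, e.g. when printing a summary at exit. */
static uint64_t read_exact(void)
{
    return atomic_load_explicit(&exact_count, memory_order_relaxed);
}
```

With a single thread both counters agree. With several vCPU threads only the atomic variant is guaranteed to see every increment.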
| |
| Finally when QEMU exits all the registered *atexit* callbacks are |
| invoked. |
| |
| Exposure of QEMU internals |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The plugin architecture actively avoids leaking implementation details |
| about how QEMU's translation works to the plugins. While there are |
| concepts such as translation time and translation blocks, the |
| details are opaque to plugins. The plugin is able to query select |
| details of instructions and system configuration only through the |
| exported *qemu_plugin* functions. |
| |
| Internals |
| --------- |
| |
| Locking |
| ~~~~~~~ |
| |
| We have to ensure we cannot deadlock, particularly under MTTCG. For |
| this we acquire a lock when called from plugin code. We also keep the |
| list of callbacks under RCU so that we do not have to hold the lock |
| when calling the callbacks. This is also for performance, since some |
| callbacks (e.g. memory access callbacks) might be called very |
| frequently. |
| |
| * A consequence of this is that we keep our own list of CPUs, so that |
| we do not have to worry about locking order with respect to ``cpu_list_lock``. |
| * Use a recursive lock, since we can get registration calls from |
| callbacks. |
| |
| As a result registering/unregistering callbacks is "slow", since it |
| takes a lock. But this is very infrequent; we want performance when |
| calling (or not calling) callbacks, not when registering them. Using |
| RCU is great for this. |
| |
| We support the uninstallation of a plugin at any time (e.g. from |
| plugin callbacks). This allows plugins to remove themselves if they no |
| longer want to instrument the code. This operation is asynchronous |
| which means callbacks may still occur after the uninstall operation is |
| requested. The plugin isn't completely uninstalled until the safe work |
| has executed while all vCPUs are quiescent. |
| |
| Example Plugins |
| --------------- |
| |
| There are a number of plugins included with QEMU and you are |
| encouraged to contribute your own plugins upstream. There is a |
| ``contrib/plugins`` directory where they can go. There are also some |
| basic plugins that are used to test and exercise the API during the |
| ``make check-tcg`` target in ``tests/plugins``. |
| |
| - tests/plugins/empty.c |
| |
| Purely a test plugin for measuring the overhead of the plugin system |
| itself. Does no instrumentation. |
| |
| - tests/plugins/bb.c |
| |
| A very basic plugin which will measure execution in coarse terms as |
| each basic block is executed. By default the results are shown once |
| execution finishes:: |
| |
| $ qemu-aarch64 -plugin tests/plugin/libbb.so \ |
| -d plugin ./tests/tcg/aarch64-linux-user/sha1 |
| SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6 |
| bb's: 2277338, insns: 158483046 |
| |
| Behaviour can be tweaked with the following arguments: |
| |
| * inline=true|false |
| |
| Use faster inline addition of a single counter. Not per-cpu and not |
| thread safe. |
| |
| * idle=true|false |
| |
| Dump the current execution stats whenever the guest vCPU idles. |
| |
| - tests/plugins/insn.c |
| |
| This is a basic instruction level instrumentation which can count the |
| number of instructions executed on each core/thread:: |
| |
| $ qemu-aarch64 -plugin tests/plugin/libinsn.so \ |
| -d plugin ./tests/tcg/aarch64-linux-user/threadcount |
| Created 10 threads |
| Done |
| cpu 0 insns: 46765 |
| cpu 1 insns: 3694 |
| cpu 2 insns: 3694 |
| cpu 3 insns: 2994 |
| cpu 4 insns: 1497 |
| cpu 5 insns: 1497 |
| cpu 6 insns: 1497 |
| cpu 7 insns: 1497 |
| total insns: 63135 |
| |
| Behaviour can be tweaked with the following arguments: |
| |
| * inline=true|false |
| |
| Use faster inline addition of a single counter. Not per-cpu and not |
| thread safe. |
| |
| * sizes=true|false |
| |
| Give a summary of the instruction sizes for the execution. |
| |
| * match=<string> |
| |
| Only instrument instructions matching the string prefix. Will show |
| some basic stats including how many instructions have executed since |
| the last execution. For example:: |
| |
| $ qemu-aarch64 -plugin tests/plugin/libinsn.so,match=bl \ |
| -d plugin ./tests/tcg/aarch64-linux-user/sha512-vector |
| ... |
| 0x40069c, 'bl #0x4002b0', 10 hits, 1093 match hits, Δ+1257 since last match, 98 avg insns/match |
| 0x4006ac, 'bl #0x403690', 10 hits, 1094 match hits, Δ+47 since last match, 98 avg insns/match |
| 0x4037fc, 'bl #0x4002b0', 18 hits, 1095 match hits, Δ+22 since last match, 98 avg insns/match |
| 0x400720, 'bl #0x403690', 10 hits, 1096 match hits, Δ+58 since last match, 98 avg insns/match |
| 0x4037fc, 'bl #0x4002b0', 19 hits, 1097 match hits, Δ+22 since last match, 98 avg insns/match |
| 0x400730, 'bl #0x403690', 10 hits, 1098 match hits, Δ+33 since last match, 98 avg insns/match |
| 0x4037ac, 'bl #0x4002b0', 12 hits, 1099 match hits, Δ+20 since last match, 98 avg insns/match |
| ... |
| |
| For more detailed execution tracing, see the ``execlog`` plugin. |
| |
| - tests/plugins/mem.c |
| |
| Basic instruction level memory instrumentation:: |
| |
| $ qemu-aarch64 -plugin tests/plugin/libmem.so,inline=true \ |
| -d plugin ./tests/tcg/aarch64-linux-user/sha1 |
| SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6 |
| inline mem accesses: 79525013 |
| |
| Behaviour can be tweaked with the following arguments: |
| |
| * inline=true|false |
| |
| Use faster inline addition of a single counter. Not per-cpu and not |
| thread safe. |
| |
| * callback=true|false |
| |
| Use callbacks on each memory instrumentation. |
| |
| * hwaddr=true|false |
| |
| Count IO accesses (only for system emulation). |
| |
| - tests/plugins/syscall.c |
| |
| A basic syscall tracing plugin. This only works for user-mode. By |
| default it will give a summary of syscall stats at the end of the |
| run:: |
| |
| $ qemu-aarch64 -plugin tests/plugin/libsyscall.so \ |
| -d plugin ./tests/tcg/aarch64-linux-user/threadcount |
| Created 10 threads |
| Done |
| syscall no. calls errors |
| 226 12 0 |
| 99 11 11 |
| 115 11 0 |
| 222 11 0 |
| 93 10 0 |
| 220 10 0 |
| 233 10 0 |
| 215 8 0 |
| 214 4 0 |
| 134 2 0 |
| 64 2 0 |
| 96 1 0 |
| 94 1 0 |
| 80 1 0 |
| 261 1 0 |
| 78 1 0 |
| 160 1 0 |
| 135 1 0 |
| |
| - contrib/plugins/hotblocks.c |
| |
| The hotblocks plugin allows you to examine where the hot paths of |
| execution are in your program. Once the program has finished you will |
| get a sorted list of blocks reporting the starting PC, translation |
| count, number of instructions and execution count. This will work best |
| with linux-user execution as system emulation tends to generate |
| re-translations as blocks from different programs get swapped in and |
| out of system memory. |
| |
| If your program is single-threaded you can use the ``inline`` option for |
| slightly faster (but not thread safe) counters. |
| |
| Example:: |
| |
| $ qemu-aarch64 \ |
| -plugin contrib/plugins/libhotblocks.so -d plugin \ |
| ./tests/tcg/aarch64-linux-user/sha1 |
| SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6 |
| collected 903 entries in the hash table |
| pc, tcount, icount, ecount |
| 0x0000000041ed10, 1, 5, 66087 |
| 0x000000004002b0, 1, 4, 66087 |
| ... |
| |
| - contrib/plugins/hotpages.c |
| |
| Similar to hotblocks but this time tracks memory accesses:: |
| |
| $ qemu-aarch64 \ |
| -plugin contrib/plugins/libhotpages.so -d plugin \ |
| ./tests/tcg/aarch64-linux-user/sha1 |
| SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6 |
| Addr, RCPUs, Reads, WCPUs, Writes |
| 0x000055007fe000, 0x0001, 31747952, 0x0001, 8835161 |
| 0x000055007ff000, 0x0001, 29001054, 0x0001, 8780625 |
| 0x00005500800000, 0x0001, 687465, 0x0001, 335857 |
| 0x0000000048b000, 0x0001, 130594, 0x0001, 355 |
| 0x0000000048a000, 0x0001, 1826, 0x0001, 11 |
| |
| The hotpages plugin can be configured using the following arguments: |
| |
| * sortby=reads|writes|address |
| |
| Log the data sorted by either the number of reads, the number of writes, or |
| memory address. (Default: entries are sorted by the sum of reads and writes) |
| |
| * io=on |
| |
| Track IO addresses. Only relevant to full system emulation. (Default: off) |
| |
| * pagesize=N |
| |
| The page size used. (Default: N = 4096) |
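
To make the role of ``pagesize`` concrete, the sketch below rounds an address down to the start of its page, which is one straightforward way to bucket accesses by page. The function name ``page_base`` is an assumption for illustration, and the code is not taken from the hotpages source.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the start of its page. */
static uint64_t page_base(uint64_t addr, uint64_t pagesize)
{
    /* The mask trick below is only valid for power-of-two page sizes. */
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    return addr & ~(pagesize - 1);
}
```

With the default of 4096, every address reported in the example output above is already a multiple of the page size, which is consistent with this kind of bucketing.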
| |
| - contrib/plugins/howvec.c |
| |
| This is an instruction classifier, so it can be used to count different |
| types of instructions. It has a number of options to refine which get |
| counted. You can give a value to the ``count`` argument for a class of |
| instructions to break it down fully, so for example to see all the system |
| register accesses:: |
| |
| $ qemu-system-aarch64 $(QEMU_ARGS) \ |
| -append "root=/dev/sda2 systemd.unit=benchmark.service" \ |
| -smp 4 -plugin ./contrib/plugins/libhowvec.so,count=sreg -d plugin |
| |
| which will lead to a sorted list after the class breakdown:: |
| |
| Instruction Classes: |
| Class: UDEF not counted |
| Class: SVE (68 hits) |
| Class: PCrel addr (47789483 hits) |
| Class: Add/Sub (imm) (192817388 hits) |
| Class: Logical (imm) (93852565 hits) |
| Class: Move Wide (imm) (76398116 hits) |
| Class: Bitfield (44706084 hits) |
| Class: Extract (5499257 hits) |
| Class: Cond Branch (imm) (147202932 hits) |
| Class: Exception Gen (193581 hits) |
| Class: NOP not counted |
| Class: Hints (6652291 hits) |
| Class: Barriers (8001661 hits) |
| Class: PSTATE (1801695 hits) |
| Class: System Insn (6385349 hits) |
| Class: System Reg counted individually |
| Class: Branch (reg) (69497127 hits) |
| Class: Branch (imm) (84393665 hits) |
| Class: Cmp & Branch (110929659 hits) |
| Class: Tst & Branch (44681442 hits) |
| Class: AdvSimd ldstmult (736 hits) |
| Class: ldst excl (9098783 hits) |
| Class: Load Reg (lit) (87189424 hits) |
| Class: ldst noalloc pair (3264433 hits) |
| Class: ldst pair (412526434 hits) |
| Class: ldst reg (imm) (314734576 hits) |
| Class: Loads & Stores (2117774 hits) |
| Class: Data Proc Reg (223519077 hits) |
| Class: Scalar FP (31657954 hits) |
| Individual Instructions: |
| Instr: mrs x0, sp_el0 (2682661 hits) (op=0xd5384100/ System Reg) |
| Instr: mrs x1, tpidr_el2 (1789339 hits) (op=0xd53cd041/ System Reg) |
| Instr: mrs x2, tpidr_el2 (1513494 hits) (op=0xd53cd042/ System Reg) |
| Instr: mrs x0, tpidr_el2 (1490823 hits) (op=0xd53cd040/ System Reg) |
| Instr: mrs x1, sp_el0 (933793 hits) (op=0xd5384101/ System Reg) |
| Instr: mrs x2, sp_el0 (699516 hits) (op=0xd5384102/ System Reg) |
| Instr: mrs x4, tpidr_el2 (528437 hits) (op=0xd53cd044/ System Reg) |
| Instr: mrs x30, ttbr1_el1 (480776 hits) (op=0xd538203e/ System Reg) |
| Instr: msr ttbr1_el1, x30 (480713 hits) (op=0xd518203e/ System Reg) |
| Instr: msr vbar_el1, x30 (480671 hits) (op=0xd518c01e/ System Reg) |
| ... |
| |
| To find the argument shorthand for the class you need to examine the |
| source code of the plugin at the moment, specifically the ``*opt`` |
| argument in the InsnClassExecCount tables. |
| |
| - contrib/plugins/lockstep.c |
| |
| This is a debugging tool for developers who want to find out when and |
| where execution diverges after a subtle change to TCG code generation. |
| It is not an exact science and results are likely to be mixed once |
| asynchronous events are introduced. While the use of -icount can |
| introduce determinism to the execution flow, it doesn't always follow |
| that the translation sequence will be exactly the same. Typically this is |
| caused by a timer firing to service the GUI causing a block to end |
| early. However in some cases it has proved to be useful in pointing |
| people at roughly where execution diverges. The only argument you need |
| for the plugin is a path for the socket the two instances will |
| communicate over:: |
| |
| |
| $ qemu-system-sparc -monitor none -parallel none \ |
| -net none -M SS-20 -m 256 -kernel day11/zImage.elf \ |
| -plugin ./contrib/plugins/liblockstep.so,sockpath=lockstep-sparc.sock \ |
| -d plugin,nochain |
| |
| which will eventually report:: |
| |
| qemu-system-sparc: warning: nic lance.0 has no peer |
| @ 0x000000ffd06678 vs 0x000000ffd001e0 (2/1 since last) |
| @ 0x000000ffd07d9c vs 0x000000ffd06678 (3/1 since last) |
| Δ insn_count @ 0x000000ffd07d9c (809900609) vs 0x000000ffd06678 (809900612) |
| previously @ 0x000000ffd06678/10 (809900609 insns) |
| previously @ 0x000000ffd001e0/4 (809900599 insns) |
| previously @ 0x000000ffd080ac/2 (809900595 insns) |
| previously @ 0x000000ffd08098/5 (809900593 insns) |
| previously @ 0x000000ffd080c0/1 (809900588 insns) |
| |
| - contrib/plugins/hwprofile.c |
| |
| The hwprofile tool can only be used with system emulation and allows |
| the user to see which hardware is accessed and how often. It has a number of options: |
| |
| * track=read or track=write |
| |
| By default the plugin tracks both reads and writes. You can use one |
| of these options to limit the tracking to just one class of accesses. |
| |
| * source |
| |
| Include a detailed breakdown of which guest PC made each access. |
| Not compatible with the pattern option. Example output:: |
| |
| cirrus-low-memory @ 0xfffffd00000a0000 |
| pc:fffffc0000005cdc, 1, 256 |
| pc:fffffc0000005ce8, 1, 256 |
| pc:fffffc0000005cec, 1, 256 |
| |
| * pattern |
| |
| Instead break down the accesses based on the offset into the HW |
| region. This can be useful for seeing the most used registers of a |
| device. Example output:: |
| |
| pci0-conf @ 0xfffffd01fe000000 |
| off:00000004, 1, 1 |
| off:00000010, 1, 3 |
| off:00000014, 1, 3 |
| off:00000018, 1, 2 |
| off:0000001c, 1, 2 |
| off:00000020, 1, 2 |
| ... |
| |
| - contrib/plugins/execlog.c |
| |
| The execlog tool traces executed instructions with memory access. It can be used |
| for debugging and security analysis purposes. |
| Please be aware that this will generate a lot of output. |
| |
| The plugin can be run without any arguments:: |
| |
| $ qemu-system-arm $(QEMU_ARGS) \ |
| -plugin ./contrib/plugins/libexeclog.so -d plugin |
| |
| which will output an execution trace following this structure:: |
| |
| # vCPU, vAddr, opcode, disassembly[, load/store, memory addr, device]... |
| 0, 0xa12, 0xf8012400, "movs r4, #0" |
| 0, 0xa14, 0xf87f42b4, "cmp r4, r6" |
| 0, 0xa16, 0xd206, "bhs #0xa26" |
| 0, 0xa18, 0xfff94803, "ldr r0, [pc, #0xc]", load, 0x00010a28, RAM |
| 0, 0xa1a, 0xf989f000, "bl #0xd30" |
| 0, 0xd30, 0xfff9b510, "push {r4, lr}", store, 0x20003ee0, RAM, store, 0x20003ee4, RAM |
| 0, 0xd32, 0xf9893014, "adds r0, #0x14" |
| 0, 0xd34, 0xf9c8f000, "bl #0x10c8" |
| 0, 0x10c8, 0xfff96c43, "ldr r3, [r0, #0x44]", load, 0x200000e4, RAM |
| |
| The output can be filtered to only track certain instructions or |
| addresses using the ``ifilter`` or ``afilter`` options. You can stack the |
| arguments if required:: |
| |
| $ qemu-system-arm $(QEMU_ARGS) \ |
| -plugin ./contrib/plugins/libexeclog.so,ifilter=st1w,afilter=0x40001808 -d plugin |
| |
| - contrib/plugins/cache.c |
| |
| Cache modelling plugin that measures the performance of a given L1 cache |
| configuration, and optionally a unified L2 per-core cache when a given working |
| set is run:: |
| |
| $ qemu-x86_64 -plugin ./contrib/plugins/libcache.so \ |
| -d plugin -D cache.log ./tests/tcg/x86_64-linux-user/float_convs |
| |
| will report the following:: |
| |
| core #, data accesses, data misses, dmiss rate, insn accesses, insn misses, imiss rate |
| 0 996695 508 0.0510% 2642799 18617 0.7044% |
| |
| address, data misses, instruction |
| 0x424f1e (_int_malloc), 109, movq %rax, 8(%rcx) |
| 0x41f395 (_IO_default_xsputn), 49, movb %dl, (%rdi, %rax) |
| 0x42584d (ptmalloc_init.part.0), 33, movaps %xmm0, (%rax) |
| 0x454d48 (__tunables_init), 20, cmpb $0, (%r8) |
| ... |
| |
| address, fetch misses, instruction |
| 0x4160a0 (__vfprintf_internal), 744, movl $1, %ebx |
| 0x41f0a0 (_IO_setb), 744, endbr64 |
| 0x415882 (__vfprintf_internal), 744, movq %r12, %rdi |
| 0x4268a0 (__malloc), 696, andq $0xfffffffffffffff0, %rax |
| ... |
| |
| The plugin has a number of arguments, all of them are optional: |
| |
| * limit=N |
| |
| Print the top N icache and dcache thrashing instructions along with their |
| addresses, number of misses, and disassembly. (default: 32) |
| |
| * icachesize=N |
| * iblksize=B |
| * iassoc=A |
| |
| Instruction cache configuration arguments. They specify the cache size, block |
| size, and associativity of the instruction cache, respectively. |
| (default: N = 16384, B = 64, A = 8) |
| |
| * dcachesize=N |
| * dblksize=B |
| * dassoc=A |
| |
| Data cache configuration arguments. They specify the cache size, block size, |
| and associativity of the data cache, respectively. |
| (default: N = 16384, B = 64, A = 8) |
| |
| * evict=POLICY |
| |
| Sets the eviction policy to POLICY. Available policies are: :code:`lru`, |
| :code:`fifo`, and :code:`rand`. The plugin will use the specified policy for |
| both instruction and data caches. (default: POLICY = :code:`lru`) |
| |
| * cores=N |
| |
| Sets the number of cores for which we maintain separate icache and dcache. |
| (default: for linux-user, N = 1, for full system emulation: N = cores |
| available to guest) |
| |
| * l2=on |
| |
| Simulates a unified L2 cache (stores blocks for both instructions and data) |
| using the default L2 configuration (cache size = 2MB, associativity = 16-way, |
| block size = 64B). |
| |
| * l2cachesize=N |
| * l2blksize=B |
| * l2assoc=A |
| |
| L2 cache configuration arguments. They specify the cache size, block size, and |
| associativity of the L2 cache, respectively. Setting any of the L2 |
| configuration arguments implies ``l2=on``. |
| (default: N = 2097152 (2MB), B = 64, A = 16) |
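
The size, block size and associativity arguments relate to each other through the usual set-associative cache arithmetic. The sketch below shows that arithmetic in isolation; the ``cache_geom`` structure and helper names are hypothetical and the code is not the plugin's implementation.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geom {
    uint64_t size;    /* total capacity in bytes */
    uint64_t blksize; /* bytes per block */
    uint64_t assoc;   /* blocks per set */
};

/* Number of sets = capacity / (block size * associativity). */
static uint64_t cache_num_sets(struct cache_geom g)
{
    assert(g.blksize != 0 && g.assoc != 0);
    return g.size / (g.blksize * g.assoc);
}

/* The block number modulo the number of sets selects the set. */
static uint64_t cache_set_index(struct cache_geom g, uint64_t addr)
{
    return (addr / g.blksize) % cache_num_sets(g);
}
```

With the documented defaults this gives 32 sets for each L1 cache (16384 / (64 * 8)) and 2048 sets for the L2 cache (2097152 / (64 * 16)).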
| |
| API |
| --- |
| |
| The following API is generated from the inline documentation in |
| ``include/qemu/qemu-plugin.h``. Please ensure any updates to the API |
| include the full kernel-doc annotations. |
| |
| .. kernel-doc:: include/qemu/qemu-plugin.h |
| |