| Hexagon is Qualcomm's very long instruction word (VLIW) digital signal |
| processor (DSP). We also support Hexagon Vector eXtensions (HVX). HVX |
| is a wide vector coprocessor designed for high performance computer vision, |
| image processing, machine learning, and other workloads. |
| |
| The following versions of the Hexagon core are supported |
| Scalar core: v67 |
| https://developer.qualcomm.com/downloads/qualcomm-hexagon-v67-programmer-s-reference-manual |
| HVX extension: v66 |
| https://developer.qualcomm.com/downloads/qualcomm-hexagon-v66-hvx-programmer-s-reference-manual |
| |
| We presented an overview of the project at the 2019 KVM Forum. |
| https://kvmforum2019.sched.com/event/Tmwc/qemu-hexagon-automatic-translation-of-the-isa-manual-pseudcode-to-tiny-code-instructions-of-a-vliw-architecture-niccolo-izzo-revng-taylor-simpson-qualcomm-innovation-center |
| |
| *** Tour of the code *** |
| |
| The qemu-hexagon implementation is a combination of qemu and the Hexagon |
| architecture library (aka archlib). The three primary directories with |
| Hexagon-specific code are |
| |
| qemu/target/hexagon |
| This has all the instruction and packet semantics |
| qemu/target/hexagon/imported |
| These files are imported with very little modification from archlib |
| *.idef Instruction semantics definition |
| macros.def Mapping of macros to instruction attributes |
| encode*.def Encoding patterns for each instruction |
| iclass.def Instruction class definitions used to determine |
| legal VLIW slots for each instruction |
| qemu/linux-user/hexagon |
| Helpers for loading the ELF file and making Linux system calls, |
| signals, etc |
| |
| We start with scripts that generate a bunch of include files. This |
| is a two-step process. The first step is to use the C preprocessor to expand |
| macros inside the architecture definition files. This is done in |
| target/hexagon/gen_semantics.c. This step produces |
| <BUILD_DIR>/target/hexagon/semantics_generated.pyinc. |
| That file is consumed by the following python scripts to produce the indicated |
| header files in <BUILD_DIR>/target/hexagon |
| gen_opcodes_def.py -> opcodes_def_generated.h.inc |
| gen_op_regs.py -> op_regs_generated.h.inc |
| gen_printinsn.py -> printinsn_generated.h.inc |
| gen_op_attribs.py -> op_attribs_generated.h.inc |
| gen_helper_protos.py -> helper_protos_generated.h.inc |
| gen_shortcode.py -> shortcode_generated.h.inc |
| gen_tcg_funcs.py -> tcg_funcs_generated.c.inc |
| gen_tcg_func_table.py -> tcg_func_table_generated.c.inc |
| gen_helper_funcs.py -> helper_funcs_generated.c.inc |
| |
| Qemu helper functions have 3 parts |
| DEF_HELPER declaration indicates the signature of the helper |
| gen_helper_<NAME> will generate a TCG call to the helper function |
| The helper implementation |
| |
| Here's an example of the A2_add instruction. |
| Instruction tag A2_add |
| Assembly syntax "Rd32=add(Rs32,Rt32)" |
| Instruction semantics "{ RdV=RsV+RtV;}" |
| |
| By convention, the operands are identified by letter |
| RdV is the destination register |
| RsV, RtV are source registers |
| |
| The generator uses the operand naming conventions (see large comment in |
| hex_common.py) to determine the signature of the helper function. Here are the |
| results for A2_add |
| |
| helper_protos_generated.h.inc |
| DEF_HELPER_3(A2_add, s32, env, s32, s32) |
| |
| tcg_funcs_generated.c.inc |
| static void generate_A2_add( |
| CPUHexagonState *env, |
| DisasContext *ctx, |
| Insn *insn, |
| Packet *pkt) |
| { |
| TCGv RdV = tcg_temp_local_new(); |
| const int RdN = insn->regno[0]; |
| TCGv RsV = hex_gpr[insn->regno[1]]; |
| TCGv RtV = hex_gpr[insn->regno[2]]; |
| gen_helper_A2_add(RdV, cpu_env, RsV, RtV); |
| gen_log_reg_write(RdN, RdV); |
| ctx_log_reg_write(ctx, RdN); |
| tcg_temp_free(RdV); |
| } |
| |
| helper_funcs_generated.c.inc |
| int32_t HELPER(A2_add)(CPUHexagonState *env, int32_t RsV, int32_t RtV) |
| { |
| uint32_t slot __attribute__((unused)) = 4; |
| int32_t RdV = 0; |
| { RdV=RsV+RtV;} |
| return RdV; |
| } |
| |
| Note that generate_A2_add updates the disassembly context to be processed |
| when the packet commits (see "Packet Semantics" below). |
| |
| The generator checks for an fGEN_TCG_<tag> macro. This allows us to generate |
| TCG code instead of a call to the helper. If defined, the macro takes 1 |
| argument: |
| C semantics (aka short code) |
| |
| This allows the code generator to override the auto-generated code. In some |
| cases this is necessary for correct execution. We can also override for |
| faster emulation. For example, calling a helper for add is more expensive |
| than generating a TCG add operation. |
| |
| The gen_tcg.h file has any overrides. For example, we could write |
| #define fGEN_TCG_A2_add(SHORTCODE) \ |
| tcg_gen_add_tl(RdV, RsV, RtV) |
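The override mechanism can be illustrated with a small standalone model. This is only a sketch: the names toy_generate_add, toy_helper_add, helper_calls and fGEN_TCG_TOY_add are hypothetical, the "TCG op" is modeled as an ordinary C assignment, and the #ifdef dispatch is an assumption made for illustration rather than a copy of what the generator emits.

```c
#include <stdint.h>

/* Hypothetical counter: how many times the slow helper path was taken. */
int helper_calls;

/* Hypothetical stand-in for a QEMU helper (the out-of-line path). */
int32_t toy_helper_add(int32_t rs, int32_t rt)
{
    helper_calls++;
    return (int32_t)((uint32_t)rs + (uint32_t)rt);
}

/*
 * Override in the style of gen_tcg.h. It takes the short code as its
 * single argument and is free to ignore it, as the A2_add override does.
 */
#define fGEN_TCG_TOY_add(SHORTCODE) \
    (RdV = (int32_t)((uint32_t)RsV + (uint32_t)RtV))

/* Toy model of a generated function: use the override when it exists. */
int32_t toy_generate_add(int32_t RsV, int32_t RtV)
{
    int32_t RdV = 0;
#ifdef fGEN_TCG_TOY_add
    fGEN_TCG_TOY_add({ RdV = RsV + RtV; });
#else
    RdV = toy_helper_add(RsV, RtV);
#endif
    return RdV;
}
```

With the override defined, toy_generate_add never reaches toy_helper_add. Removing the #define makes the same function fall back to the helper.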
| |
| The instruction semantics C code relies heavily on macros. In cases where the |
| C semantics are specified only with macros, we can override the default with |
| the short semantics option and #define the macros to generate TCG code. One |
| example is L2_loadw_locked: |
| Instruction tag L2_loadw_locked |
| Assembly syntax "Rd32=memw_locked(Rs32)" |
| Instruction semantics "{ fEA_REG(RsV); fLOAD_LOCKED(1,4,u,EA,RdV) }" |
| |
| In gen_tcg.h, we use the shortcode |
| #define fGEN_TCG_L2_loadw_locked(SHORTCODE) \ |
| SHORTCODE |
| |
| There are also cases where we brute force the TCG code generation. |
| Instructions with multiple definitions are examples. These require special |
| handling because qemu helpers can only return a single value. |
| |
| For HVX vectors, the generator behaves slightly differently. The wide vectors |
| won't fit in a TCGv or TCGv_i64, so we use TCGv_ptr variables to pass their |
| addresses to helper functions. Here's an example for an HVX vector-add-word |
| instruction. |
| static void generate_V6_vaddw( |
| CPUHexagonState *env, |
| DisasContext *ctx, |
| Insn *insn, |
| Packet *pkt) |
| { |
| const int VdN = insn->regno[0]; |
| const intptr_t VdV_off = |
| ctx_future_vreg_off(ctx, VdN, 1, true); |
| TCGv_ptr VdV = tcg_temp_local_new_ptr(); |
| tcg_gen_addi_ptr(VdV, cpu_env, VdV_off); |
| const int VuN = insn->regno[1]; |
| const intptr_t VuV_off = |
| vreg_src_off(ctx, VuN); |
| TCGv_ptr VuV = tcg_temp_local_new_ptr(); |
| const int VvN = insn->regno[2]; |
| const intptr_t VvV_off = |
| vreg_src_off(ctx, VvN); |
| TCGv_ptr VvV = tcg_temp_local_new_ptr(); |
| tcg_gen_addi_ptr(VuV, cpu_env, VuV_off); |
| tcg_gen_addi_ptr(VvV, cpu_env, VvV_off); |
| TCGv slot = tcg_constant_tl(insn->slot); |
| gen_helper_V6_vaddw(cpu_env, VdV, VuV, VvV, slot); |
| tcg_temp_free(slot); |
| gen_log_vreg_write(ctx, VdV_off, VdN, EXT_DFL, insn->slot, false); |
| ctx_log_vreg_write(ctx, VdN, EXT_DFL, false); |
| tcg_temp_free_ptr(VdV); |
| tcg_temp_free_ptr(VuV); |
| tcg_temp_free_ptr(VvV); |
| } |
| |
| Notice that we also generate a variable named <operand>_off for each operand of |
| the instruction. This makes it easy to override the instruction semantics with |
| functions from tcg-op-gvec.h. Here's the override for this instruction. |
| #define fGEN_TCG_V6_vaddw(SHORTCODE) \ |
| tcg_gen_gvec_add(MO_32, VdV_off, VuV_off, VvV_off, \ |
| sizeof(MMVector), sizeof(MMVector)) |
| |
| Finally, we notice that the override doesn't use the TCGv_ptr variables, so |
| we don't generate them when an override is present. Here is what we generate |
| when the override is present. |
| static void generate_V6_vaddw( |
| CPUHexagonState *env, |
| DisasContext *ctx, |
| Insn *insn, |
| Packet *pkt) |
| { |
| const int VdN = insn->regno[0]; |
| const intptr_t VdV_off = |
| ctx_future_vreg_off(ctx, VdN, 1, true); |
| const int VuN = insn->regno[1]; |
| const intptr_t VuV_off = |
| vreg_src_off(ctx, VuN); |
| const int VvN = insn->regno[2]; |
| const intptr_t VvV_off = |
| vreg_src_off(ctx, VvN); |
| fGEN_TCG_V6_vaddw({ fHIDE(int i;) fVFOREACH(32, i) { VdV.w[i] = VuV.w[i] + VvV.w[i] ; } }); |
| gen_log_vreg_write(ctx, VdV_off, VdN, EXT_DFL, insn->slot, false); |
| ctx_log_vreg_write(ctx, VdN, EXT_DFL, false); |
| } |
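For reference, the short code passed to fGEN_TCG_V6_vaddw describes an element-wise 32-bit add. The following plain C model is a sketch of that semantics only. ToyVector, TOY_VEC_WORDS and toy_vaddw are hypothetical names, and the 128-byte vector length (32 words) is an assumption; the real MMVector type is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-byte vectors, i.e. 32 words of 32 bits each. */
#define TOY_VEC_WORDS 32

/* Hypothetical stand-in for a vector register viewed as words. */
typedef struct {
    int32_t w[TOY_VEC_WORDS];
} ToyVector;

/* vd.w[i] = vu.w[i] + vv.w[i] for every word, with 32-bit wraparound. */
void toy_vaddw(ToyVector *vd, const ToyVector *vu, const ToyVector *vv)
{
    for (size_t i = 0; i < TOY_VEC_WORDS; i++) {
        vd->w[i] = (int32_t)((uint32_t)vu->w[i] + (uint32_t)vv->w[i]);
    }
}
```

Each word is computed independently of the others, which is why a single gvec operation over the three register offsets can stand in for the loop.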
| |
| In addition to instruction semantics, we use a generator to create the decode |
| tree. This generation is also a two-step process. The first step is to run |
| target/hexagon/gen_dectree_import.c to produce |
| <BUILD_DIR>/target/hexagon/iset.py |
| This file is imported by target/hexagon/dectree.py to produce |
| <BUILD_DIR>/target/hexagon/dectree_generated.h.inc |
| |
| *** Key Files *** |
| |
| cpu.h |
| |
| This file contains the definition of the CPUHexagonState struct. It is the |
| runtime information for each thread and contains state such as the GPR and |
| predicate registers. |
| |
| macros.h |
| mmvec/macros.h |
| |
| The Hexagon arch lib relies heavily on macros for the instruction semantics. |
| This is a great advantage for qemu because we can override them for different |
| purposes. You will also notice there are sometimes two definitions of a macro. |
| The QEMU_GENERATE variable determines whether we want the macro to generate TCG |
| code. If QEMU_GENERATE is not defined, we want the macro to generate vanilla |
| C code that will work in the helper implementation. |
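The idea of giving one macro two definitions can be sketched in a single standalone file. Everything here is hypothetical (fTOY_ADD, TOY_SEMANTICS, toy_emit and the two functions), and the sketch switches flavors with #define/#undef instead of a QEMU_GENERATE test so that both can be shown side by side.

```c
#include <stdint.h>

/* One semantics description, written only in terms of a macro. */
#define TOY_SEMANTICS { fTOY_ADD(RdV, RsV, RtV); }

/* Flavor 1: plain C, suitable for the body of a helper function. */
#define fTOY_ADD(D, A, B) ((D) = (A) + (B))
int32_t toy_add_in_helper(int32_t RsV, int32_t RtV)
{
    int32_t RdV = 0;
    TOY_SEMANTICS
    return RdV;
}
#undef fTOY_ADD

/* Hypothetical "code generator" state: it just records the last op. */
int toy_ops_emitted;
const char *toy_last_op;

static void toy_emit(const char *op)
{
    toy_last_op = op;
    toy_ops_emitted++;
}

/* Flavor 2: emit an operation instead of computing a value. */
#define fTOY_ADD(D, A, B) toy_emit("add " #D ", " #A ", " #B)
void toy_add_in_translator(void)
{
    TOY_SEMANTICS
}
#undef fTOY_ADD
```

The same TOY_SEMANTICS text computes a sum in the first function and records an operation in the second, because the macro is resolved at the point of use.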
| |
| translate.c |
| |
| The functions in this file generate TCG code for a translation block. Some |
| important functions in this file are |
| |
| gen_start_packet - initialize the data structures for packet semantics |
| gen_commit_packet - commit the register writes, stores, etc for a packet |
| decode_and_translate_packet - disassemble a packet and generate code |
| |
| genptr.c |
| gen_tcg.h |
| |
| These files create a function for each instruction. It is mostly composed of |
| fGEN_TCG_<tag> definitions followed by including tcg_funcs_generated.c.inc. |
| |
| op_helper.c |
| |
| This file contains the implementations of all the helpers. There are a few |
| general purpose helpers, but most of them are generated by including |
| helper_funcs_generated.c.inc. There are also several helpers used for debugging. |
| |
| |
| *** Packet Semantics *** |
| |
| VLIW packet semantics differ from serial semantics in that all input operands |
| are read, then the operations are performed, then all the results are written. |
| For example, this packet performs a swap of registers r0 and r1 |
| { r0 = r1; r1 = r0 } |
| Note that the result is different if the instructions are executed serially. |
| |
| Packet semantics dictate that we defer any changes of state until the entire |
| packet is committed. We record the results of each instruction in a side data |
| structure, and update the visible processor state when we commit the packet. |
| |
| The data structures are divided between the runtime state and the translation |
| context. |
| |
| During the TCG generation (see translate.[ch]), we use the DisasContext to |
| track what needs to be done during packet commit. Here are the relevant |
| fields |
| |
| reg_log list of registers written |
| reg_log_idx index into reg_log |
| pred_log list of predicates written |
| pred_log_idx index into pred_log |
| store_width width of stores (indexed by slot) |
| |
| During runtime, the following fields in CPUHexagonState (see cpu.h) are used |
| |
| new_value new value of a given register |
| reg_written boolean indicating if register was written |
| new_pred_value new value of a predicate register |
| pred_written boolean indicating if predicate was written |
| mem_log_stores record of the stores (indexed by slot) |
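The deferred-write scheme for general registers can be modeled in a few lines of C. This is a simplified sketch, not the QEMU implementation: ToyState, TOY_NUM_REGS and the toy_* functions are hypothetical, only new_value and reg_written mirror the field names listed above, and predicates and stores are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_REGS 4

/* Hypothetical, cut-down runtime state. */
typedef struct {
    uint32_t gpr[TOY_NUM_REGS];       /* architecturally visible registers */
    uint32_t new_value[TOY_NUM_REGS]; /* pending result for each register */
    bool reg_written[TOY_NUM_REGS];   /* true if the packet wrote the register */
} ToyState;

/* Record a register write without touching the visible state. */
void toy_log_reg_write(ToyState *env, int rnum, uint32_t val)
{
    env->new_value[rnum] = val;
    env->reg_written[rnum] = true;
}

/* Apply all logged writes at once, then clear the log. */
void toy_commit_packet(ToyState *env)
{
    for (int i = 0; i < TOY_NUM_REGS; i++) {
        if (env->reg_written[i]) {
            env->gpr[i] = env->new_value[i];
            env->reg_written[i] = false;
        }
    }
}

/* { r0 = r1; r1 = r0 }: both reads see the values from before the packet. */
void toy_swap_packet(ToyState *env)
{
    toy_log_reg_write(env, 0, env->gpr[1]);
    toy_log_reg_write(env, 1, env->gpr[0]);
    toy_commit_packet(env);
}
```

Because neither instruction modifies gpr before the commit, the packet swaps r0 and r1. Writing gpr directly in program order would leave both registers holding the old value of r1.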
| |
| For Hexagon Vector eXtensions (HVX), the following fields are used |
| VRegs Vector registers |
| future_VRegs Registers to be stored during packet commit |
| tmp_VRegs Temporary registers *not* stored during commit |
| VRegs_updated Mask of predicated vector writes |
| QRegs Q (vector predicate) registers |
| future_QRegs Registers to be stored during packet commit |
| QRegs_updated Mask of predicated vector writes |
| |
| *** Debugging *** |
| |
| You can turn on a lot of debugging by changing the HEX_DEBUG macro to 1 in |
| internal.h. This will stream a lot of information as it generates TCG and |
| executes the code. |
| |
| To track down nasty issues with Hexagon->TCG generation, we compare the |
| execution results with actual hardware running on a Hexagon Linux target. |
| Run qemu with the "-d cpu" option. Then, we can diff the results and figure |
| out where qemu and hardware behave differently. |
| |
| The stacks are located at different locations. We handle this by changing |
| env->stack_adjust in translate.c. First, set this to zero and run qemu. |
| Then, change env->stack_adjust to the difference between the two stack |
| locations. Then rebuild qemu and run again. That will produce a very |
| clean diff. |
| |
| Here are some handy places to set breakpoints |
| |
| At the call to gen_start_packet for a given PC (note that the line number |
| might change in the future) |
| br translate.c:602 if ctx->base.pc_next == 0xdeadbeef |
| The helper function for each instruction is named helper_<TAG>, so here's |
| an example that will set a breakpoint at the start |
| br helper_A2_add |
| If you have the HEX_DEBUG macro set, the following will be useful |
| At the start of execution of a packet for a given PC |
| br helper_debug_start_packet if env->gpr[41] == 0xdeadbeef |
| At the end of execution of a packet for a given PC |
| br helper_debug_commit_end if env->this_PC == 0xdeadbeef |