PCI
===

Debugging
---------

There are a couple of NVRAM options that enable extra debug functionality
to help debug PCI issues. These are not ABI and may be changed or removed
at **any** time.

Verbose EEH
^^^^^^^^^^^

::

   nvram -p ibm,skiboot --update-config pci-eeh-verbose=true

Disable EEH MMIO
^^^^^^^^^^^^^^^^

::

   nvram -p ibm,skiboot --update-config pci-eeh-mmio=disabled


Check for RX errors after link training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some PHB4 PHYs can get stuck in a bad state where they are constantly
retraining the link. This happens transparently to skiboot and Linux
but will cause PCIe to be slow. Resetting the PHB4 clears the
problem.

We can detect this case by looking at the RX error count where we
check for link stability. Skiboot does this by extending the link
optimality check to also look at RX errors. If errors are occurring,
we retrain the link irrespective of the chip revision or card.

Normally when this problem occurs, the RX error count is maxed out at
255. When there is no problem, the count is 0. We chose 8 as the
maximum RX error value to give some margin for a few stray errors.
There is also a knob that can be used to set the error threshold at
which we retrain the link, e.g. ::

   nvram -p ibm,skiboot --update-config phb-rx-err-max=8
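
As a minimal illustration of the check described above (hypothetical
names, not skiboot's actual code), the decision reduces to comparing the
sampled RX error count against the configured threshold: ::

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Default threshold; overridable via the phb-rx-err-max NVRAM option. */
   #define PHB_RX_ERR_MAX 8

   /* A stuck PHY typically saturates the counter at 255, while a healthy
    * link reads 0, so anything above the threshold triggers a retrain. */
   static bool link_needs_retrain(uint32_t rx_errs, uint32_t rx_err_max)
   {
           return rx_errs > rx_err_max;
   }

   int main(void)
   {
           printf("healthy: %d\n", link_needs_retrain(0, PHB_RX_ERR_MAX));
           printf("stuck:   %d\n", link_needs_retrain(255, PHB_RX_ERR_MAX));
           return 0;
   }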

Retrain link if degraded
^^^^^^^^^^^^^^^^^^^^^^^^

On P9 Scale Out (Nimbus) DD2.0 and Scale Up (Cumulus) DD1.0 (and
below) the PCIe PHY can lock up, causing training issues. This can cause
a degradation in speed or width in ~5% of training cases (depending on
the card). The issue is fixed in later chip revisions. It can also
cause PCIe links to not train at all, but that case is already
handled.

There is code in skiboot that checks whether the PCIe link has trained
optimally and, if not, does a full PHB reset (to fix the PHY lockup) and
retrains.

One complication is that some devices are known to train degraded unless
device-specific configuration is performed. Because of this, we only
retrain when the device is in a whitelist. All devices in the current
whitelist have been tested on a P9DSU/Boston, ZZ and Witherspoon.

We always gather information on the link and print it in the logs, even
if the card is not in the whitelist.

For testing purposes, there is an NVRAM option to retrain all PCIe cards
on all P9 chips when a degraded link is detected (a sketch of the
decision flow follows below). The option is 'pci-retry-all=true' and can
be set using ::

   nvram -p ibm,skiboot --update-config pci-retry-all=true

This option may increase the boot time if used on a badly behaving
card.
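
A compilable sketch of the retrain decision described above, using
hypothetical names and a made-up whitelist entry (the real whitelist and
link state live in skiboot's PHB4 code): ::

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical view of the trained vs. expected link state. */
   struct link_state {
           uint8_t  speed, target_speed;   /* GEN */
           uint8_t  width, target_width;   /* lanes */
           uint32_t vdid;                  /* PCI vendor/device ID */
   };

   static bool pci_retry_all;              /* pci-retry-all NVRAM option */

   /* Illustrative whitelist: the real one lists devices tested on
    * P9DSU/Boston, ZZ and Witherspoon. */
   static bool in_whitelist(uint32_t vdid)
   {
           static const uint32_t whitelist[] = { 0x10de0000 /* example */ };

           for (size_t i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++)
                   if (whitelist[i] == vdid)
                           return true;
           return false;
   }

   /* A degraded link gets a full PHB reset and retrain, but only for
    * whitelisted devices, unless pci-retry-all is set. */
   static bool should_retrain(const struct link_state *l)
   {
           bool degraded = l->speed < l->target_speed ||
                           l->width < l->target_width;

           return degraded && (pci_retry_all || in_whitelist(l->vdid));
   }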

Maximum link speed
^^^^^^^^^^^^^^^^^^

Caps the speed (GEN) at which PCIe links are allowed to train. This was
useful during bring-up on P9 DD1. ::

   nvram -p ibm,skiboot --update-config pcie-max-link-speed=4


Ric Mata Mode
^^^^^^^^^^^^^

This mode (for PHB4) traces the training process closely. It activates
as soon as PERST is deasserted and produces human-readable output of
the process.

It also adds PCIe Link Training and Status State Machine (LTSSM)
tracing and details on speed and link width.

Output looks a bit like this ::

  [ 1.096995141,3] PHB#0000[0:0]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
  [ 1.102849137,3] PHB#0000[0:0]: TRACE:0x0000102101000000 11ms presence GEN1:x16:polling
  [ 1.104341838,3] PHB#0000[0:0]: TRACE:0x0000182101000000 14ms training GEN1:x16:polling
  [ 1.104357444,3] PHB#0000[0:0]: TRACE:0x00001c5101000000 14ms training GEN1:x16:recovery
  [ 1.104580394,3] PHB#0000[0:0]: TRACE:0x00001c5103000000 14ms training GEN3:x16:recovery
  [ 1.123259359,3] PHB#0000[0:0]: TRACE:0x00001c5104000000 51ms training GEN4:x16:recovery
  [ 1.141737656,3] PHB#0000[0:0]: TRACE:0x0000144104000000 87ms presence GEN4:x16:L0
  [ 1.141752318,3] PHB#0000[0:0]: TRACE:0x0000154904000000 87ms trained GEN4:x16:L0
  [ 1.141757964,3] PHB#0000[0:0]: TRACE: Link trained.
  [ 1.096834019,3] PHB#0001[0:1]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
  [ 1.105578525,3] PHB#0001[0:1]: TRACE:0x0000102101000000 17ms presence GEN1:x16:polling
  [ 1.112763075,3] PHB#0001[0:1]: TRACE:0x0000183101000000 31ms training GEN1:x16:config
  [ 1.112778956,3] PHB#0001[0:1]: TRACE:0x00001c5081000000 31ms training GEN1:x08:recovery
  [ 1.113002083,3] PHB#0001[0:1]: TRACE:0x00001c5083000000 31ms training GEN3:x08:recovery
  [ 1.114833873,3] PHB#0001[0:1]: TRACE:0x0000144083000000 35ms presence GEN3:x08:L0
  [ 1.114848832,3] PHB#0001[0:1]: TRACE:0x0000154883000000 35ms trained GEN3:x08:L0
  [ 1.114854650,3] PHB#0001[0:1]: TRACE: Link trained.

Enabled via NVRAM ::

   nvram -p ibm,skiboot --update-config pci-tracing=true

Named after the person the output of this mode is typically sent to.


**WARNING**: The documentation below **urgently needs updating** and is *woefully* incomplete.

IODA PE Setup Sequences
-----------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To setup basic PE mappings, the host performs this sequence (a sketch
in C follows the description):

For ibm,opal-ioda2, prior to allocating PHB resources to PEs, the host
must allocate memory for PE structures and then call
``opal_pci_set_phb_table_memory(phb_id, rtt_addr, ivt_addr, ivt_len, rrba_addr, peltv_addr)``
to define them to the PHB. OPAL returns ``OPAL_UNSUPPORTED`` status for
``ibm,opal-ioda`` PHBs.

The host calls
``opal_pci_set_pe(phb_id, pe_number, bus, dev, func, validate_mask, bus_mask, dev_mask, func_mask)``
to map a PE to a PCI RID or range of RIDs in the same PE domain.

The host calls ``opal_pci_set_peltv(phb_id, parent_pe, child_pe, state)`` to
set a parent PELT vector bit for the child PE argument to 1 (a child of the
parent) or 0 (not in the parent PE domain).
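
For illustration, here is the sequence above transcribed into C. The
prototypes are hypothetical stubs written to match the signatures quoted
in this document (the real OPAL headers may differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_set_phb_table_memory(uint64_t phb_id, uint64_t rtt_addr,
                                         uint64_t ivt_addr, uint64_t ivt_len,
                                         uint64_t rrba_addr, uint64_t peltv_addr);
   int64_t opal_pci_set_pe(uint64_t phb_id, uint64_t pe_number,
                           uint64_t bus, uint64_t dev, uint64_t func,
                           uint64_t validate_mask, uint64_t bus_mask,
                           uint64_t dev_mask, uint64_t func_mask);
   int64_t opal_pci_set_peltv(uint64_t phb_id, uint32_t parent_pe,
                              uint32_t child_pe, uint8_t state);

   static void pe_setup_example(uint64_t phb_id, uint64_t rtt, uint64_t ivt,
                                uint64_t ivt_len, uint64_t rrba, uint64_t peltv)
   {
           /* ibm,opal-ioda2 only: define host memory for the PE structures. */
           opal_pci_set_phb_table_memory(phb_id, rtt, ivt, ivt_len, rrba, peltv);

           /* Map bus 1, all devices/functions, to PE 2 (mask values are
            * illustrative only). */
           opal_pci_set_pe(phb_id, 2, 1, 0, 0, 0x4, 0xff, 0x00, 0x00);

           /* Record PE 2 as a child of PE 1 in the parent's PELT vector. */
           opal_pci_set_peltv(phb_id, 1, 2, 1);
   }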

IODA MMIO Setup Sequences
-------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0)``
to disable the MMIO window.

The host calls
``opal_pci_set_phb_mmio_window(phb_id, mmio_window, starting_real_address, starting_pci_address, segment_size)``
to change the MMIO window location in PCI and/or processor real address
space, or to change the segment size (and corresponding window size) of a
particular MMIO window.

The host calls ``opal_pci_map_pe_mmio_window(pe_number, mmio_window, segment_number)``
to map PEs to window segments, for each segment mapped to each PE.

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1)``
to enable the MMIO window. A sketch in C follows.
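
A hypothetical C transcription of this sequence, with stub prototypes
matching the signatures quoted above (the real OPAL headers may
differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_phb_mmio_enable(uint64_t phb_id, uint16_t window_type,
                                    uint16_t window_num, uint16_t enable);
   int64_t opal_pci_set_phb_mmio_window(uint64_t phb_id, uint16_t mmio_window,
                                        uint64_t starting_real_address,
                                        uint64_t starting_pci_address,
                                        uint64_t segment_size);
   int64_t opal_pci_map_pe_mmio_window(uint64_t pe_number, uint16_t mmio_window,
                                       uint16_t segment_number);

   static void mmio_window_example(uint64_t phb_id, uint16_t window_type,
                                   uint16_t window_num, uint64_t real_base,
                                   uint64_t pci_base, uint64_t seg_size)
   {
           /* 1. Disable the window before touching it. */
           opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0);

           /* 2. (Re)position the window and set its segment size. */
           opal_pci_set_phb_mmio_window(phb_id, window_num, real_base,
                                        pci_base, seg_size);

           /* 3. Map segments to PEs; here segment 0 goes to PE 2. */
           opal_pci_map_pe_mmio_window(2, window_num, 0);

           /* 4. Re-enable the window. */
           opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1);
   }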

IODA MSI Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To setup MSIs (a sketch in C follows the list):

1. For ``ibm,opal-ioda`` PHBs, the host chooses an MVE for a PE to use and
   calls ``opal_pci_set_mve(phb_id, mve_number, pe_number)`` to setup the MVE
   for the PE number. OPAL treats this call as a no-op and returns
   ``OPAL_SUCCESS`` status for ``ibm,opal-ioda2`` PHBs.
2. The host chooses an XIVE to use with a PE and calls:

   a. ``opal_pci_set_xive_pe(phb_id, xive_number, pe_number)`` to authorize
      that PE to signal that XIVE as an interrupt. The host must call this
      function for each XIVE assigned to a particular PE, but may use this
      call for all XIVEs prior to calling ``opal_pci_set_mve()`` to bind the
      PE XIVEs to an MVE. For MSI conventional, the host must bind a unique
      MVE for each sequential set of 32 XIVEs.
   b. The host forms the interrupt_source_number from the combination of the
      device tree MSI property base BUID and XIVE number, as an input to
      ``opal_set_xive(interrupt_source_number, server_number, priority)`` and
      ``opal_get_xive(interrupt_source_number, server_number, priority)`` to
      set or return the server and priority numbers within an XIVE.
   c. ``opal_get_msi_64[32](phb_id, mve_number, xive_num, msi_range, msi_address, message_data)``
      to determine the MSI DMA address (32 or 64 bit) and message data value
      for that XIVE.

      For MSI conventional, the host uses this for each sequential power of 2
      set of 1 to 32 MSIs, to determine the MSI DMA address and starting
      message data value for that MSI range. For MSI-X, the host calls this
      uniquely for each MSI interrupt with an msi_range input value of 1.
3. For ``ibm,opal-ioda`` PHBs, once the MVE and XIVRs are setup for a PE, the
   host calls ``opal_pci_set_mve_enable(phb_id, mve_number, state)`` to enable
   that MVE to be a valid target of MSI DMAs. The host may also call this
   function to disable an MVE when changing PE domains or states.
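
A hypothetical C transcription of the MSI-X flavour of this sequence,
with stub prototypes matching the signatures quoted above (the real
OPAL headers may differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_set_mve(uint64_t phb_id, uint32_t mve_number,
                            uint32_t pe_number);
   int64_t opal_pci_set_xive_pe(uint64_t phb_id, uint32_t xive_number,
                                uint32_t pe_number);
   int64_t opal_get_msi_64(uint64_t phb_id, uint32_t mve_number,
                           uint32_t xive_num, uint8_t msi_range,
                           uint64_t *msi_address, uint32_t *message_data);
   int64_t opal_pci_set_mve_enable(uint64_t phb_id, uint32_t mve_number,
                                   uint32_t state);

   static void msix_setup_example(uint64_t phb_id, uint32_t pe, uint32_t mve,
                                  uint32_t xive)
   {
           uint64_t msi_addr;
           uint32_t msi_data;

           /* 1. ibm,opal-ioda only: bind the MVE to the PE. */
           opal_pci_set_mve(phb_id, mve, pe);

           /* 2a. Authorize the PE to signal this XIVE. */
           opal_pci_set_xive_pe(phb_id, xive, pe);

           /* 2c. MSI-X: one call per interrupt, msi_range = 1. */
           opal_get_msi_64(phb_id, mve, xive, 1, &msi_addr, &msi_data);

           /* 3. ibm,opal-ioda only: enable the MVE as an MSI DMA target. */
           opal_pci_set_mve_enable(phb_id, mve, 1);
   }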

IODA DMA Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To manage DMA windows (a sketch in C follows the list):

1. The host calls
   ``opal_pci_map_pe_dma_window(phb_id, dma_window_number, pe_number, tce_levels, tce_table_addr, tce_table_size, tce_page_size, uint64_t* pci_start_addr)``
   to setup a DMA window for a PE to translate through a TCE table structure
   in KVM memory.
2. The host calls
   ``opal_pci_map_pe_dma_window_real(phb_id, dma_window_number, pe_number, mem_low_addr, mem_high_addr)``
   to setup a DMA window for a PE that is not translated (but is validated by
   the PHB as an untranslated address space authorized to this PE).
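
A hypothetical C transcription of both calls, with stub prototypes
matching the signatures quoted above (the real OPAL headers may
differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_map_pe_dma_window(uint64_t phb_id,
                                      uint16_t dma_window_number,
                                      uint16_t pe_number, uint16_t tce_levels,
                                      uint64_t tce_table_addr,
                                      uint64_t tce_table_size,
                                      uint64_t tce_page_size,
                                      uint64_t *pci_start_addr);
   int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id,
                                           uint16_t dma_window_number,
                                           uint16_t pe_number,
                                           uint64_t mem_low_addr,
                                           uint64_t mem_high_addr);

   static void dma_window_example(uint64_t phb_id, uint16_t pe,
                                  uint64_t tce_table, uint64_t tce_size)
   {
           uint64_t pci_start;

           /* Translated window 0: one TCE level, 4K TCE pages. */
           opal_pci_map_pe_dma_window(phb_id, 0, pe, 1, tce_table, tce_size,
                                      0x1000, &pci_start);

           /* Window 1: untranslated, but validated by the PHB against the
            * address range authorized to this PE (illustrative range). */
           opal_pci_map_pe_dma_window_real(phb_id, 1, pe,
                                           0x0800000000000000ULL,
                                           0x0800000fffffffffULL);
   }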

Device Tree Bindings
--------------------

See :doc:`device-tree/pci` for device tree information.