PCI
===

Debugging
---------

There are a couple of NVRAM options that enable extra debug functionality
to help debug PCI issues. These are not ABI and may be changed or removed
at **any** time.

Verbose EEH
^^^^^^^^^^^

::

   nvram -p ibm,skiboot --update-config pci-eeh-verbose=true

Disable EEH MMIO
^^^^^^^^^^^^^^^^

::

   nvram -p ibm,skiboot --update-config pci-eeh-mmio=disabled


Check for RX errors after link training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some PHB4 PHYs can get stuck in a bad state where they are constantly
retraining the link. This happens transparently to skiboot and Linux
but will cause PCIe to be slow. Resetting the PHB4 clears the
problem.

We can detect this case by looking at the RX error count where we
check for link stability. Skiboot does this by extending the link
optimality check to also look at RX errors. If errors are occurring,
we retrain the link irrespective of the chip revision or card.

Normally when this problem occurs, the RX error count is maxed out at
255. When there is no problem, the count is 0. We chose 8 as the
maximum RX error value to give some margin for a few stray errors.
There is also a knob that can be used to set the error threshold at
which we retrain the link, e.g. ::

   nvram -p ibm,skiboot --update-config phb-rx-err-max=8
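
As a minimal illustration of the check described above (hypothetical
names, not skiboot's actual code), the decision reduces to comparing the
sampled RX error count against the configured threshold: ::

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Default threshold; overridable via the phb-rx-err-max NVRAM option. */
   #define PHB_RX_ERR_MAX 8

   /* A stuck PHY typically saturates the counter at 255, while a healthy
    * link reads 0, so anything above the threshold triggers a retrain. */
   static bool link_needs_retrain(uint32_t rx_errs, uint32_t rx_err_max)
   {
           return rx_errs > rx_err_max;
   }

   int main(void)
   {
           printf("healthy: %d\n", link_needs_retrain(0, PHB_RX_ERR_MAX));
           printf("stuck:   %d\n", link_needs_retrain(255, PHB_RX_ERR_MAX));
           return 0;
   }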

Retrain link if degraded
^^^^^^^^^^^^^^^^^^^^^^^^

On P9 Scale Out (Nimbus) DD2.0 and Scale Up (Cumulus) DD1.0 (and
below) the PCIe PHY can lock up, causing training issues. This can cause
a degradation in speed or width in ~5% of training cases (depending on
the card). The issue is fixed in later chip revisions. It can also
cause PCIe links to not train at all, but that case is already
handled.

There is code in skiboot that checks whether the PCIe link has trained
optimally and, if not, does a full PHB reset (to fix the PHY lockup) and
retrains.

One complication is that some devices are known to train degraded unless
device-specific configuration is performed. Because of this, we only
retrain when the device is in a whitelist. All devices in the current
whitelist have been tested on a P9DSU/Boston, ZZ and Witherspoon.

We always gather information on the link and print it in the logs, even
if the card is not in the whitelist.

For testing purposes, there is an NVRAM option to retrain all PCIe cards
on all P9 chips when a degraded link is detected (a sketch of the
decision flow follows below). The option is 'pci-retry-all=true' and can
be set using ::

   nvram -p ibm,skiboot --update-config pci-retry-all=true

This option may increase the boot time if used on a badly behaving
card.
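
A compilable sketch of the retrain decision described above, using
hypothetical names and a made-up whitelist entry (the real whitelist and
link state live in skiboot's PHB4 code): ::

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical view of the trained vs. expected link state. */
   struct link_state {
           uint8_t  speed, target_speed;   /* GEN */
           uint8_t  width, target_width;   /* lanes */
           uint32_t vdid;                  /* PCI vendor/device ID */
   };

   static bool pci_retry_all;              /* pci-retry-all NVRAM option */

   /* Illustrative whitelist: the real one lists devices tested on
    * P9DSU/Boston, ZZ and Witherspoon. */
   static bool in_whitelist(uint32_t vdid)
   {
           static const uint32_t whitelist[] = { 0x10de0000 /* example */ };

           for (size_t i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++)
                   if (whitelist[i] == vdid)
                           return true;
           return false;
   }

   /* A degraded link gets a full PHB reset and retrain, but only for
    * whitelisted devices, unless pci-retry-all is set. */
   static bool should_retrain(const struct link_state *l)
   {
           bool degraded = l->speed < l->target_speed ||
                           l->width < l->target_width;

           return degraded && (pci_retry_all || in_whitelist(l->vdid));
   }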

Maximum link speed
^^^^^^^^^^^^^^^^^^

Caps the speed (GEN) at which PCIe links are allowed to train. This was
useful during bring-up on P9 DD1. ::

   nvram -p ibm,skiboot --update-config pcie-max-link-speed=4


Ric Mata Mode
^^^^^^^^^^^^^

This mode (for PHB4) traces the training process closely. It activates
as soon as PERST is deasserted and produces human-readable output of
the process.

It also adds PCIe Link Training and Status State Machine (LTSSM)
tracing and details on speed and link width.

Output looks a bit like this ::

  [ 1.096995141,3] PHB#0000[0:0]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
  [ 1.102849137,3] PHB#0000[0:0]: TRACE:0x0000102101000000 11ms presence GEN1:x16:polling
  [ 1.104341838,3] PHB#0000[0:0]: TRACE:0x0000182101000000 14ms training GEN1:x16:polling
  [ 1.104357444,3] PHB#0000[0:0]: TRACE:0x00001c5101000000 14ms training GEN1:x16:recovery
  [ 1.104580394,3] PHB#0000[0:0]: TRACE:0x00001c5103000000 14ms training GEN3:x16:recovery
  [ 1.123259359,3] PHB#0000[0:0]: TRACE:0x00001c5104000000 51ms training GEN4:x16:recovery
  [ 1.141737656,3] PHB#0000[0:0]: TRACE:0x0000144104000000 87ms presence GEN4:x16:L0
  [ 1.141752318,3] PHB#0000[0:0]: TRACE:0x0000154904000000 87ms trained GEN4:x16:L0
  [ 1.141757964,3] PHB#0000[0:0]: TRACE: Link trained.
  [ 1.096834019,3] PHB#0001[0:1]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
  [ 1.105578525,3] PHB#0001[0:1]: TRACE:0x0000102101000000 17ms presence GEN1:x16:polling
  [ 1.112763075,3] PHB#0001[0:1]: TRACE:0x0000183101000000 31ms training GEN1:x16:config
  [ 1.112778956,3] PHB#0001[0:1]: TRACE:0x00001c5081000000 31ms training GEN1:x08:recovery
  [ 1.113002083,3] PHB#0001[0:1]: TRACE:0x00001c5083000000 31ms training GEN3:x08:recovery
  [ 1.114833873,3] PHB#0001[0:1]: TRACE:0x0000144083000000 35ms presence GEN3:x08:L0
  [ 1.114848832,3] PHB#0001[0:1]: TRACE:0x0000154883000000 35ms trained GEN3:x08:L0
  [ 1.114854650,3] PHB#0001[0:1]: TRACE: Link trained.

Enabled via NVRAM ::

   nvram -p ibm,skiboot --update-config pci-tracing=true

Named after the person the output of this mode is typically sent to.


**WARNING**: The documentation below **urgently needs updating** and is *woefully* incomplete.

IODA PE Setup Sequences
-----------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To setup basic PE mappings, the host performs this sequence (a sketch
in C follows the description):

For ibm,opal-ioda2, prior to allocating PHB resources to PEs, the host
must allocate memory for PE structures and then call
``opal_pci_set_phb_table_memory(phb_id, rtt_addr, ivt_addr, ivt_len, rrba_addr, peltv_addr)``
to define them to the PHB. OPAL returns ``OPAL_UNSUPPORTED`` status for
``ibm,opal-ioda`` PHBs.

The host calls
``opal_pci_set_pe(phb_id, pe_number, bus, dev, func, validate_mask, bus_mask, dev_mask, func_mask)``
to map a PE to a PCI RID or range of RIDs in the same PE domain.

The host calls ``opal_pci_set_peltv(phb_id, parent_pe, child_pe, state)`` to
set a parent PELT vector bit for the child PE argument to 1 (a child of the
parent) or 0 (not in the parent PE domain).
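
For illustration, here is the sequence above transcribed into C. The
prototypes are hypothetical stubs written to match the signatures quoted
in this document (the real OPAL headers may differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_set_phb_table_memory(uint64_t phb_id, uint64_t rtt_addr,
                                         uint64_t ivt_addr, uint64_t ivt_len,
                                         uint64_t rrba_addr, uint64_t peltv_addr);
   int64_t opal_pci_set_pe(uint64_t phb_id, uint64_t pe_number,
                           uint64_t bus, uint64_t dev, uint64_t func,
                           uint64_t validate_mask, uint64_t bus_mask,
                           uint64_t dev_mask, uint64_t func_mask);
   int64_t opal_pci_set_peltv(uint64_t phb_id, uint32_t parent_pe,
                              uint32_t child_pe, uint8_t state);

   static void pe_setup_example(uint64_t phb_id, uint64_t rtt, uint64_t ivt,
                                uint64_t ivt_len, uint64_t rrba, uint64_t peltv)
   {
           /* ibm,opal-ioda2 only: define host memory for the PE structures. */
           opal_pci_set_phb_table_memory(phb_id, rtt, ivt, ivt_len, rrba, peltv);

           /* Map bus 1, all devices/functions, to PE 2 (mask values are
            * illustrative only). */
           opal_pci_set_pe(phb_id, 2, 1, 0, 0, 0x4, 0xff, 0x00, 0x00);

           /* Record PE 2 as a child of PE 1 in the parent's PELT vector. */
           opal_pci_set_peltv(phb_id, 1, 2, 1);
   }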

IODA MMIO Setup Sequences
-------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0)``
to disable the MMIO window.

The host calls
``opal_pci_set_phb_mmio_window(phb_id, mmio_window, starting_real_address, starting_pci_address, segment_size)``
to change the MMIO window location in PCI and/or processor real address
space, or to change the segment size (and corresponding window size) of a
particular MMIO window.

The host calls ``opal_pci_map_pe_mmio_window(pe_number, mmio_window, segment_number)``
to map PEs to window segments, for each segment mapped to each PE.

The host calls ``opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1)``
to enable the MMIO window. A sketch in C follows.
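
A hypothetical C transcription of this sequence, with stub prototypes
matching the signatures quoted above (the real OPAL headers may
differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_phb_mmio_enable(uint64_t phb_id, uint16_t window_type,
                                    uint16_t window_num, uint16_t enable);
   int64_t opal_pci_set_phb_mmio_window(uint64_t phb_id, uint16_t mmio_window,
                                        uint64_t starting_real_address,
                                        uint64_t starting_pci_address,
                                        uint64_t segment_size);
   int64_t opal_pci_map_pe_mmio_window(uint64_t pe_number, uint16_t mmio_window,
                                       uint16_t segment_number);

   static void mmio_window_example(uint64_t phb_id, uint16_t window_type,
                                   uint16_t window_num, uint64_t real_base,
                                   uint64_t pci_base, uint64_t seg_size)
   {
           /* 1. Disable the window before touching it. */
           opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x0);

           /* 2. (Re)position the window and set its segment size. */
           opal_pci_set_phb_mmio_window(phb_id, window_num, real_base,
                                        pci_base, seg_size);

           /* 3. Map segments to PEs; here segment 0 goes to PE 2. */
           opal_pci_map_pe_mmio_window(2, window_num, 0);

           /* 4. Re-enable the window. */
           opal_pci_phb_mmio_enable(phb_id, window_type, window_num, 0x1);
   }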

IODA MSI Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To setup MSIs (a sketch in C follows the list):

1. For ``ibm,opal-ioda`` PHBs, the host chooses an MVE for a PE to use and
   calls ``opal_pci_set_mve(phb_id, mve_number, pe_number)`` to setup the MVE
   for the PE number. OPAL treats this call as a no-op and returns
   ``OPAL_SUCCESS`` status for ``ibm,opal-ioda2`` PHBs.
2. The host chooses an XIVE to use with a PE and calls:

   a. ``opal_pci_set_xive_pe(phb_id, xive_number, pe_number)`` to authorize
      that PE to signal that XIVE as an interrupt. The host must call this
      function for each XIVE assigned to a particular PE, but may use this
      call for all XIVEs prior to calling ``opal_pci_set_mve()`` to bind the
      PE XIVEs to an MVE. For MSI conventional, the host must bind a unique
      MVE for each sequential set of 32 XIVEs.
   b. The host forms the interrupt_source_number from the combination of the
      device tree MSI property base BUID and XIVE number, as an input to
      ``opal_set_xive(interrupt_source_number, server_number, priority)`` and
      ``opal_get_xive(interrupt_source_number, server_number, priority)`` to
      set or return the server and priority numbers within an XIVE.
   c. ``opal_get_msi_64[32](phb_id, mve_number, xive_num, msi_range, msi_address, message_data)``
      to determine the MSI DMA address (32 or 64 bit) and message data value
      for that XIVE.

      For MSI conventional, the host uses this for each sequential power of 2
      set of 1 to 32 MSIs, to determine the MSI DMA address and starting
      message data value for that MSI range. For MSI-X, the host calls this
      uniquely for each MSI interrupt with an msi_range input value of 1.
3. For ``ibm,opal-ioda`` PHBs, once the MVE and XIVRs are setup for a PE, the
   host calls ``opal_pci_set_mve_enable(phb_id, mve_number, state)`` to enable
   that MVE to be a valid target of MSI DMAs. The host may also call this
   function to disable an MVE when changing PE domains or states.
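
A hypothetical C transcription of the MSI-X flavour of this sequence,
with stub prototypes matching the signatures quoted above (the real
OPAL headers may differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_set_mve(uint64_t phb_id, uint32_t mve_number,
                            uint32_t pe_number);
   int64_t opal_pci_set_xive_pe(uint64_t phb_id, uint32_t xive_number,
                                uint32_t pe_number);
   int64_t opal_get_msi_64(uint64_t phb_id, uint32_t mve_number,
                           uint32_t xive_num, uint8_t msi_range,
                           uint64_t *msi_address, uint32_t *message_data);
   int64_t opal_pci_set_mve_enable(uint64_t phb_id, uint32_t mve_number,
                                   uint32_t state);

   static void msix_setup_example(uint64_t phb_id, uint32_t pe, uint32_t mve,
                                  uint32_t xive)
   {
           uint64_t msi_addr;
           uint32_t msi_data;

           /* 1. ibm,opal-ioda only: bind the MVE to the PE. */
           opal_pci_set_mve(phb_id, mve, pe);

           /* 2a. Authorize the PE to signal this XIVE. */
           opal_pci_set_xive_pe(phb_id, xive, pe);

           /* 2c. MSI-X: one call per interrupt, msi_range = 1. */
           opal_get_msi_64(phb_id, mve, xive, 1, &msi_addr, &msi_data);

           /* 3. ibm,opal-ioda only: enable the MVE as an MSI DMA target. */
           opal_pci_set_mve_enable(phb_id, mve, 1);
   }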

IODA DMA Setup Sequences
------------------------

(**WARNING**: this was rescued from old internal documentation. Needs verification)

To manage DMA windows (a sketch in C follows the list):

1. The host calls
   ``opal_pci_map_pe_dma_window(phb_id, dma_window_number, pe_number, tce_levels, tce_table_addr, tce_table_size, tce_page_size, uint64_t* pci_start_addr)``
   to setup a DMA window for a PE to translate through a TCE table structure
   in KVM memory.
2. The host calls
   ``opal_pci_map_pe_dma_window_real(phb_id, dma_window_number, pe_number, mem_low_addr, mem_high_addr)``
   to setup a DMA window for a PE that is not translated (but is validated by
   the PHB as an untranslated address space authorized to this PE).
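
A hypothetical C transcription of both calls, with stub prototypes
matching the signatures quoted above (the real OPAL headers may
differ): ::

   #include <stdint.h>

   /* Stub prototypes following the signatures quoted above. */
   int64_t opal_pci_map_pe_dma_window(uint64_t phb_id,
                                      uint16_t dma_window_number,
                                      uint16_t pe_number, uint16_t tce_levels,
                                      uint64_t tce_table_addr,
                                      uint64_t tce_table_size,
                                      uint64_t tce_page_size,
                                      uint64_t *pci_start_addr);
   int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id,
                                           uint16_t dma_window_number,
                                           uint16_t pe_number,
                                           uint64_t mem_low_addr,
                                           uint64_t mem_high_addr);

   static void dma_window_example(uint64_t phb_id, uint16_t pe,
                                  uint64_t tce_table, uint64_t tce_size)
   {
           uint64_t pci_start;

           /* Translated window 0: one TCE level, 4K TCE pages. */
           opal_pci_map_pe_dma_window(phb_id, 0, pe, 1, tce_table, tce_size,
                                      0x1000, &pci_start);

           /* Window 1: untranslated, but validated by the PHB against the
            * address range authorized to this PE (illustrative range). */
           opal_pci_map_pe_dma_window_real(phb_id, 1, pe,
                                           0x0800000000000000ULL,
                                           0x0800000fffffffffULL);
   }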

Device Tree Bindings
--------------------

See :doc:`device-tree/pci` for device tree information.