doc/opal-api/opal-handle-hmi-98-166.rst - skiboot - Git at Google

 Hypervisor Maintenance Interrupt (HMI)
 ======================================

 Hypervisor Maintenance Interrupt usually reports error related to processor
 recovery/checkstop, NX/NPU checkstop and Timer facility. Hypervisor then
 takes this opportunity to analyze and recover from some of these errors.
 Hypervisor takes assistance from OPAL layer to handle and recover from HMI.
 After handling HMI, OPAL layer sends the summary of error report and status
 of recovery action using HMI event. See ref:`opal-messages` for HMI
 event structure under :ref:`OPAL_MSG_HMI_EVT` section.

 HMI is thread specific. The reason for HMI is available in a per thread
 Hypervisor Maintenance Exception Register (HMER). A Hypervisor Maintenance
 Exception Enable Register (HMEER) is per core. Bits from the HMER need to
 be enabled by the corresponding bits in the HMEER in order to cause an HMI.

 Several interrupt reasons are routed in parallel to each of the thread
 specific copies. Each thread can only clear bits in its own HMER. OPAL
 handler from each thread clears the respective bit from HMER register
 after handling the error.

 List of errors that causes HMI
 ==============================

  - CPU Errors

    - Processor Core checkstop
    - Processor retry recovery
    - NX/NPU/CAPP checkstop.

  - Timer facility Errors

    - ChipTOD Errors

     - ChipTOD sync check and parity errors
     - ChipTOD configuration register parity errors
     - ChiTOD topology failover

  - Timebase (TB) errors

     - TB parity/residue error
     - TFMR parity and firmware control error
     - DEC/HDEC/PURR/SPURR parity errors

 HMI handling
 ============

 A core/NX/NPU checkstops are reported as malfunction alert (HMER bit 0).
 OPAL handler scans through Fault Isolation Register (FIR) for each
 core/nx/npu to detect the exact reason for checkstop and reports it back
 to the host alongwith the disposition.

 A processor recovery is reported through HMER bits 2, 3 and 11. These are
 just an informational messages and no extra recovery is required.

 Timer facility errors are reported through HMER bit 4. These are all
 recoverable errors. The exact reason for the errors are stored in
 Timer Facility Management Register (TFMR). Some of the Timer facility
 errors affects TB and some of them affects TOD. TOD is a per chip
 Time-Of-Day logic that holds the actual time value of the chip and
 communicates with every TOD in the system to achieve synchronized
 timer value within a system. TB is per core register (64-bit) derives its
 value from ChipTOD at startup and then it gets periodically incremented
 by STEP signal provided by the TOD. In a multi-socket system TODs are
 always configured as master/backup TOD under primary/secondary
 topology configuration respectively.

 TB error generates HMI on all threads of the affected core. TB errors
 except DEC/HDEC/PURR/SPURR parity errors, causes TB to stop running
 making it invalid. As part of TB recovery, OPAL hmi handler synchronizes
 with all threads, clears the TB errors and then re-sync the TB with TOD
 value putting it back in running state.

 TOD errors generates HMI on every core/thread of affected chip. The reason
 for TOD errors are stored in TOD ERROR register (0x40030). As part of the
 recovery OPAL hmi handler clears the TOD error and then requests new TOD
 value from another running chipTOD in the system. Sometimes, if a primary
 chipTOD is in error, it may need a TOD topology switch to recover from
 error. A TOD topology switch basically makes a backup as new active master.

 .. _OPAL_HANDLE_HMI:

 OPAL_HANDLE_HMI
 ===============

 .. code-block:: c

    #define OPAL_HANDLE_HMI	98

    int64_t opal_handle_hmi(void);


 Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
 should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.

 Since :ref:`OPAL_HANDLE_HMI2` has been available since the start of POWER9
 systems being supported, if you only target POWER9 and above, you can
 assume the presence of :ref:`OPAL_HANDLE_HMI2`.

 .. _OPAL_HANDLE_HMI2:

 OPAL_HANDLE_HMI2
 ================

 .. code-block:: c

    #define OPAL_HANDLE_HMI2	166

    int64_t opal_handle_hmi2(__be64 *out_flags);

 When OS host gets an Hypervisor Maintenance Interrupt (HMI), it must call
 :ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. The :ref:`OPAL_HANDLE_HMI`
 is an old interface. :ref:`OPAL_HANDLE_HMI2` is newly introduced opal call
 that returns direct info to the OS. It returns a 64-bit flag mask currently
 set to provide info about which timer facilities were lost, and whether an
 event was generated. This information will help OS to take respective
 actions.

 In case where opal hmi handler is unable to recover from TOD or TB errors,
 it would flag ``OPAL_HMI_FLAGS_TOD_TB_FAIL`` to indicate OS that TB is
 dead. This information then can be used by OS to make sure that the
 functions relying on TB value (e.g. udelay()) are aware of TB not ticking.
 This will avoid OS getting stuck or hang during its way to panic path.


 Parameters
 ^^^^^^^^^^

 .. code-block:: c

    __be64 *out_flags;

 Returns the 64-bit flag mask that provides info about which timer facilities
 were lost, and whether an event was generated.

 .. code-block:: c

    /* OPAL_HANDLE_HMI2 out_flags */
    enum {
         OPAL_HMI_FLAGS_TB_RESYNC        = (1ull << 0), /* Timebase has been resynced */
         OPAL_HMI_FLAGS_DEC_LOST         = (1ull << 1), /* DEC lost, needs to be reprogrammed */
         OPAL_HMI_FLAGS_HDEC_LOST        = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
         OPAL_HMI_FLAGS_TOD_TB_FAIL      = (1ull << 3), /* TOD/TB recovery failed. */
         OPAL_HMI_FLAGS_NEW_EVENT        = (1ull << 63), /* An event has been created */
    };

 .. _OPAL_HMI_FLAGS_TOD_TB_FAIL:

 OPAL_HMI_FLAGS_TOD_TB_FAIL
   The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
   for the OS, and requires the OS to be very careful to not call any function
   that may rely on it, usually as it heads down a `panic()` code path.
   This code path should be :ref:`OPAL_CEC_REBOOT2` with the OPAL_REBOOT_PLATFORM_ERROR
   option. Details of the failure are likely delivered as part of HMI events if
   `OPAL_HMI_FLAGS_NEW_EVENT` is set.
	Hypervisor Maintenance Interrupt (HMI)
	======================================

	Hypervisor Maintenance Interrupt usually reports error related to processor
	recovery/checkstop, NX/NPU checkstop and Timer facility. Hypervisor then
	takes this opportunity to analyze and recover from some of these errors.
	Hypervisor takes assistance from OPAL layer to handle and recover from HMI.
	After handling HMI, OPAL layer sends the summary of error report and status
	of recovery action using HMI event. See ref:`opal-messages` for HMI
	event structure under :ref:`OPAL_MSG_HMI_EVT` section.

	HMI is thread specific. The reason for HMI is available in a per thread
	Hypervisor Maintenance Exception Register (HMER). A Hypervisor Maintenance
	Exception Enable Register (HMEER) is per core. Bits from the HMER need to
	be enabled by the corresponding bits in the HMEER in order to cause an HMI.

	Several interrupt reasons are routed in parallel to each of the thread
	specific copies. Each thread can only clear bits in its own HMER. OPAL
	handler from each thread clears the respective bit from HMER register
	after handling the error.

	List of errors that causes HMI
	==============================

	- CPU Errors

	- Processor Core checkstop
	- Processor retry recovery
	- NX/NPU/CAPP checkstop.

	- Timer facility Errors

	- ChipTOD Errors

	- ChipTOD sync check and parity errors
	- ChipTOD configuration register parity errors
	- ChiTOD topology failover

	- Timebase (TB) errors

	- TB parity/residue error
	- TFMR parity and firmware control error
	- DEC/HDEC/PURR/SPURR parity errors

	HMI handling
	============

	A core/NX/NPU checkstops are reported as malfunction alert (HMER bit 0).
	OPAL handler scans through Fault Isolation Register (FIR) for each
	core/nx/npu to detect the exact reason for checkstop and reports it back
	to the host alongwith the disposition.

	A processor recovery is reported through HMER bits 2, 3 and 11. These are
	just an informational messages and no extra recovery is required.

	Timer facility errors are reported through HMER bit 4. These are all
	recoverable errors. The exact reason for the errors are stored in
	Timer Facility Management Register (TFMR). Some of the Timer facility
	errors affects TB and some of them affects TOD. TOD is a per chip
	Time-Of-Day logic that holds the actual time value of the chip and
	communicates with every TOD in the system to achieve synchronized
	timer value within a system. TB is per core register (64-bit) derives its
	value from ChipTOD at startup and then it gets periodically incremented
	by STEP signal provided by the TOD. In a multi-socket system TODs are
	always configured as master/backup TOD under primary/secondary
	topology configuration respectively.

	TB error generates HMI on all threads of the affected core. TB errors
	except DEC/HDEC/PURR/SPURR parity errors, causes TB to stop running
	making it invalid. As part of TB recovery, OPAL hmi handler synchronizes
	with all threads, clears the TB errors and then re-sync the TB with TOD
	value putting it back in running state.

	TOD errors generates HMI on every core/thread of affected chip. The reason
	for TOD errors are stored in TOD ERROR register (0x40030). As part of the
	recovery OPAL hmi handler clears the TOD error and then requests new TOD
	value from another running chipTOD in the system. Sometimes, if a primary
	chipTOD is in error, it may need a TOD topology switch to recover from
	error. A TOD topology switch basically makes a backup as new active master.

	.. _OPAL_HANDLE_HMI:

	OPAL_HANDLE_HMI
	===============

	.. code-block:: c

	#define OPAL_HANDLE_HMI 98

	int64_t opal_handle_hmi(void);


	Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
	should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.

	Since :ref:`OPAL_HANDLE_HMI2` has been available since the start of POWER9
	systems being supported, if you only target POWER9 and above, you can
	assume the presence of :ref:`OPAL_HANDLE_HMI2`.

	.. _OPAL_HANDLE_HMI2:

	OPAL_HANDLE_HMI2
	================

	.. code-block:: c

	#define OPAL_HANDLE_HMI2 166

	int64_t opal_handle_hmi2(__be64 *out_flags);

	When OS host gets an Hypervisor Maintenance Interrupt (HMI), it must call
	:ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. The :ref:`OPAL_HANDLE_HMI`
	is an old interface. :ref:`OPAL_HANDLE_HMI2` is newly introduced opal call
	that returns direct info to the OS. It returns a 64-bit flag mask currently
	set to provide info about which timer facilities were lost, and whether an
	event was generated. This information will help OS to take respective
	actions.

	In case where opal hmi handler is unable to recover from TOD or TB errors,
	it would flag ``OPAL_HMI_FLAGS_TOD_TB_FAIL`` to indicate OS that TB is
	dead. This information then can be used by OS to make sure that the
	functions relying on TB value (e.g. udelay()) are aware of TB not ticking.
	This will avoid OS getting stuck or hang during its way to panic path.


	Parameters
	^^^^^^^^^^

	.. code-block:: c

	__be64 *out_flags;

	Returns the 64-bit flag mask that provides info about which timer facilities
	were lost, and whether an event was generated.

	.. code-block:: c

	/* OPAL_HANDLE_HMI2 out_flags */
	enum {
	OPAL_HMI_FLAGS_TB_RESYNC = (1ull << 0), /* Timebase has been resynced */
	OPAL_HMI_FLAGS_DEC_LOST = (1ull << 1), /* DEC lost, needs to be reprogrammed */
	OPAL_HMI_FLAGS_HDEC_LOST = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
	OPAL_HMI_FLAGS_TOD_TB_FAIL = (1ull << 3), /* TOD/TB recovery failed. */
	OPAL_HMI_FLAGS_NEW_EVENT = (1ull << 63), /* An event has been created */
	};

	.. _OPAL_HMI_FLAGS_TOD_TB_FAIL:

	OPAL_HMI_FLAGS_TOD_TB_FAIL
	The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
	for the OS, and requires the OS to be very careful to not call any function
	that may rely on it, usually as it heads down a `panic()` code path.
	This code path should be :ref:`OPAL_CEC_REBOOT2` with the OPAL_REBOOT_PLATFORM_ERROR
	option. Details of the failure are likely delivered as part of HMI events if
	`OPAL_HMI_FLAGS_NEW_EVENT` is set.