| .. _skiboot-6.3: |
| |
| skiboot-6.3 |
| =========== |
| |
| skiboot v6.3 was released on Friday May 3rd 2019. It is the first |
| release of skiboot 6.3, which becomes the new stable release |
| of skiboot following the 6.2 release, first released December 14th 2018. |
| |
| Skiboot 6.3 will mark the basis for op-build v2.3. |
| |
| skiboot v6.3 contains all bug fixes as of :ref:`skiboot-6.0.20`, |
| and :ref:`skiboot-6.2.3` (the currently maintained |
| stable releases). |
| |
| For details on how the skiboot stable releases work, see :ref:`stable-rules`. |
| |
| Over skiboot 6.2, we have the following changes: |
| |
| .. _skiboot-6.3-new-features: |
| |
| New Features |
| ------------ |
| |
| - hw/imc: Enable opal calls to init/start/stop IMC Trace mode |
| |
| New OPAL APIs for the In-Memory Collection Counter (IMC) infrastructure, |
| including a new device type called OPAL_IMC_COUNTERS_TRACE. |
| - xive: Add calls to save/restore the queues and VPs HW state |
| |
| To support migration of guests using the XIVE native |
| exploitation mode (where the queue is effectively owned by the |
| guest), KVM needs to be able to save and restore the HW-modified |
| fields of the queue, such as the current queue producer pointer and |
| generation bit, and to retrieve the modified thread context registers |
| of the VP from the NVT structure: the VP interrupt pending bits. |
| |
| However, there is no need to set back the NVT structure on P9. P10 |
| should be the same. |
| - witherspoon: Add nvlink2 interconnect information |
| |
| GPUs on Redbud and Sequoia platforms are interconnected in groups of |
| 2 or 3 GPUs. If the user decides to pass a single GPU from a group |
| through to userspace, we need to ensure that links between |
| GPUs do not get enabled. |
| |
| A V100 GPU provides a way to disable selected links. In order to only |
| disable links to peer GPUs, we need a topology map. |
| |
| This adds an "ibm,nvlink-peers" property to a GPU DT node with phandles |
| of peer GPUs and NVLink2 bridges. The index in the property is a GPU link |
| number. |
| - platforms/romulus: Also support talos |
| |
| The two are similar enough and I'd like to have a slot table for our |
| Talos. |
| - OpenCAPI support! (see :ref:`skiboot-6.3-OpenCAPI` section) |
| - opal/hmi: set a flag to inform OS that TOD/TB has failed. |
| |
| Set a flag to inform the OS of a TOD/TB failure as part of the new |
| opal_handle_hmi2 handler. The OS can then use this flag to make sure |
| functions depending on the TB value (e.g. udelay()) are aware that the |
| TB is not ticking. |
| - astbmc: Enable IPMI HIOMAP for AMI platforms |
| |
| Required for Habanero, Palmetto and Romulus. |
| - power-mgmt : occ : Add 'freq-domain-mask' DT property |
| |
| Add a new device-tree property, freq-domain-mask, to define the group of |
| CPUs that share the same frequency. This property is added under the |
| power-mgmt node. It is a bitmask. |
| |
| A bitwise AND is taken between this bitmask and the PIR of a CPU. All |
| CPUs in the same frequency domain yield the same result for the AND. |
| |
| For example, for POWER9, 0xFFF0 indicates a quad-wide frequency domain. |
| Taking the AND with the PIR of each CPU yields its frequency domain, |
| distributed per quad, because the last 4 bits, which represent the |
| cores, are masked out. |
| |
| Similarly, 0xFFF8 represents a core-wide frequency domain for P8. |
| |
| Also add a new device-tree property, domain-runs-at, which denotes the |
| strategy the OCC uses to change the frequency of a frequency domain. There |
| are two strategies: FREQ_MOST_RECENTLY_SET and FREQ_MAX_IN_DOMAIN. |
| |
| FREQ_MOST_RECENTLY_SET : the OCC sets the frequency of the quad to the most |
| recent frequency value requested by the CPUs in the quad. |
| |
| FREQ_MAX_IN_DOMAIN : the OCC sets the frequency of the CPUs in |
| the Quad to the maximum of the latest frequency requested by each of |
| the component cores. |
| - powercap: occ: Fix the powercapping range allowed for user |
| |
| The OCC provides two limits for the minimum powercap. One is the hard |
| powercap minimum, which is guaranteed by the OCC; the other is a soft |
| powercap minimum, which is lower than hard-min and may or may not be |
| asserted due to various power-thermal reasons. To allow users |
| to access the entire powercap range, this patch exports the soft powercap |
| minimum as the "powercap-min" DT property. It also adds a new |
| DT property called "powercap-hard-min" to export the hard-min powercap |
| limit. |
| - Add NVDIMM support |
| |
| NVDIMMs are memory modules that use a battery backup system to allow the |
| contents of RAM to be saved to non-volatile storage if system power goes |
| away unexpectedly. This allows them to be used as a high-performance |
| storage device, suitable for serving as a cache for SSDs and the like. |
| |
| Configuration of NVDIMMs is handled by hostboot and communicated to OPAL |
| via the HDAT. We need to parse out the NVDIMM memory ranges and create |
| memory regions with the "pmem-region" compatible label to make them |
| available to the host. |
| - core/exceptions: implement support for MCE interrupts in powersave |
| |
| The ISA specifies that MCE interrupts in power saving modes will enter |
| at 0x200 with powersave bits in SRR1 set. This is not currently |
| supported properly: the MCE will just happen like a normal interrupt, |
| but GPRs could be lost (e.g., r1, r2, r13 etc.), which would lead to |
| crashes. |
| |
| So check the power save bits similarly to the sreset vector, and |
| handle this properly. |
| - core/exceptions: allow recoverable sreset exceptions |
| |
| This requires implementing the MSR[RI] bit. Then just allow all |
| non-fatal sreset exceptions to recover. |
| - core/exceptions: implement an exception handler for non-powersave sresets |
| |
| Detect non-powersave sresets and send them to the normal exception |
| handler which prints registers and stack. |
| - Add PVR_TYPE_P9P |
| |
| Enable a new PVR to get us running on another p9 variant. |
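
The frequency-domain mask described in the power-mgmt item above can be illustrated with a minimal sketch. The helper names below are hypothetical and are not part of skiboot or Linux; only the mask values 0xFFF0 and 0xFFF8 come from the release note. Two CPUs are treated as sharing a frequency domain when their PIRs give the same result after the bitwise AND with the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values quoted in the release note. */
#define FREQ_DOMAIN_MASK_P9_QUAD 0xFFF0u /* POWER9: quad-wide domain */
#define FREQ_DOMAIN_MASK_P8_CORE 0xFFF8u /* POWER8: core-wide domain */

/* Hypothetical helper: frequency-domain identifier for a CPU's PIR. */
static inline uint32_t freq_domain_id(uint32_t pir, uint32_t mask)
{
	return pir & mask;
}

/* Hypothetical helper: true when both PIRs fall in the same domain. */
static inline bool same_freq_domain(uint32_t pir_a, uint32_t pir_b,
				    uint32_t mask)
{
	return freq_domain_id(pir_a, mask) == freq_domain_id(pir_b, mask);
}
```

With the POWER9 mask, PIRs 0x10 and 0x1F map to the same domain (0x10), while 0x0F and 0x10 do not. A consumer such as a cpufreq driver could group CPUs this way, although how any real driver does so is not stated here.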
| |
| Since v6.3-rc2: |
| |
| - Expose PNOR Flash partitions to host MTD driver via devicetree |
| |
| This makes it possible for the host to directly address each |
| partition without requiring each application to directly parse |
| the FFS headers. This has been in use for some time already to |
| allow BOOTKERNFW partition updates from the host. |
| |
| All partitions except BOOTKERNFW are marked readonly. |
| |
| The BOOTKERNFW partition is currently used exclusively by the Talos II platform. |
| |
| - Write boot progress to LPC port 80h |
| |
| This is an adaptation of what we currently do for op_display() on FSP |
| machines, inventing an encoding for what we can write into the single |
| byte at LPC port 80h. |
| |
| Port 80h is often used on x86 systems to indicate boot progress/status |
| and dates back a decent amount of time. Since a byte isn't exactly very |
| expressive for everything that can go on (and wrong) during boot, it's |
| all about compromise. |
| |
| Some systems (such as Zaius/Barreleye G2) have a physical dual 7 segment |
| display that displays these codes. So far, this has only been driven by |
| hostboot (see hostboot commit 90ec2e65314c). |
| |
| - Write boot progress to LPC ports 81 and 82 |
| |
| There's a thought to write more extensive boot progress codes to LPC |
| ports 81 and 82 to supplement/replace any reliance on port 80. |
| |
| We want to still emit port 80 for platforms like Zaius and Barreleye |
| that have the physical display. Ports 81 and 82 can be monitored by a |
| BMC though. |
| |
| - Add Talos II platform |
| |
| Talos II has some hardware differences from Romulus, therefore |
| we cannot guarantee Talos II == Romulus in skiboot. Copy and |
| slightly modify the Romulus files for Talos II. |
| |
| Since v6.3-rc1: |
| |
| - cpufeatures: Add tm-suspend-hypervisor-assist and tm-suspend-xer-so-bug node |
| |
| tm-suspend-hypervisor-assist for P9 >=DD2.2 |
| And a tm-suspend-xer-so-bug node for P9 DD2.2 only. |
| |
| I also treat P9P as P9 DD2.3 and add a unit test for the cpufeatures |
| infrastructure. |
| |
| Fixes: https://github.com/open-power/skiboot/issues/233 |
| |
| |
| Deprecated/Removed Features |
| --------------------------- |
| |
| - opal: Deprecate reading the PHB status |
| |
| The OPAL_PCI_EEH_FREEZE_STATUS call takes a bunch of parameters, one of |
| which is @phb_status. It is defined as __be64* and is always NULL in |
| current upstream Linux, but if anyone ever decided to read that status, |
| the PHB3 handler would assume it is a struct OpalIoPhb3ErrorData* |
| (which is a lot bigger than 8 bytes) and zero it, corrupting the |
| stack; p7ioc-phb has the same issue. |
| |
| This removes @phb_status from all eeh_freeze_status() hooks and moves |
| the error message from PHB4 to the affected OPAL handlers. |
| |
| As far as we can tell, nobody has ever used this and thus it's safe to remove. |
| - Remove POWER9N DD1 support |
| |
| This is not a shipping product and is no longer supported by Linux |
| or other firmware components. |
| |
| Since v6.3-rc3: |
| |
| - Disable fast-reset for POWER8 |
| |
| There is a bug with fast-reset when CPU cores are busy, which can be |
| reproduced by running `stress` and then trying `reboot -ff` (this is |
| what the op-test test cases FastRebootHostStress and |
| FastRebootHostStressTorture do). What happens is the cores lock up, |
| which isn't the best thing in the world when you want them to start |
| executing instructions again. |
| |
| A workaround is to use instruction ramming, which while greatly |
| increasing the reliability of fast-reset on p8, doesn't make it perfect. |
| |
| Instruction ramming is what pdbg was modified to do in order to have the |
| sreset functionality work reliably on p8. |
| pdbg patches: https://patchwork.ozlabs.org/project/pdbg/list/?series=96593&state=* |
| |
| Fixes: https://github.com/open-power/skiboot/issues/185 |
| |
| General |
| ------- |
| |
| - core/i2c: Various bits of refactoring |
| - refactor backtrace generation infrastructure |
| - astbmc: Handle failure to initialise raw flash |
| |
| Initialising raw flash led to a dead assignment to rc. Check the return |
| code and take the failure path as necessary. Both before and after the |
| fix we see output along the lines of the following when flash_init() |
| fails: :: |
| |
| [ 53.283182881,7] IRQ: Registering 0800..0ff7 ops @0x300d4b98 (data 0x3052b9d8) |
| [ 53.283184335,7] IRQ: Registering 0ff8..0fff ops @0x300d4bc8 (data 0x3052b9d8) |
| [ 53.283185513,7] PHB#0000: Initializing PHB... |
| [ 53.288260827,4] FLASH: Can't load resource id:0. No system flash found |
| [ 53.288354442,4] FLASH: Can't load resource id:1. No system flash found |
| [ 53.342933439,3] CAPP: Error loading ucode lid. index=200ea |
| [ 53.462749486,2] NVRAM: Failed to load |
| [ 53.462819095,2] NVRAM: Failed to load |
| [ 53.462894236,2] NVRAM: Failed to load |
| [ 53.462967071,2] NVRAM: Failed to load |
| [ 53.463033077,2] NVRAM: Failed to load |
| [ 53.463144847,2] NVRAM: Failed to load |
| |
| Eventually followed by: :: |
| |
| [ 57.216942479,5] INIT: platform wait for kernel load failed |
| [ 57.217051132,5] INIT: Assuming kernel at 0x20000000 |
| [ 57.217127508,3] INIT: ELF header not found. Assuming raw binary. |
| [ 57.217249886,2] NVRAM: Failed to load |
| [ 57.221294487,0] FATAL: Kernel is zeros, can't execute! |
| [ 57.221397429,0] Assert fail: core/init.c:615:0 |
| [ 57.221471414,0] Aborting! |
| CPU 0028 Backtrace: |
| S: 0000000031d43c60 R: 000000003001b274 ._abort+0x4c |
| S: 0000000031d43ce0 R: 000000003001b2f0 .assert_fail+0x34 |
| S: 0000000031d43d60 R: 0000000030014814 .load_and_boot_kernel+0xae4 |
| S: 0000000031d43e30 R: 0000000030015164 .main_cpu_entry+0x680 |
| S: 0000000031d43f00 R: 0000000030002718 boot_entry+0x1c0 |
| --- OPAL boot --- |
| |
| Analysis of the execution paths suggests we'll always "safely" end this |
| way due to the setup sequence for the blocklevel callbacks in flash_init() |
| and error handling in blocklevel_get_info(), and there's no current risk |
| of executing from unexpected memory locations. As such the issue is |
| reduced down to a fix for poor error hygiene in the original change |
| and a resolution for a Coverity warning (famous last words etc). |
| - core/flash: Retry requests as necessary in flash_load_resource() |
| |
| We would like to boot successfully if we have a dependency on the BMC |
| for flash even if the BMC is not currently ready to service flash |
| requests. On the assumption that it will become ready, retry for several |
| minutes to cover a BMC reboot cycle and *eventually* rather than |
| *immediately* crash out with: :: |
| |
| [ 269.549748] reboot: Restarting system |
| [ 390.297462587,5] OPAL: Reboot request... |
| [ 390.297737995,5] RESET: Initiating fast reboot 1... |
| [ 391.074707590,5] Clearing unused memory: |
| [ 391.075198880,5] PCI: Clearing all devices... |
| [ 391.075201618,7] Clearing region 201ffe000000-201fff800000 |
| [ 391.086235699,5] PCI: Resetting PHBs and training links... |
| [ 391.254089525,3] FFS: Error 17 reading flash header |
| [ 391.254159668,3] FLASH: Can't open ffs handle: 17 |
| [ 392.307245135,5] PCI: Probing slots... |
| [ 392.363723191,5] PCI Summary: |
| ... |
| [ 393.423255262,5] OCC: All Chip Rdy after 0 ms |
| [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at |
| 0x30800a88 390645 bytes |
| [ 393.453202605,0] FATAL: Kernel is zeros, can't execute! |
| [ 393.453247064,0] Assert fail: core/init.c:593:0 |
| [ 393.453289682,0] Aborting! |
| CPU 0040 Backtrace: |
| S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c |
| S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34 |
| S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30 |
| S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c |
| S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c |
| --- OPAL boot --- |
| |
| The OPAL flash API hooks directly into the blocklevel layer, so there's |
| no delay for e.g. the host kernel, just for asynchronously loaded |
| resources during boot. |
| - fast-reboot: occ: Call occ_pstates_init() on fast-reset on all machines |
| |
| Commit 815417dcda2e ("init, occ: Initialise OCC earlier on BMC systems") |
| conditionally invoked occ_pstates_init() only on FSP based systems in |
| load_and_boot_kernel(). As a result, the pstate table is re-parsed on FSP |
| systems and skipped on BMC systems during fast-reboot. This patch fixes |
| this by invoking occ_pstates_init() on all machines during fast-reboot. |
| - opal/hmi: Don't retry TOD recovery if it is already in failed state. |
| |
| On TOD failure, all cores/threads receive an HMI. The very first thread that |
| gets the interrupt fixes the TOD, whereas the others just reset their |
| respective HMER error bit and return. But when the TOD is unrecoverable, |
| all the threads try to do TOD recovery one by one, causing them to spend |
| more time inside OPAL. Set a global flag when the TOD is unrecoverable so |
| that the rest of the threads go back to Linux immediately, avoiding |
| lockups in the system reboot/panic path. |
| - hw/bt: Do not disable ipmi message retry during OPAL boot |
| |
| Currently OPAL doesn't know whether the BMC is functioning or not. If the |
| BMC is down (e.g. during a BMC reboot), we keep retrying the message to the |
| BMC. So in some corner cases we may hit a hard lockup in the kernel. |
| |
| Ideally we should avoid using the synchronous path as much as possible. But |
| for now commit 01f977c3 added an option to disable message retry for |
| synchronous messages. That fix is not required during boot, so do not |
| disable IPMI message retry during OPAL boot. |
| - hdata/memory: Fix warning message |
| |
| Even though we added memory to the device tree, we get the warning below. :: |
| |
| [ 57.136949696,3] Unable to use memory range 0 from MSAREA 0 |
| [ 57.137049753,3] Unable to use memory range 0 from MSAREA 1 |
| [ 57.137152335,3] Unable to use memory range 0 from MSAREA 2 |
| [ 57.137251218,3] Unable to use memory range 0 from MSAREA 3 |
| - hw/bt: Add backend interface to disable ipmi message retry option |
| |
| During boot OPAL makes an IPMI_GET_BT_CAPS call to the BMC to get the BT |
| interface capabilities, which include the IPMI message max resend count, |
| message timeout, etc. Most of the time OPAL gets a response from the BMC |
| within the specified timeout. In some corner cases (like an mboxd daemon |
| reset on the BMC, a BMC reboot, etc.) OPAL may not get a response within |
| the timeout period. In such scenarios, OPAL resends the message until the |
| max resend count is reached. |
| |
| OPAL uses synchronous IPMI messages (ipmi_queue_msg_sync()) for a few |
| operations like flash read, write, etc. The thread waits in OPAL until |
| it gets a response from the BMC. In some corner cases like a BMC reboot, |
| the thread may wait in OPAL for a long time (more than 20 seconds), which |
| results in a kernel hard lockup. |
| |
| This patch introduces a new interface to disable the message resend option. |
| We disable message resend for synchronous messages. This greatly reduces |
| kernel hard lockup issues. |
| |
| This is a short term fix. The long term solution is to convert all |
| synchronous messages to asynchronous ones. |
| - ipmi/power: Fix system reboot issue |
| |
| The kernel makes a reboot/shutdown OPAL call for reboot/shutdown. Once the |
| kernel gets a response from OPAL it runs opal_poll_events() until firmware |
| handles the request. |
| |
| On BMC based systems, OPAL makes an IPMI call (IPMI_CHASSIS_CONTROL) to |
| initiate system reboot/shutdown. At present OPAL queues the IPMI message |
| and returns SUCCESS to the host. If the BMC is not ready to accept the |
| command (e.g. during a BMC reboot), the message will fail, and we have to |
| manually reboot/shut down the system using the BMC interface. |
| |
| This patch adds logic to validate the message return value. If the message |
| failed, it is resent. At some stage the BMC will be ready to accept |
| the message and handle it. |
| - firmware-versions: Add test case for parsing VERSION |
| |
| Also make it possible to use with afl-lop/afl-fuzz just to help make |
| *sure* we're all good. |
| |
| Additionally, if we hit an entry in VERSION that is larger than our |
| buffer size, we skip over it gracefully rather than overwriting the |
| stack. This is only a problem if VERSION isn't trusted, which as of |
| 4b8cc05a94513816d43fb8bd6178896b430af08f it is verified as part of |
| Secure Boot. |
| - core/fast-reboot: improve NMI handling during fast reset |
| |
| Improve sreset and MCE handling in fast reboot. Switch the HILE bit |
| off before copying OPAL's exception vectors, so NMIs can be handled |
| properly. Also disable MSR[ME] while the vectors are being overwritten. |
| - core/cpu: HID update race |
| |
| If the per-core HID register is updated concurrently by multiple |
| threads, updates can get lost. This has been observed during fast |
| reboot where the HILE bit does not get cleared on all cores, which |
| can cause machine check exception interrupts to crash. |
| |
| Fix this by only updating HID on thread0. |
| - SLW: Print verbose info on errors only |
| |
| Change the print level from debug to warning for reporting |
| bad EC_PPM_SPECIAL_WKUP_* scom values. To reduce clutter |
| in the log, print only on error. |
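
The "skip oversized entries" behaviour described in the firmware-versions item above can be sketched as follows. This is an illustration only, not the skiboot parser: the function names, the `ENTRY_MAX` size and the newline-separated format are assumptions made for the example. An entry that does not fit the fixed buffer is skipped instead of being copied.

```c
#include <stddef.h>
#include <string.h>

#define ENTRY_MAX 16 /* illustrative buffer size, not skiboot's */

/*
 * Hypothetical helper: copy the next newline-terminated entry into out
 * when it fits, otherwise skip it. Returns the start of the following
 * entry, or NULL once the input is exhausted.
 */
static const char *next_entry(const char *p, char out[ENTRY_MAX], int *copied)
{
	const char *nl;
	size_t len;

	if (!p || *p == '\0')
		return NULL;

	nl = strchr(p, '\n');
	len = nl ? (size_t)(nl - p) : strlen(p);

	if (len < ENTRY_MAX) {
		memcpy(out, p, len);
		out[len] = '\0';
		*copied = 1;
	} else {
		*copied = 0; /* too long for the buffer: skip, never overflow */
	}
	return nl ? nl + 1 : p + len;
}

/* Count the entries that fit in the buffer and were stored. */
static int count_stored_entries(const char *text)
{
	char buf[ENTRY_MAX];
	int copied, n = 0;

	while ((text = next_entry(text, buf, &copied)) != NULL)
		n += copied;
	return n;
}
```

The length check happens before any copy, so an untrusted input can at worst cause an entry to be ignored.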
| |
| Since v6.3-rc2: |
| |
| - hw/xscom: add missing P9P chip name |
| - asm/head: balance branches to avoid link stack predictor mispredicts |
| |
| The Linux wrapper for OPAL call and return is arranged like this: :: |
| |
| __opal_call: |
| mflr r0 |
| std r0,PPC_STK_LROFF(r1) |
| LOAD_REG_ADDR(r11, opal_return) |
| mtlr r11 |
| hrfid -> OPAL |
| |
| opal_return: |
| ld r0,PPC_STK_LROFF(r1) |
| mtlr r0 |
| blr |
| |
| When skiboot returns to Linux, it branches to LR (i.e., opal_return) |
| with a blr. This unbalances the link stack predictor and will cause |
| mispredicts back up the return stack. |
| - external/mambo: also invoke readline for the non-autorun case |
| - asm/head.S: set POWER9 radix HID bit at entry |
| |
| When running in virtual memory mode, the radix MMU hid bit should not |
| be changed, so set this in the initial boot SPR setup. |
| |
| As a side effect, fast reboot also has HID0:RADIX bit set by the |
| shared spr init, so no need for an explicit call. |
| - build: link with --orphan-handling=warn |
| |
| The linker can warn when the linker script does not explicitly place |
| all sections. These orphan sections are placed according to |
| heuristics, which may not always be desirable. Enable this warning. |
| - build: -fno-asynchronous-unwind-tables |
| |
| skiboot does not use unwind tables, so this option saves about 100kB, |
| mostly from .text. |
| - opal/hmi: Initialize the hmi event with old value of TFMR. |
| |
| Do this before we fix TFAC errors. Otherwise the event at the host console |
| shows no thread error reported in the TFMR register. |
| |
| Without this patch the console event shows TFMR with no thread error |
| (DEC parity error TFMR[59] injection): :: |
| |
| [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered] |
| [ 53.737596] Error detail: Timer facility experienced an error |
| [ 53.737611] HMER: 0840000000000000 |
| [ 53.737621] TFMR: 3212000870e04000 |
| |
| After this patch it shows the old TFMR value on the host console: :: |
| |
| [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered] |
| [ 2302.267305] Error detail: Timer facility experienced an error |
| [ 2302.267320] HMER: 0840000000000000 |
| [ 2302.267330] TFMR: 3212000870e14010 |
| |
| |
| IBM FSP based platforms |
| ----------------------- |
| |
| - platforms/firenze: Rework I2C controller fixups |
| - platforms/zz: Re-enable LXVPD slot information parsing |
| |
| From memory this was disabled in the distant past since we were waiting |
| for an update to the LXVPD format. It looks like that never happened, |
| so re-enable it for the ZZ platform so that we can get PCI slot location |
| codes on ZZ. |
| |
| HIOMAP |
| ------ |
| - astbmc: Try IPMI HIOMAP for P8 |
| |
| The HIOMAP protocol was developed after the release of P8 in preparation |
| for P9. As a consequence P9 always uses it, but it has rarely been |
| enabled for P8. P8DTU has recently added IPMI HIOMAP support to its BMC |
| firmware, so enable its use in skiboot with P8 machines. Doing so |
| requires some rework to ensure fallback works correctly as in the past |
| the fallback was to mbox, which will only work for P9. |
| - libflash/ipmi-hiomap: Enforce message size for empty response |
| |
| The protocol defines the response to the associated messages as empty |
| except for the command ID and sequence fields. If the BMC is returning |
| extra data consider the message malformed. |
| - libflash/ipmi-hiomap: Remove unused close handling |
| |
| Issuing a HIOMAP_C_CLOSE is not required by the protocol specification, |
| rather a close can be implicit in a subsequent |
| CREATE_{READ,WRITE}_WINDOW request. The implicit close provides an |
| opportunity to reduce LPC traffic and the implementation takes up that |
| optimisation, so remove the case from the IPMI callback handler. |
| - libflash/ipmi-hiomap: Overhaul event handling |
| |
| Reworking the event handling was inspired by a bug report by Vasant |
| where the host would get wedged on multiple flash access attempts in the |
| face of a persistent error state on the BMC-side. The cause of this bug |
| was the early-exit based on ctx->update, which erroneously assumed that |
| all events had been completely handled in prior calls to |
| ipmi_hiomap_handle_events(). This is not true if e.g. |
| HIOMAP_E_DAEMON_READY is clear in the prior calls. |
| |
| Regardless, there were other correctness and efficiency problems with |
| the handling strategy: |
| |
| * Ack-able event state was not restored in the face of errors in the |
| process of re-establishing protocol state |
| * It forced needless window restoration with respect to the context in |
| which ipmi_hiomap_handle_events() was called. |
| * Tests for HIOMAP_E_DAEMON_READY and HIOMAP_E_FLASH_LOST were redundant |
| with the overhauled error handling introduced in the previous patch |
| |
| Fix all of the above issues and add comments to explain the event |
| handling flow. |
| - libflash/ipmi-hiomap: Overhaul error handling |
| |
| The aim is to improve the robustness with respect to absence of the |
| BMC-side daemon. The current error handling roughly mirrors what was |
| done for the mailbox implementation, but there's room for improvement. |
| |
| Errors are split into two classes, those that affect the transport state |
| and those that affect the window validity. From here, we push the |
| transport state error checks right to the bottom of the stack, to ensure |
| the link is known to be in a good state before any message is sent. |
| Window validity tests remain as they were in the hiomap_window_move() |
| and ipmi_hiomap_read() functions. Validity tests are not necessary in |
| the write and erase paths as we will receive an error response from the |
| BMC when performing a dirty or flush on an invalid window. |
| |
| Recovery also remains as it was, done on entry to the blocklevel |
| callbacks. If an error state is encountered in the middle of an |
| operation no attempt is made to recover it on the spot, instead the |
| error is returned up the stack and the caller can choose how it wishes |
| to respond. |
| - libflash/ipmi-hiomap: Fix leak of msg in callback |
| |
| Since v6.3-rc1: |
| |
| - libflash/ipmi-hiomap: Fix blocks count issue |
| |
| We convert the data size to a block count and pass the block count to the |
| BMC. If the data size is not block aligned, we end up sending a block count |
| smaller than the actual data, and the BMC writes only partial data to flash. |
| |
| Sample log :: |
| |
| [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8 |
| [ 594.398756487,7] HIOMAP: Flushed writes |
| [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970 |
| [ 594.419897507,7] HIOMAP: Flushed writes |
| |
| In this case HIOMAP sent data with block count=0 and hence BMC didn't |
| flush data to flash. |
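
The arithmetic behind the block-count issue can be sketched as below. This is a minimal illustration, not the actual libflash change: the helper names are hypothetical, and a 4 KiB block size (shift of 12) is an assumption used only for the example. Truncating division turns a sub-block length into zero blocks, while rounding up always covers the data.

```c
#include <stdint.h>

/* Truncating conversion: lengths smaller than one block become 0. */
static uint32_t bytes_to_blocks_truncating(uint64_t len, unsigned int shift)
{
	return (uint32_t)(len >> shift);
}

/* Rounding-up conversion: any non-zero length needs at least 1 block. */
static uint32_t bytes_to_blocks_round_up(uint64_t len, unsigned int shift)
{
	return (uint32_t)((len + (1ULL << shift) - 1) >> shift);
}
```

With the assumed shift of 12, the 8-byte and 3970-byte writes from the sample log both truncate to 0 blocks but round up to 1 block.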
| |
| |
| |
| POWER8 |
| ------ |
| - hw/phb3/naples: Disable D-states |
| |
| Putting "Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013]" |
| (more precisely, the second of its 2 PCI functions, no matter in what |
| order) into the D3 state causes EEH with the "PCT timeout" error. |
| This has been noticed on garrison machines only; firestones do not |
| seem to have this issue. |
| |
| This disables D-state changes for devices on root buses on Naples by |
| installing a config space access filter (copied from PHB4). |
| - cpufeatures: Always advertise POWER8NVL as DD2 |
| |
| Despite the major version of PVR being 1 (0x004c0100) for POWER8NVL, |
| these chips are functionally equivalent to P8/P8E DD2 levels. |
| |
| This advertises POWER8NVL as DD2. As a result, skiboot adds |
| ibm,powerpc-cpu-features/processor-control-facility for such CPUs and |
| the Linux kernel can use hypervisor doorbell messages to wake secondary |
| threads; otherwise "KVM: CPU %d seems to be stuck" would appear because |
| of missing LPCR_PECEDH. |
| |
| p8dtu Platform |
| ^^^^^^^^^^^^^^ |
| - p8dtu: Configure BMC graphics |
| |
| We can no longer read the values from the BMC in the way we have in the |
| past. Values were provided by Eric Chen of SMC. |
| - p8dtu: Enable HIOMAP support |
| |
| Vesnin Platform |
| ^^^^^^^^^^^^^^^ |
| - platforms/vesnin: Disable PCIe port bifurcation |
| |
| PCIe ports connected to CPU1 and CPU3 now work as x16 instead of x8x8. |
| |
| - Fix hang in pnv_platform_error_reboot path due to TOD failure. |
| |
| On TOD failure, with the TB stuck, when Linux heads down the |
| pnv_platform_error_reboot() path due to an unrecoverable HMI event, the |
| panic CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, all |
| other CPUs are in smp_handle_nmi_ipi() waiting for the panic CPU to proceed. |
| But with the panic CPU stuck inside OPAL, Linux never recovers or reboots. :: |
| |
| p0 c1 t0 |
| NIA : 0x000000003001dd3c <.time_wait+0x64> |
| CFAR : 0x000000003001dce4 <.time_wait+0xc> |
| MSR : 0x9000000002803002 |
| LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> |
| |
| STACK: SP NIA |
| 0x0000000031c236e0 0x0000000031c23760 (big-endian) |
| 0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> |
| 0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c> |
| 0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150> |
| 0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc> |
| 0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc> |
| 0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc> |
| 0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4> |
| 0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0> |
| 0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134> |
| 0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80> |
| 0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94> |
| 0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0> |
| 0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4> |
| 0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140> |
| 0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c> |
| 0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68> |
| 0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8> |
| 0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8> |
| 0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88> |
| 0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164> |
| 0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c> |
| |
| This is because there is a while loop towards the end of |
| ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer matches |
| "msg". It loops over time_wait_ms() until the exit condition is met. In the |
| normal scenario time_wait_ms() runs pollers so that the ipmi backend gets |
| a chance to check the ipmi response and set sync_msg to NULL. :: |
| |
| while (sync_msg == msg) |
| time_wait_ms(10); |
| |
| But when the TB is in a failed state, time_wait_ms()->time_wait_poll() |
| returns immediately without calling pollers and hence we end up looping |
| forever. This patch fixes this hang by calling opal_run_pollers() in the TB |
| failed state as well. |
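
The hang and its fix can be modelled with a small self-contained simulation. Nothing below is skiboot code: `struct sim`, `run_pollers()`, `wait_step()` and `wait_for_reply()` are hypothetical stand-ins for the IPMI backend, the poller run and the wait loop, and the iteration cap exists only so the model terminates. It shows that a wait step which returns early when the timebase has failed never lets the poller clear the pending message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the state involved in the wait loop. */
struct sim {
	bool tb_failed;        /* timebase is not ticking */
	int polls_until_done;  /* poller runs needed before the reply lands */
	const void *sync_msg;  /* pending synchronous message, NULL when done */
};

/* Stand-in for running pollers: the backend eventually clears sync_msg. */
static void run_pollers(struct sim *s)
{
	if (s->polls_until_done > 0 && --s->polls_until_done == 0)
		s->sync_msg = NULL;
}

/* Stand-in for one wait step; the flag selects old or fixed behaviour. */
static void wait_step(struct sim *s, bool poll_when_tb_failed)
{
	if (s->tb_failed && !poll_when_tb_failed)
		return; /* old behaviour: return without running pollers */
	run_pollers(s);
}

/*
 * Returns the number of wait steps used, or -1 if the message is still
 * pending after max_iters steps (the modelled hang).
 */
static int wait_for_reply(struct sim *s, const void *msg,
			  bool poll_when_tb_failed, int max_iters)
{
	int i;

	for (i = 0; i < max_iters && s->sync_msg == msg; i++)
		wait_step(s, poll_when_tb_failed);
	return s->sync_msg == msg ? -1 : i;
}
```

With `tb_failed` set, the old behaviour never completes, whereas polling in the failed state completes after the configured number of poller runs.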
| |
| |
| .. _skiboot-6.3-power9: |
| |
| POWER9 |
| ------ |
| |
| - Retry link training at PCIe GEN1 if presence detected but training repeatedly failed |
| |
| Certain older PCIe 1.0 devices will not train unless the training process starts at GEN1 speeds. |
| As a last resort when a device will not train, fall back to GEN1 speed for the last training attempt. |
| |
| This is verified to fix devices based on the Conexant CX23888 on the Talos II platform. |
| - hw/phb4: Drop FRESET_DEASSERT_DELAY state |
| |
| The delay between the ASSERT_DELAY and DEASSERT_DELAY states is set to |
| one timebase tick. This state seems to have been a hold over from PHB3 |
| where it was used to add a 1s delay between de-asserting PERST and |
| polling the link for the CAPI FPGA. There's no requirement for that here |
| since the link polling on PHB4 is a bit smarter so we should be fine. |
| - hw/phb4: Factor out PERST control |
| |
| Some time ago Mikey added some code to work around a bug we found where a |
| certain RAID card wouldn't come back again after a fast-reboot. The |
| workaround is setting the Link Disable bit before asserting PERST and |
| clearing it after de-asserting PERST. |
| |
| Currently we do this in the FRESET path, but not in the CRESET path. |
| This patch moves the PERST control into its own function to reduce |
| duplication and so that the workaround is applied in all circumstances. |
| - hw/phb4: Remove FRESET presence check |
| |
| When we do an freset the first step is to check if a card is present in |
| the slot. However, this only occurs when we enter phb4_freset() with the |
| slot state set to SLOT_NORMAL. This occurs in: |
| |
| a) The creset path, and |
| b) When the OS manually requests an FRESET via an OPAL call. |
| |
| (a) is problematic because in the boot path the generic code will put the |
| slot into FRESET_START manually before calling into phb4_freset(). This |
| can result in a situation where a device is detected on boot, but not |
| after a CRESET. |
| |
| I've noticed this occurring on systems where the PHB's slot presence |
| detect signal is not wired to an adapter. In this situation we can rely |
| on the in-band presence mechanism, but the presence check will make |
| us exit before that has a chance to work. |
| |
| Additionally, if we enter from the CRESET path this early exit leaves |
| the slot's PERST signal asserted. This isn't currently an issue, |
| but if we want to support hotplug of devices into the root port it will |
| be. |
| - hw/phb4: Skip FRESET PERST when coming from CRESET |
| |
| PERST is asserted at the beginning of the CRESET process to prevent |
| the downstream device from interacting with the host while the PHB logic |
| is being reset and re-initialised. There is at least a 100ms wait during |
| the CRESET processing so it's not necessary to wait this time again |
| in the FRESET handler. |
| |
| This patch extends the delay after resetting the PHB logic to cover |
| the 250ms PERST wait period that we typically use and sets the |
| skip_perst flag so that we don't wait this time again in the FRESET |
| handler. |
| - hw/phb4: Look for the hub-id from in the PBCQ node |
| |
| The hub-id is stored in the PBCQ node rather than the stack node so we |
| never add it to the PHB node. This breaks the lxvpd slot lookup code |
| since the hub-id is encoded in the VPD record that we need to find the |
| slot information. |
| - hdata/iohub: Look for IOVPD on P9 |
| |
| P8 and P9 use the same IO VPD setup, so we need to load the IOHUB VPD on |
| P9 systems too. |
| |
| Since v6.3-rc2: |
| |
| - hw/phb4: Squash the IO bridge window |
| |
| The PCI-PCI bridge spec says that bridges that do not implement an IO |
| window should hardcode the IO base and limit registers to zero. |
| Unfortunately, these registers only define the upper bits of the IO |
| window and the low bits are assumed to be 0 for the base and 1 for the |
| limit address. As a result, setting both to zero can be mis-interpreted |
| as a 4K IO window. |
| |
| This patch fixes the problem the same way PHB3 does. It sets the IO base |
| and limit values to 0xf000 and 0x1000 respectively which most software |
| interprets as a disabled window. |
| |
| lspci before patch: :: |
| |
| 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode]) |
| I/O behind bridge: 00000000-00000fff |
| |
| lspci after patch: :: |
| |
| 0000:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode]) |
| I/O behind bridge: None |
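| |
| The misinterpretation comes from how the 8-bit IO base and limit registers |
| are decoded. A minimal sketch of that decoding follows; the function and |
| struct names are hypothetical, only the 16-bit IO window is modelled, and |
| it assumes the values quoted above correspond to register bytes 0xf0 and |
| 0x10. |

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type for this example, not a skiboot structure. */
struct io_window {
	uint32_t base;
	uint32_t limit;
	bool enabled;
};

/*
 * Bits 7:4 of each register hold address bits 15:12. The low 12 bits are
 * implied: all zeroes for the base and all ones for the limit.
 */
static struct io_window decode_io_window(uint8_t base_reg, uint8_t limit_reg)
{
	struct io_window w;

	w.base = (uint32_t)(base_reg & 0xf0) << 8;
	w.limit = ((uint32_t)(limit_reg & 0xf0) << 8) | 0xfff;
	/* Software treats a base above the limit as a disabled window. */
	w.enabled = w.base <= w.limit;
	return w;
}
```

| With both registers at zero this decodes to 0x0000-0x0fff, the 4K window |
| lspci reported before the patch. With a base above the limit the window |
| decodes as disabled. |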
| |
| - hw/xscom: Enable sw xstop by default on p9 |
| |
| This was disabled at some point during bringup to make life easier for |
| the lab folks trying to debug NVLink issues. This hack really should |
| have never made it out into the wild though, so we now have the |
| following situation occurring in the field: |
| |
| 1) A bad happens |
| 2) The host kernel receives an unrecoverable HMI and calls into OPAL to |
| request a platform reboot. |
| 3) OPAL rejects the reboot attempt and returns to the kernel with |
| OPAL_PARAMETER. |
| 4) Kernel panics and attempts to kexec into a kdump kernel. |
| |
| A side effect of the HMI seems to be CPUs becoming stuck which results |
| in the initialisation of the kdump kernel taking an extremely long time |
| (6+ hours). It's also been observed that after performing a dump the |
| kdump kernel then crashes itself because OPAL has ended up in a bad |
| state as a side effect of the HMI. |
| |
| All up, it's not very good so re-enable the software checkstop by |
| default. If people still want to turn it off they can using the nvram |
| override. |
| |
| |
| CAPI2 |
| ^^^^^ |
| - capp/phb4: Prevent HMI from getting triggered when disabling CAPP |
| |
| While disabling CAPP an HMI gets triggered as soon as the ETU is put in |
| reset mode. This happens because, before we can disable CAPP, it detects |
| the PHB link going down and triggers an HMI requesting OPAL to perform |
| CAPP recovery. This has the unintended side effect of spamming the |
| OPAL logs with malfunction alert messages and may also confuse the |
| user. |
| |
| To prevent this we mask the CAPP FIR error 'PHB Link Down' Bit(31) |
| when we are disabling CAPP, just before we put the ETU in reset in |
| phb4_creset(). Since bringing down the PHB link will no longer |
| trigger an HMI and CAPP recovery, we manually set the |
| PHB4_CAPP_RECOVERY flag on the phb to force recovery during creset. |
| |
| - phb4/capp: Implement sequence to disable CAPP and enable fast-reset |
| |
| We implement the h/w sequence to disable CAPP in disable_capi_mode() and |
| with it also enable fast-reset for CAPI mode in phb4_set_capi_mode(). |
| |
| The sequence to disable CAPP is executed in three phases. The first two |
| phases are implemented in disable_capi_mode() where we reset the CAPP |
| registers followed by the PEC registers to their init values. The third |
| and final phase is to reset the PHB CAPI Compare/Mask Register and |
| is done in phb4_init_ioda3(). The PHB reset is done in |
| phb4_init_ioda3() because by the time the OPAL PCI reset state machine |
| reaches this function the PHB is already un-fenced and its |
| configuration registers are accessible via mmio. |
| - capp/phb4: Force CAPP to PCIe mode during kernel shutdown |
| |
| This patch introduces a new opal syncer for PHB4 named |
| phb4_host_sync_reset(). We register this opal syncer when CAPP is |
| activated successfully in phb4_set_capi_mode() so that it will be |
| called at kernel shutdown during fast-reset. |
| |
| During kernel shutdown the function will then repeatedly call |
| phb->ops->set_capi_mode() to switch CAPP to PCIe mode. If |
| set_capi_mode() returns OPAL_BUSY, which indicates that CAPP is |
| still transitioning to the new state, it calls slot->ops.run_sm() to |
| ensure that the OPAL slot reset state machine makes forward progress. |
| |
| |
| Witherspoon Platform |
| ^^^^^^^^^^^^^^^^^^^^ |
| - platforms/witherspoon: Make PCIe shared slot error message more informative |
| |
| If we're missing chips for some reason, we print a warning when configuring |
| the PCIe shared slot. |
| |
| The warning doesn't really make it clear what "shared slot" is, and if it's |
| printed, it'll come right after a bunch of messages about NPU setup, so |
| let's clarify the message to explicitly mention PCI. |
| - witherspoon: Add nvlink2 interconnect information |
| |
| See :ref:`skiboot-6.3-new-features` for details. |
| |
| Zaius Platform |
| ^^^^^^^^^^^^^^ |
| |
| - zaius: Add BMC description |
| |
| Frederic reported that Zaius was failing with a NULL dereference when |
| trying to initialise IPMI HIOMAP. It turns out that the BMC wasn't |
| described at all, so add a description. |
| |
| p9dsu platform |
| ^^^^^^^^^^^^^^ |
| - p9dsu: Fix p9dsu default variant |
| |
| Add the default when no riser_id is returned from the ipmi query. |
| |
| Allow a little more time for BMC reply and cleanup some label strings. |
| |
| |
| PCIe |
| ---- |
| |
| See :ref:`skiboot-6.3-power9` for POWER9 specific PCIe changes. |
| |
| - core/pcie-slot: Don't bail early in the power on case |
| |
| Exiting early in the power off case makes sense since we can't disable |
| slot power (or assert PERST) for surprise hotplug slots. However, we |
| should not exit early in the power-on case since it's possible slot |
| power may have been disabled (or just not enabled at boot time). |
| - firenze-pci: Always init slot info from LXVPD |
| |
| We can get slot information from the LXVPD without having power control |
| information about that slot. This patch changes the init path so that |
| we always override the add_properties() call rather than only when we |
| have power control information about the slot. |
| - fsp/lxvpd: Print more LXVPD slot information |
| |
| Useful to know since it changes the behaviour of the slot core. |
| - core/pcie-slot: Set power state from the PWRCTL flag |
| |
| For some reason we look at the power control indicator and use that to |
| determine if the slot is "off" rather than the power control flag that |
| is used to power down the slot. |
| |
| While we're here change the default behaviour so that the slot is |
| assumed to be powered on if there's no slot capability, or if there's |
| no power control available. |
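| |
| The resulting decision can be sketched as below. This is an illustrative |
| sketch rather than the skiboot code: the function name and macro names are |
| hypothetical, while the bit positions are those of the PCIe Slot |
| Capabilities and Slot Control registers. |

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Capabilities bit 1: a power controller is present. */
#define SLOTCAP_PWR_CTLR_PRESENT 0x00000002u
/* Slot Control bit 10: Power Controller Control, 1 means power off. */
#define SLOTCTL_PWR_CTLR_OFF     0x0400u

/*
 * Decide whether a slot should be treated as powered on. The power
 * indicator field (Slot Control bits 9:8) is deliberately not consulted.
 */
static bool slot_powered_on(bool has_slot_cap, uint32_t slotcap, uint16_t slotctl)
{
	if (!has_slot_cap)
		return true;    /* no slot capability: assume powered on */
	if (!(slotcap & SLOTCAP_PWR_CTLR_PRESENT))
		return true;    /* no power control available: assume on */
	return !(slotctl & SLOTCTL_PWR_CTLR_OFF);
}
```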
| - core/pci: Increase the max slot string size |
| |
| The maximum string length for the slot label / device location code in |
| the PCI summary is currently 32 characters. This results in some IBM |
| location codes being truncated due to their length, e.g. :: |
| |
| PHB#0001:02:11.0 [SWDN] SLOT=C11 x8 |
| PHB#0001:13:00.0 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C |
| PHB#0001:13:00.1 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C |
| PHB#0001:13:00.2 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C |
| PHB#0001:13:00.3 [EP ] *snip* LOC_CODE=U78D3.ND1.WZS004A-P1-C |
| |
| This obscures the actual location of the card, and it looks bad. This |
| patch increases the maximum length of the label string to 80 characters |
| since that's the maximum length for a location code. |
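| |
| The truncation is ordinary fixed-buffer behaviour. A minimal sketch, |
| assuming the label is formatted with snprintf() as ``LOC_CODE=<code>``: the |
| helper name is hypothetical, and callers should pass made-up location |
| codes rather than real ones. |

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a slot label into buf. Returns the length the full label needs;
 * a return value >= size means the stored label was truncated.
 */
static size_t format_slot_label(char *buf, size_t size, const char *loc_code)
{
	int n = snprintf(buf, size, "LOC_CODE=%s", loc_code);

	return n < 0 ? 0 : (size_t)n;
}
```

| With a 32-byte buffer only 31 characters survive, which leaves room for 22 |
| characters of location code after the ``LOC_CODE=`` prefix, the same length |
| as the truncated codes shown above. An 80-byte buffer holds the whole |
| label. |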
| |
| |
| Since v6.3-rc3: |
| |
| - pci: Try harder to add meaningful ibm,loc-code |
| |
| We keep the existing logic of looking to the parent for the slot-label or |
| slot-location-code, but if all that fails we now look directly for the |
| slot-location-code (as this should give us the correct loc code for |
| things directly under the PHB), and otherwise we just look for a |
| loc-code. |
| |
| The applicable bit of PAPR here is: |
| |
| R1–12.1–1. Each instance of a hardware entity (FRU) has a platform |
| unique location code and any node in the OF device tree that describes |
| a part of a hardware entity must include the “ibm,loc-code” property |
| with a value that represents the location code for that hardware |
| entity. |
| |
| which we weren't really fully obeying at any recent (ever?) point in |
| time. Now we should do okay, at least for PCI. |
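| |
| The lookup order described above can be sketched as follows. This is one |
| reading of the description, not the skiboot code: the node structure is |
| hypothetical, with one field standing in for each device tree property. |

```c
#include <stddef.h>

/* Hypothetical node model; each field stands in for a device tree property. */
struct node_stub {
	const struct node_stub *parent;
	const char *slot_label;          /* "ibm,slot-label" */
	const char *slot_location_code;  /* "ibm,slot-location-code" */
	const char *loc_code;            /* "ibm,loc-code" */
};

/* Pick the value to use for a node's ibm,loc-code, or NULL if none is found. */
static const char *pick_loc_code(const struct node_stub *n)
{
	/* Existing logic: look to the parent first. */
	if (n->parent) {
		if (n->parent->slot_label)
			return n->parent->slot_label;
		if (n->parent->slot_location_code)
			return n->parent->slot_location_code;
	}
	/* New fallbacks: the node's own slot-location-code, then a loc-code. */
	if (n->slot_location_code)
		return n->slot_location_code;
	return n->loc_code;
}
```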
| |
| Since v6.3-rc2: |
| |
| - core/pci: Use PHB io-base-location by default for PHB slots |
| |
| On witherspoon only the GPU slots and the three pluggable PCI slots |
| (SLOT0, 1, 2) have platform defined slot names. For builtin devices such |
| as the SATA controller or the PLX switch that fans out to the GPU slots |
| we have no location codes which some people consider an issue. |
| |
| This patch addresses the problem by making the ibm,slot-location-code for |
| the root port device default to the ibm,io-base-location-code which is |
| typically the location code for the system itself. |
| |
| e.g. :: |
| |
| pciex@600c3c0100000/ibm,loc-code |
| "UOPWR.0000000-Node0-Proc0" |
| |
| pciex@600c3c0100000/pci@0/ibm,loc-code |
| "UOPWR.0000000-Node0-Proc0" |
| |
| pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code |
| "UOPWR.0000000-Node0" |
| |
| The PHB node, and the root complex nodes have a loc code of the |
| processor they are attached to, while the usb-xhci device under the |
| root port has a location code of the system itself. |
| |
| - hw/phb4: Read ibm,loc-code from PBCQ node |
| |
| On P9 the PBCQs are subdivided by stacks which implement the PCI Express |
| logic. When phb4 was forked from phb3 most of the properties that were |
| in the pbcq node moved into the stack node, but ibm,loc-code was not one |
| of them. This patch fixes the phb4 init sequence to read the base |
| location code from the PBCQ node (parent of the stack node) rather than |
| the stack node itself. |
| |
| |
| .. _skiboot-6.3-OpenCAPI: |
| |
| OpenCAPI |
| -------- |
| - npu2/hw-procedures: Fix parallel zcal for opencapi |
| |
| For opencapi, we currently do impedance calibration when initializing |
| the PHY for the device, which could run in parallel if we have |
| multiple opencapi devices. But if 2 devices are on the same |
| obus, the 2 calibration sequences could overlap, which likely yields |
| bad results and is useless anyway since it only needs to be done once |
| per obus. |
| |
| This patch splits the opencapi PHY reset in 2 parts: |
| |
| - a 'init' part called serially at boot. That's when zcal is done. If |
| we have 2 devices on the same socket, the zcal won't be redone, |
| since we're called serially and we'll see it has already been done for |
| the obus |
| - a 'reset' part called during fundamental reset as a prereq for link |
| training. It does the PHY setup for a set of lanes and the dccal. |
| |
| The PHY team confirmed there's no dependency between zcal and the |
| other reset steps and it can be moved earlier. |
| - npu2-hw-procedures: Fix zcal in mixed opencapi and nvlink mode |
| |
| The zcal procedure needs to be run once per obus. We keep track of |
| which obus is already calibrated in an array indexed by the obus |
| number. However, the obus number is inferred from the brick index, |
| which works well for nvlink but not for opencapi. |
| |
| Create an obus_index() function, which, from a device, returns the |
| correct obus index, irrespective of the device type. |
| - npu2-opencapi: Fix adapter reset when using 2 adapters |
| |
| If two opencapi adapters are on the same obus, we may try to train the |
| two links in parallel at boot time, when all the PCI links are being |
| trained. Both links use the same i2c controller to handle the reset |
| signal, so some care is needed to make sure resetting one doesn't |
| interfere with the reset of the other. We need to keep track of the |
| current state of the i2c controller (and use locking). |
| |
| This went mostly unnoticed as you need to have 2 opencapi cards on the |
| same socket and links tended to train anyway because of the retries. |
| - npu2-opencapi: Extend delay after releasing reset on adapter |
| |
| Give more time to the FPGA to process the reset signal. The previous |
| delay, 5ms, is too short for newer adapters with bigger FPGAs. Extend |
| it to 250ms. |
| Ultimately, that delay will likely end up being added to the opencapi |
| specification, but we are not there yet. |
| - npu2-opencapi: ODL should be in reset when enabled |
| |
| We haven't hit any problem so far, but from the ODL designer, the ODL |
| should be in reset when it is enabled. |
| |
| The ODL remains in reset until we start a fundamental reset to |
| initiate link training. We still assert and deassert the ODL reset |
| signal as part of the normal procedure just before training the |
| link. Asserting is therefore useless at boot, since the ODL is already |
| in reset, but we keep it as it's only a scom write and it's needed |
| when we reset/retrain from the OS. |
| - npu2-opencapi: Keep ODL and adapter in reset at the same time |
| |
| Split the function to assert and deassert the reset signal on the ODL, |
| so that we can keep the ODL in reset while we reset the adapter, |
| therefore having a window where both sides are in reset. |
| |
| It is actually not required with our current DLx at boot time, but I |
| need to split the ODL reset function for the following patch and it |
| will become useful/required later when we introduce resetting an |
| opencapi link from the OS. |
| - npu2-opencapi: Setup perf counters to detect CRC errors |
| |
| It's possible to set up performance counters for the PLL to detect |
| various conditions for the links in nvlink or opencapi mode. Since |
| those counters are currently unused, let's configure them when an obus |
| is in opencapi mode to detect CRC errors on the link. Each link has |
| two counters: |
| |
| - CRC error detected by the host |
| - CRC error detected by the DLx (NAK received by the host) |
| |
| We also dump the counters shortly after the link trains, but they can |
| be read multiple times through cronus, pdbg or linux. The counters are |
| configured to be reset after each read. |
| |
| Since v6.3-rc1: |
| |
| - opal/hmi: Never trust a cow! |
| |
| With opencapi, it's fairly common to trigger HMIs during AFU |
| development on the FPGA, by not replying in time to an NPU command, |
| for example. So shift the blame reported by that cow to avoid crowding |
| my mailbox. |
| - hw/npu2: Dump (more) npu2 registers on link error and HMIs |
| |
| We were already logging some NPU registers during an HMI. This patch |
| cleans up a bit how it is done and separates what is global from what |
| is specific to nvlink or opencapi. |
| |
| Since we can now receive an error interrupt when an opencapi link goes |
| down unexpectedly, we also dump the NPU state but we limit it to the |
| registers of the brick which hit the error. |
| |
| The list of registers to dump was worked out with the hw team to |
| allow for proper debugging. For each register, we print the name as |
| found in the NPU workbook, the scom address and the register value. |
| - hw/npu2: Report errors to the OS if an OpenCAPI brick is fenced |
| |
| Now that the NPU may report interrupts due to the link going down |
| unexpectedly, report those errors to the OS when queried by the |
| 'next_error' PHB callback. |
| |
| The hardware doesn't support recovery of the link when it goes down |
| unexpectedly. So we report the PHB as dead, so that the OS can log the |
| proper message, notify the drivers and take the devices down. |
| - hw/npu2: Fix OpenCAPI PE assignment |
| |
| When we support mixing NVLink and OpenCAPI devices on the same NPU, we're |
| going to have to share the same range of 16 PE numbers between NVLink and |
| OpenCAPI PHBs. |
| |
| For OpenCAPI devices, PE assignment is only significant for determining |
| which System Interrupt Log register is used for a particular brick - unlike |
| NVLink, it doesn't play any role in determining how links are fenced. |
| |
| Split the PE range into a lower half which is used for NVLink, and an upper |
| half that is used for OpenCAPI, with a fixed PE number assigned per brick. |
| |
| As the PE assignment for OpenCAPI devices is fixed, set the PE once |
| during device init and then ignore calls to the set_pe() operation. |
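| |
| A minimal sketch of such a split, under stated assumptions: all names are |
| hypothetical, the OpenCAPI PE is assumed to be the start of the upper half |
| plus the brick index, and rejecting upper-half PEs for NVLink is an |
| illustrative choice rather than something the patch is known to do. |

```c
#include <stdbool.h>

#define MAX_PE_NUM     16                 /* PE range shared by the NPU */
#define OCAPI_PE_BASE  (MAX_PE_NUM / 2)   /* upper half is for OpenCAPI */

enum dev_type_stub { DEV_NVLINK, DEV_OPENCAPI };

/* Hypothetical device model for this example. */
struct npu_dev_stub {
	enum dev_type_stub type;
	unsigned int brick_index;
	unsigned int pe;
};

/* OpenCAPI: the PE is fixed per brick and set once at device init. */
static void ocapi_dev_init_stub(struct npu_dev_stub *dev, unsigned int brick)
{
	dev->type = DEV_OPENCAPI;
	dev->brick_index = brick;
	dev->pe = OCAPI_PE_BASE + brick;
}

/* set_pe(): ignored for OpenCAPI, limited to the lower half for NVLink. */
static int set_pe_stub(struct npu_dev_stub *dev, unsigned int pe)
{
	if (dev->type == DEV_OPENCAPI)
		return 0;       /* assignment is fixed, silently ignore */
	if (pe >= OCAPI_PE_BASE)
		return -1;      /* assumed: upper half reserved for OpenCAPI */
	dev->pe = pe;
	return 0;
}
```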
| |
| - opal-api: Reserve 2 OPAL API calls for future OpenCAPI LPC use |
| |
| OpenCAPI Lowest Point of Coherency (LPC) memory is going to require |
| some extra OPAL calls to set up NPU BARs. These calls will most likely be |
| called OPAL_NPU_LPC_ALLOC and OPAL_NPU_LPC_RELEASE, we're not quite ready |
| to upstream that code yet though. |
| |
| |
| |
| NVLINK2 |
| ------- |
| - npu2: Allow ATSD for LPAR other than 0 |
| |
| Each XTS MMIO ATSD# register is accompanied by another register - |
| XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD |
| transactions. |
| |
| When a host system passes a GPU through to a guest, we need to enable |
| some ATSD for an LPAR. At the moment the host assigns one ATSD to |
| an NVLink bridge and this maps it to an LPAR when the GPU is assigned to |
| the LPAR. The link number is used as the ATSD index. |
| |
| ATSD6&7 stay mapped to the host (LPAR=0) all the time, which seems to be |
| an acceptable price for the simplicity. |
| - npu2: Add XTS_BDF_MAP wildcard refcount |
| |
| Currently PID wildcard is programmed into the NPU once and never cleared |
| up. This works for the bare metal as MSR does not change while the host |
| OS is running. |
| |
| However, with device virtualization we need to keep track of the use of |
| wildcard entries and clear them up before switching a GPU from a host to |
| a guest or vice versa. |
| |
| This adds refcount to a NPU2, one counter per wildcard entry. The index |
| is a short lparid (4 bits long) which is allocated in opal_npu_map_lpar() |
| and should be smaller than NPU2_XTS_BDF_MAP_SIZE (defined as 16). |
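| |
| A per-entry refcount of this kind can be sketched as below. The structure |
| and function names are hypothetical, and a boolean stands in for the |
| hardware wildcard entry; only the map size of 16 comes from the text. |

```c
#include <stdbool.h>

#define XTS_BDF_MAP_SIZE_STUB 16   /* mirrors NPU2_XTS_BDF_MAP_SIZE */

/* Hypothetical NPU model: one counter per wildcard entry. */
struct npu_stub {
	unsigned int wildcard_refs[XTS_BDF_MAP_SIZE_STUB];
	bool wildcard_programmed[XTS_BDF_MAP_SIZE_STUB]; /* models the hw entry */
};

/* Take a reference; program the entry on the first user. */
static int wildcard_get(struct npu_stub *npu, unsigned int lparshort)
{
	if (lparshort >= XTS_BDF_MAP_SIZE_STUB)
		return -1;
	if (npu->wildcard_refs[lparshort]++ == 0)
		npu->wildcard_programmed[lparshort] = true;
	return 0;
}

/* Drop a reference; clear the entry when the last user goes away. */
static int wildcard_put(struct npu_stub *npu, unsigned int lparshort)
{
	if (lparshort >= XTS_BDF_MAP_SIZE_STUB ||
	    npu->wildcard_refs[lparshort] == 0)
		return -1;
	if (--npu->wildcard_refs[lparshort] == 0)
		npu->wildcard_programmed[lparshort] = false;
	return 0;
}
```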
| |
| Since v6.3-rc2: |
| |
| - npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default |
| |
| V100 GPUs are known to violate the NVLink2 protocol in some cases (one is |
| when memory was accessed by the CPU and then by the GPU using so called |
| block linear mapping) and issue double probes to the NPU, which can cope with this |
| problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO |
| snarfing a cp_m") is not set in the CQ_SM Misc Config register #0. |
| If the bit is set (which is the case today), NPU issues the machine |
| check stop. |
| |
| The snarfing feature is designed to detect 2 probes in flight and combine |
| them into one. |
| |
| This adds a new "opal-npu2-snarf-cpm" nvram variable which controls |
| CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check |
| stop from happening. |
| |
| This disables snarfing by default as otherwise a broken GPU driver can |
| crash the entire box even when a GPU is passed through to a guest. |
| This provides a dial to allow regression tests (might be useful for |
| a bare metal). To enable snarfing, the user needs to run: :: |
| |
| sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable |
| |
| and reboot the host system. |
| |
| - hw/npu2: Show name of opencapi error interrupts |
| |
| |
| Debugging and simulation |
| ------------------------ |
| |
| - external/mambo: Error out if kernel is too large |
| |
| If you're trying to boot a gigantic kernel in mambo (which you can |
| reproduce by building a kernel with CONFIG_MODULES=n) you'll get |
| misleading errors like: :: |
| |
| WARNING: 0: (0): [0:0]: Invalid/unsupported instr 0x00000000[INVALID] |
| WARNING: 0: (0): PC(EA): 0x0000000030000010 PC(RA):0x0000000030000010 MSR: 0x9000000000000000 LR: 0x0000000000000000 |
| WARNING: 0: (0): numInstructions = 0 |
| WARNING: 1: (1): [0:0]: Invalid/unsupported instr 0x00000000[INVALID] |
| WARNING: 1: (1): PC(EA): 0x0000000000000E40 PC(RA):0x0000000000000E40 MSR: 0x9000000000000000 LR: 0x0000000000000000 |
| WARNING: 1: (1): numInstructions = 1 |
| WARNING: 1: (1): Interrupt to 0x0000000000000E40 from 0x0000000000000E40 |
| INFO: 1: (2): ** Execution stopped: Continuous Interrupt, Instruction caused exception, ** |
| |
| So add an error to skiboot.tcl to warn the user before this happens. |
| Making PAYLOAD_ADDR further back is one way to do this but if there's a |
| less gross way to generally work around this very niche problem, I can |
| suggest that instead. |
| - external/mambo: Populate kernel-base-address in the DT |
| |
| skiboot.tcl defines PAYLOAD_ADDR as 0x20000000, which is also the |
| default in skiboot unless kernel-base-address is set in the device |
| tree. |
| |
| If you change PAYLOAD_ADDR to something else for mambo, skiboot won't |
| see it because it doesn't set that DT property, so fix it so that it does. |
| - external/mambo: allow CPU targeting for most debug utils |
| |
| Debug util functions target CPU 0:0:0 by default. Some can be |
| overridden explicitly per invocation, and others can't at all. |
| Even for those that can be overridden, it is a pain to type |
| them out when you're debugging a particular thread. |
| |
| Provide a new 'target' function that allows the default CPU |
| target to be changed. Wire up that default to all other utils. |
| Provide a new 'S' step command which only steps the target CPU. |
| - qemu: bt device isn't always hanging off / |
| |
| Just use the normal for_each_compatible instead. |
| |
| Otherwise in the qemu model as executed by op-test, |
| we wouldn't go down the astbmc_init() path, thus not having flash. |
| - devicetree: Add p9-simics.dts |
| |
| Add a p9-based devicetree that's suitable for use with Simics. |
| - devicetree: Move power9-phb4.dts |
| |
| Clean up the formatting of power9-phb4.dts and move it to |
| external/devicetree/p9.dts. This sets us up to include it as the basis |
| for other trees. |
| - devicetree: Add nx node to power9-phb4.dts |
| |
| A (non-qemu) p9 without an nx node will assert in p9_darn_init(): :: |
| |
| dt_for_each_compatible(dt_root, nx, "ibm,power9-nx") |
| break; |
| if (!nx) { |
| if (!dt_node_is_compatible(dt_root, "qemu,powernv")) |
| assert(nx); |
| return; |
| } |
| |
| Since NX is this essential, add it to the device tree. |
| - devicetree: Fix typo in power9-phb4.dts |
| |
| Change "impi" to "ipmi". |
| - devicetree: Fix syntax error in power9-phb4.dts |
| |
| Remove the extra space causing this: :: |
| |
| Error: power9-phb4.dts:156.15-16 syntax error |
| FATAL ERROR: Unable to parse input tree |
| - core/init: enable machine check on secondaries |
| |
| Secondary CPUs currently run with MSR[ME]=0 during boot, which means |
| if they take a machine check, the system will checkstop. |
| |
| Enable ME where possible and allow them to print registers. |
| |
| Utilities |
| --------- |
| - pflash: Don't try update RO ToC |
| |
| In the future it's likely the ToC will be marked as read-only. Don't |
| error out by assuming it's writable. |
| - pflash: Support encoding/decoding ECC'd partitions |
| |
| With the new --ecc option, pflash can add/remove ECC when |
| reading/writing flash partitions protected by ECC. |
| |
| This is *not* flawless with current PNORs out in the wild though, as |
| they do not typically fill the whole partition with valid ECC data, so |
| you have to know how big the valid ECC'd data is and specify the size |
| manually. Note that for some partitions this is practically impossible |
| without knowing the details of the content of the partition. |
| |
| A future patch is likely to introduce an option to "stop reading data |
| when ECC starts failing and assume everything is okay rather than error |
| out" to support reading the "valid" data from existing PNOR images. |
| |
| Since v6.3-rc2: |
| |
| - opal-prd: Fix memory leak in is-fsp-system check |
| - opal-prd: Check malloc return value |