Takeover due to a Poisoned Transaction Layer Packet (PTLP) event on the PCI Bus
Applies to
- ONTAP 9
- AFF/FAS Systems with internal storage
Issue
- Takeover due to an Uncorrectable Machine Check Error:
PANIC: Uncorrectable Machine Check Error at CPU1. SKL_IIO Error: STATUS<0xf780000000010405>(VALID,OVERFLOW,UC,EN,ADDRV,PCC,S,AR,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0x1),MCACOD(0x405))IIO Machine Check from device(s):RPT(22,0,0):ErrSrcID(CorrSrc(0x1898),UCorrSrc(0)), PLX PCIE 9797 switch on Controller, PLX PCIE 9797 switch on Controller. , ADDR(0). in process idle: cpu1 on release 9.12.1P5 (C) on Tue Feb 13 10:00:03 PST 2024
- Analysis of the system logs shows the following message, which points to a specific drive (slot 14 in the output below):
"Poisoned Transaction Layer Packet (PTLP)"
This indicates an error on the link between these devices:
Dv[a824](27,0,0) in slot 14: PCI Device 144d:a824 in slot 14 on Controller
Br[9797](24,2,0): PLX PCIE 9797 switch on Controller
- EMS also reports errors for that drive:
Tue Feb 13 10:00:07 -0800 [cluster: scsi_cmdblk_strthr_admin: disk.timeout.flush.start:debug]: Aggressive timeout flush started on disk 0n.14.
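
The correlation above can be sketched in a few lines of Python. This is an illustrative sketch only, not a NetApp tool: the function names are hypothetical, the regular expressions are fitted to the sample lines in this article, and the bit positions assume the IA32_MCi_STATUS register layout documented in the Intel Software Developer's Manual. It also assumes that the number after the dot in the EMS disk name (0n.14) is the drive slot, as the example in this article suggests.

```python
import re

# Assumed IA32_MCi_STATUS flag bit positions (Intel SDM layout).
STATUS_FLAGS = [
    (63, "VALID"), (62, "OVERFLOW"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"), (56, "S"), (55, "AR"),
]

# Patterns fitted to the sample log lines in this article (hypothetical helpers).
PTLP_SLOT_RE = re.compile(r"Dv\[[0-9a-f]+\]\(\d+,\d+,\d+\) in slot (\d+)")
EMS_DISK_RE = re.compile(r"on disk (\w+\.\d+)")


def decode_mc_status(status):
    """Split a 64-bit machine check status value into its main fields."""
    return {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "corr_err_cnt": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,         # bits 31:16
        "mcacod": status & 0xFFFF,                # bits 15:0
    }


def slot_from_ptlp(line):
    """Return the drive slot named in a PTLP device line, or None."""
    match = PTLP_SLOT_RE.search(line)
    return int(match.group(1)) if match else None


def disk_from_ems(line):
    """Return the disk name from a disk.timeout.flush.start line, or None."""
    match = EMS_DISK_RE.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    ptlp = "Dv[a824](27,0,0) in slot 14: PCI Device 144d:a824 in slot 14 on Controller"
    ems = "Aggressive timeout flush started on disk 0n.14."

    print(decode_mc_status(0xF780000000010405))
    slot = slot_from_ptlp(ptlp)
    disk = disk_from_ems(ems)
    # Assumption: the number after the dot in the disk name is the slot.
    same_drive = disk is not None and int(disk.split(".")[1]) == slot
    print(slot, disk, same_drive)
```

If the slot in the PTLP message and the disk named in EMS agree, both symptoms point at the same drive.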
