RHEL 7.9 host experiencing long I/O stalls on Lustre filesystem
Applies to
- RHEL 7.9
- Lustre
- corosync
- pacemaker
- E5700
- SANtricity OS 11.70.1R1, 11.70.2
Issue
A Red Hat Enterprise Linux 7.9 host experiences I/O stalls of more than 120 seconds on a Lustre filesystem, causing pacemaker/corosync to trigger an NMI (non-maskable interrupt).
The host shows a large number of repeating Recovered Error entries in the messages or syslog host log files:
1653449345 2022 May 25 03:29:05 hostname kern info kernel [ 5080.869325] sd 0:0:0:3: [sdc] tag#11 Sense Key : Recovered Error [current]
1653449345 2022 May 25 03:29:05 hostname kern info kernel [ 5080.869327] sd 0:0:0:3: [sdc] tag#11 Add. Sense: Select or reselect failure
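
To gauge how often these entries repeat and which devices they affect, the log can be tallied per SCSI device. The following is a minimal sketch, not a vendor tool: the function name count_recovered_errors and the regular expression are assumptions derived only from the two sample lines above, and other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Assumed pattern, modeled on the sample kernel lines quoted above:
#   sd 0:0:0:3: [sdc] tag#11 Sense Key : Recovered Error [current]
SENSE_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>\w+)\] tag#\d+ "
    r"Sense Key : Recovered Error"
)


def count_recovered_errors(lines):
    """Count 'Recovered Error' sense-key lines per (H:C:T:L, device)."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            counts[(match.group("hctl"), match.group("dev"))] += 1
    return counts
```

A possible way to use it, assuming the log is readable at /var/log/messages (the path may differ on a given host):

```python
with open("/var/log/messages", errors="replace") as log:
    for (hctl, dev), n in count_recovered_errors(log).most_common():
        print(f"{dev} ({hctl}): {n} Recovered Error entries")
```

Only the "Sense Key" line is counted, so each error is tallied once even though the kernel also emits a companion "Add. Sense" line for it.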