RHEL 7.9 host experiencing long I/O stalls on Lustre filesystem

Category: e-series-systems

Specialty: esg

Applies to

RHEL 7.9
Lustre
corosync
pacemaker
E5700
SANtricity OS 11.70.1R1, 11.70.2

Issue

A Red Hat Enterprise Linux 7.9 host experiences I/O stalls longer than 120 seconds on a Lustre filesystem, causing Pacemaker/Corosync to trigger an NMI (non-maskable interrupt).

The host logs a large number of repeated Recovered Error entries in the messages or syslog log files:

1653449345 2022 May 25 03:29:05 hostname kern info kernel [ 5080.869325] sd 0:0:0:3: [sdc] tag#11 Sense Key : Recovered Error [current]

1653449345 2022 May 25 03:29:05 hostname kern info kernel [ 5080.869327] sd 0:0:0:3: [sdc] tag#11 Add. Sense: Select or reselect failure
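
To gauge how often these entries repeat and which devices they affect, the log can be tallied per SCSI device. The following is a minimal Python sketch, not a NetApp-provided tool: the function name count_recovered_errors and the regular expression are illustrative assumptions based only on the two sample lines shown above, and other log formats may need a different pattern.

```python
import re
from collections import Counter

# Assumed pattern, derived from the sample log lines in this article:
#   sd 0:0:0:3: [sdc] tag#11 Sense Key : Recovered Error [current]
SENSE_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>\w+)\] .*Sense Key : Recovered Error"
)


def count_recovered_errors(lines):
    """Count 'Recovered Error' sense-key entries per (H:C:T:L, device) pair."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            counts[(match.group("hctl"), match.group("dev"))] += 1
    return counts


# The two sample lines from this article; only the first carries the sense key.
sample = [
    "1653449345 2022 May 25 03:29:05 hostname kern info kernel "
    "[ 5080.869325] sd 0:0:0:3: [sdc] tag#11 Sense Key : Recovered Error [current]",
    "1653449345 2022 May 25 03:29:05 hostname kern info kernel "
    "[ 5080.869327] sd 0:0:0:3: [sdc] tag#11 Add. Sense: Select or reselect failure",
]

if __name__ == "__main__":
    for (hctl, dev), n in count_recovered_errors(sample).most_common():
        print(f"{dev} ({hctl}): {n} Recovered Error entries")
```

To run this against a real host, the sample list could be replaced with the lines of the host log file (for example, an open file object for the messages file, whose path depends on the syslog configuration).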