Failed NSM100 leads to Checksum Errors and Disk Redundancy Failures
Applies to
- NS224 shelf
- Disk checksum errors during SCRUB
Issue
- Multiple checksum errors are reported on different disks:
[node_name: raidio_thread: raid_rg_scrub_cksum_err_1:notice]: params: {'disk_rpm': 'N/A', 'vendor': 'NETAPP ', 'firmware_revision': 'NA51', 'shelf': '24', 'disk_info': 'Disk /aggr_name/plex0/rg0/e0c.24.0.14P1 Shelf 24 Bay 14 [NETAPP X4010WBORA1T9NTE NA51] (...)
[node_name: raidio_thread: raid_rg_scrub_cksum_err_1:notice]: params: {'disk_rpm': 'N/A', 'vendor': 'NETAPP ', 'firmware_revision': 'NA51', 'shelf': '24', 'disk_info': 'Disk /aggr_name/plex0/rg0/e0c.24.0.7P1 Shelf 24 Bay 7 [NETAPP X4010WBORA1T9NTE NA51] (...)
[node_name: raidio_thread: raid_rg_scrub_cksum_err_1:notice]: params: {'disk_rpm': 'N/A', 'vendor': 'NETAPP ', 'firmware_revision': 'NA51', 'shelf': '24', 'disk_info': 'Disk /aggr_name/plex0/rg0/e0d.24.3.12P1 Shelf 24 Bay 12 [NETAPP X4010WBORA1T9NTE NA51] (...)
[node_name: raidio_thread: raid_cksum_verify_error_file_1:notice]: params: {'firmware_revision': 'NA51', [...], 'disk_info': 'Disk /aggr_name/plex0/rg0/e0c.24.0.15P1 Shelf 24 Bay 15 [NETAPP X4010WBORA1T9NTE NA51] [...], 'error': 'checksum computation mismatched', 'model': 'X4010WBORA1T9NTE', 'ino_type': ''}
- The NSM100 shelf module reports errors related to sensors, connectivity, and hardware components. Examples:
[node_name: scsi_cmdblk_strthr_admin: scsi.cmd.notReadyConditionEMSOnly:debug]: Enclosure services device 0x.24.1.99L0: Device returns not yet ready: CDB 0x1c: Sense Data SCSI:not ready - (0x2 - 0x35 0x2 0x0)(0).
[node_name: scsi_cmdblk_strthr_admin: scsi.cmd.mcc.lunmgr.io.error:debug]: Disk device S/N 22323T800648 - CDB 0x28:0b652ef8:0008 - (scsi error: command aborted) - Sense Data SCSI:no sense - (0x0 - 0x0 0x0 0x0)(DT 594). (HA status 0x15) - (out_status_flags 0x8)
[node_name: scsi_cmdblk_strthr_admin: scsi.cmd.mcc.lunmgr.io.error:debug]: Disk device S/N 22323T800065 - CDB 0x9a:0000000014c1c180:0005:002c - (scsi error: command aborted) - Sense Data SCSI:no sense - (0x0 - 0x0 0x0 0x0)(DT 966). (HA status 0x15) - (out_status_flags 0x8)
[node_name: dsa_worker3: ses.status.temperatureWarning:alert]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 29 C (84 F). This element is on the unknown location.
[node_name: dsa_worker3: ses.status.electronicsWarn:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x environmental monitoring warning for SES electronics 2: communication error. ; enclosure services hardware failed This element is on the rear of the shelf at the bottom, on module B.
[node_name: dsa_worker3: ses.status.ModuleWarn:alert]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x PCI switch warning for PCI Switch 2: communication error. This element is on the rear of the shelf at the bottom, on module B.
[node_name: dsa_worker3: ses.status.ACPWarn:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x ACP Processor warning for shelf ACP processor 2: communication error. ; Alternate Control Path hardware failed e B.
[node_name: dsa_worker3: ses.status.battery.error:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x battery failure error for Coin Battery 2: not installed or hardware failure. This element is on the rear of the shelf, in bottom module (B).
[node_name: dsa_worker3: ses.status.etherConn.warn:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x Ethernet connector warning for port e0a: cannot communicate with connector. This element is on the rear of the shelf at the bottom, on module B.
[node_name: dsa_worker3: ses.status.etherConn.warn:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x Ethernet connector warning for port e0b: cannot communicate with connector. This element is on the rear of the shelf at the bottom, on module B.
[node_name: dsa_worker3: ses.status.dimm.error:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x DIMM failure for Dimm Element 5: not installed or failed. This element is on the DIMM slot 1 in the bottom shelf module (B).
[node_name: dsa_worker3: ses.status.dimm.error:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x DIMM failure for Dimm Element 6: not installed or failed. This element is on the DIMM slot 2 in the bottom shelf module (B).
[node_name: dsa_worker3: ses.status.dimm.error:error]: NS224NSM100 (S/N SHJHU1234567890) shelf 24 on channel 0x DIMM failure for Dimm Element 7: not installed or failed. This element is on the DIMM slot 3 in the bottom shelf module (B).
- The issue persists after updating ONTAP and the NSM100 firmware.
- The issue persists after the NSM100 module is reseated.
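
To check whether a saved EMS log shows the same pattern (checksum errors spread across several bays of one shelf, together with SES warnings that all point at one shelf module), the lines can be tallied with a short script. The following is a minimal sketch, not a NetApp tool: the function name `summarize` and the regular expressions are assumptions based only on the line shapes shown in the examples above, and other ONTAP releases may format these events differently.

```python
import re
from collections import Counter

# Event name and severity, assuming lines shaped like the samples above:
# "[node_name: dsa_worker3: ses.status.dimm.error:error]: ..."
EVENT_RE = re.compile(r"^\[[^:\]]+:\s*[^:\]]+:\s*([^:\]]+):(\w+)\]:")
# "Shelf 24 Bay 14" as it appears in the disk_info field of checksum events.
BAY_RE = re.compile(r"Shelf (\d+) Bay (\d+)")
# Module letter in SES location text: "on module B" or "module (B)".
MODULE_RE = re.compile(r"module \(?([AB])\)?")


def summarize(lines):
    """Count checksum events per (shelf, bay) and SES events per module."""
    cksum_bays = Counter()
    ses_by_module = Counter()
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue  # not an EMS line in the expected shape
        event = match.group(1)
        if "cksum" in event:
            bay = BAY_RE.search(line)
            if bay:
                key = (int(bay.group(1)), int(bay.group(2)))
                cksum_bays[key] += 1
        elif event.startswith("ses.status."):
            module = MODULE_RE.search(line)
            if module:
                ses_by_module[module.group(1)] += 1
    return cksum_bays, ses_by_module
```

For example, `summarize(open("ems.log"))` (with `ems.log` as a hypothetical export of the EMS log) returns two counters. Several distinct bays in the first counter combined with a single module letter dominating the second is consistent with the symptoms described here. SES lines whose location text is truncated or absent are not counted per module.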