Node down with multiple DISK "scsi.cmd.pastTimeToLive:error"

Category:: fas-systems

Specialty:: hw

Applies to

FAS 2820
ONTAP 9
Internal shelf

Issue

The node goes down after logging scsi.cmd.pastTimeToLive:error events for multiple disks:

[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000046cd85e00:00000200.
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000047237f760:00000008.
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8f:000000046c3c7e00:00000400.
...
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.8: request failed after try #1: cdb 0x88:000000047237ef90:00000008.
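When many of these events arrive in the same second, it can help to tally them per disk to confirm that several drives are affected rather than one. The following is a minimal Python sketch, not a NetApp tool: the function names and the SAMPLE text are illustrative, and it assumes the cdb field is laid out as opcode:LBA:length in hexadecimal. The opcode names (0x88 READ(16), 0x8a WRITE(16), 0x8f VERIFY(16)) are the standard SCSI ones.

```python
import re
from collections import Counter

# Standard SCSI 16-byte command opcodes that appear in the sample events.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)", 0x8F: "VERIFY(16)"}

# Assumption: the cdb field is "opcode:LBA:length", all in hexadecimal.
EVENT = re.compile(
    r"scsi\.cmd\.pastTimeToLive:error\]: Disk device (?P<disk>[^\s:]+): "
    r"request failed after try #(?P<attempt>\d+): "
    r"cdb 0x(?P<op>[0-9a-fA-F]{2}):(?P<lba>[0-9a-fA-F]+):(?P<length>[0-9a-fA-F]+)"
)

def parse_events(text):
    """Return one dict per pastTimeToLive event found anywhere in text."""
    events = []
    for match in EVENT.finditer(text):
        opcode = int(match.group("op"), 16)
        events.append({
            "disk": match.group("disk"),
            "attempt": int(match.group("attempt")),
            "opcode": OPCODES.get(opcode, hex(opcode)),
            "lba": int(match.group("lba"), 16),
            "blocks": int(match.group("length"), 16),
        })
    return events

def count_by_disk(events):
    """Count events per disk device name."""
    return Counter(event["disk"] for event in events)

# Sample text copied from the EMS excerpt in this article.
SAMPLE = (
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: "
    "Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000046cd85e00:00000200. "
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: "
    "Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000047237f760:00000008. "
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: "
    "Disk device 0b.00.0: request failed after try #1: cdb 0x8f:000000046c3c7e00:00000400. "
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: "
    "Disk device 0b.00.8: request failed after try #1: cdb 0x88:000000047237ef90:00000008."
)

if __name__ == "__main__":
    for disk, total in sorted(count_by_disk(parse_events(SAMPLE)).items()):
        print(disk, total)
```

Because the pattern is applied with finditer over the whole text, the sketch works whether the events are one per line or run together on a single line.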

The partner node sends an HA Group Notification (CONTROLLER TAKEOVER COMPLETE AUTOMATIC - Communication Error) ALERT, and the following EMS log is recorded:

[?] Sat Dec 28 08:48:01 +0900 [node02: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner

The shelf IOM port state shows NO SIGNAL:

Timestamp: Sat Jan 4 08:33:20 JST 2025
Shelf name: 0c.shelf0
Channel: 0c
Module: A
Shelf id: 0
Shelf UUID: 50:0a:09:80:08:6f:fb:24
Shelf S/N: SHJSG2418000037
Term switch: N/A
Shelf state: ONLINE
Module state: OK

                        Partial Path  Link    Invalid  Running    Loss   Phy      CRC    Phy
Disk         Port       Timeout       Rate    DWord    Disparity  Dword  Reset    Error  Change
Id           State      Value (ms)    (Gb/s)  Count    Count      Count  Problem  Count  Count
--------------------------------------------------------------------------------------------
[HST0/P0:0]  NO SIGNAL  7             NA      0        0          0      0        0      974
[HST1/P0:1]  NO SIGNAL  7             NA      1299     1298       0      0        0      974
[HST2/P0:2]  NO SIGNAL  7             NA      310      307        0      0        0      974
[HST3/P0:3]  NO SIGNAL  7             NA      85       81         0      0        0      974
[HST4/P1:0]  OK         7             12.0    0        0          0      0        0      3
[HST5/P1:1]  OK         7             12.0    0        0          0      0        0      3
[HST6/P1:2]  OK         7             12.0    0        0          0      0        0      3
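To pick out the affected host ports from a long phy state listing, the rows can be parsed and filtered on the port state. The following is a minimal Python sketch under stated assumptions: the function names, the counter key names, and SAMPLE_ROWS are illustrative, and the six trailing numbers are assumed to follow the header order shown in the output above (invalid DWord, running disparity, loss of DWord, phy reset problem, CRC error, phy change).

```python
import re

# Assumed counter order, taken from the column headers of the output above.
COUNTERS = (
    "invalid_dword", "running_disparity", "loss_dword",
    "phy_reset_problem", "crc_error", "phy_change",
)

# One phy row: [id] state timeout rate followed by six counters.
ROW = re.compile(
    r"\[(?P<phy>[^\]]+)\]\s+(?P<state>[A-Za-z ]+?)\s+(?P<timeout>\d+)\s+"
    r"(?P<rate>NA|\d+(?:\.\d+)?)\s+(?P<counts>\d+(?:\s+\d+){5})"
)

def parse_phy_rows(text):
    """Return one dict per phy row found in text."""
    rows = []
    for match in ROW.finditer(text):
        row = {
            "phy": match.group("phy"),
            "state": match.group("state"),
            "timeout_ms": int(match.group("timeout")),
            "rate": match.group("rate"),  # kept as text because it may be "NA"
        }
        values = [int(value) for value in match.group("counts").split()]
        row.update(zip(COUNTERS, values))
        rows.append(row)
    return rows

def ports_without_signal(rows):
    """Return the phy IDs whose state is anything other than OK."""
    return [row["phy"] for row in rows if row["state"] != "OK"]

# Rows copied from the shelf output in this article.
SAMPLE_ROWS = """
[HST0/P0:0] NO SIGNAL 7 NA 0 0 0 0 0 974
[HST1/P0:1] NO SIGNAL 7 NA 1299 1298 0 0 0 974
[HST2/P0:2] NO SIGNAL 7 NA 310 307 0 0 0 974
[HST3/P0:3] NO SIGNAL 7 NA 85 81 0 0 0 974
[HST4/P1:0] OK 7 12.0 0 0 0 0 0 3
[HST5/P1:1] OK 7 12.0 0 0 0 0 0 3
[HST6/P1:2] OK 7 12.0 0 0 0 0 0 3
"""

if __name__ == "__main__":
    rows = parse_phy_rows(SAMPLE_ROWS)
    for phy in ports_without_signal(rows):
        print(phy)
```

In the sample, all four lanes of host port P0 report NO SIGNAL with a phy change count of 974, while the P1 lanes stay at 12.0 Gb/s with a count of 3.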

Multiple drives cannot be read by the node, and the aggregate fails due to a multi-disk error:

Mon Jun 02 10:17:22 +0700 [node-02: config_thread: raid.vol.failed:notice]: Aggregate aggr1_n2: Failed due to multi-disk error.
Mon Jun 02 10:17:23 +0700 [node-02: config_thread: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. aggr aggr1_n2: raid volfsm, fatal multi-disk error.. Raid type - raid_dp Group name plex0/rg0 state DOUBLEDEGRADED. 1 disk failed in the group. Disk 0a.00.2P1 Shelf 0 Bay 2 [NETAPP X336_TTCRE04TA07 NA04] S/N [Y3F0A2XXXXXX] UID [6000039C:E82AC314:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000] error: disk failed..

The node goes down due to the multi-disk failure:

Mon Jun 02 10:17:23 +0700 [node-02: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner