Node experiences an abnormal reboot due to Multi Disk Failure "Permanent errors on all HA mailbox disks (while marshalling header)" after SAS adapter reset
Applies to
- ONTAP 9
- SAS adapter
- NS224
- 9.13.1 to 9.13.1P1
- 9.12.1 to 9.12.1P6
- 9.9.1P2 to 9.9.1P14
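To verify whether a node runs one of the affected releases, check the cluster version from the clustershell and compare it against the ranges above (the cluster name below is a placeholder):
cluster1::> version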
Issue
- System may reboot unexpectedly without any panic string
- Node goes down due to a multidisk panic, followed by 'I/O to disk is pending' errors on all mailbox disks
- Takeover and giveback complete without further intervention
- System loses access to multiple disks, leading to a reboot
Example:
================ Log #1 start time Tue Jul 18 06:07:53 2023
mbx_inst_header_marshal:Error writing to all mailbox disk. mbx_sequencNo= 84496746
================ Log #1 end time Tue Jul 18 06:07:53 2023
================ Log #2 start time Tue Jul 18 06:08:13 2023
BIOS Version: 11.
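To locate this signature on a running system, query the EMS log for the fatal mailbox event (the node name is a placeholder; the message name is taken from the events shown below):
cluster1::> event log show -node Node-01 -message-name cf.multidisk.fatalProblem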
- Partner node reports missing disks:
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.0P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.2P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.4P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.6P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_instanceWorker: cf.multidisk.fatalProblem:debug]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
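As a follow-up check, the HA mailbox disks reported in these events can be listed from advanced privilege; this is a minimal sketch, with Node-01 as a placeholder:
cluster1::> set -privilege advanced
cluster1::*> storage failover mailbox-disk show -node Node-01
cluster1::*> set -privilege admin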
- No panic strings are present during the takeover and giveback operations
- SAS adapter reset is detected, leading to the "missing" shelves and disks:
[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: Resetting SAS adapter 0a.
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0a', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0b', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0c', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0d', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 0: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 1: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 2: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 3: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
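To correlate the mailbox write failures with the adapter reset, review the reset events and the current state of the SAS ports; node names are placeholders:
cluster1::> event log show -node Node-01 -message-name sas.adapter.reset
cluster1::> storage port show -node Node-01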
- No errors observed or reported on the SAS adapter prior to the reset
- NFS requests are not served correctly prior to the panic and failover
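Because no adapter errors are reported before the reset, reviewing all sas.adapter.* events around the failure window can help confirm that the reset was the first anomaly. A hedged example using a wildcard and a time-range query (timestamps are illustrative):
cluster1::> event log show -node Node-01 -message-name sas.adapter.* -time "7/27/2023 03:00:00".."7/27/2023 03:30:00"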
