NetApp Knowledge Base

Node experiences an abnormal reboot due to Multi Disk Failure "Permanent errors on all HA mailbox disks (while marshalling header)" after SAS adapter reset

Category:
fas-systems
Specialty:
hw

Applies to

  • ONTAP 9
  • SAS-Adapter
  • NS224
  • 9.13.1 to 9.13.1P1
  • 9.12.1 to 9.12.1P6
  • 9.9.1P2 to 9.9.1P14

Issue

  • System may reboot unexpectedly without any panic string
  • Node goes down due to a multidisk panic, followed by an 'I/O to disk is pending' error on all mailbox disks
  • Takeover and giveback complete without further intervention
  • System loses access to multiple disks, leading to a reboot

Example:

================ Log #1 start time Tue Jul 18 06:07:53 2023
mbx_inst_header_marshal:Error writing to all mailbox disk. mbx_sequencNo= 84496746
================ Log #1 end time Tue Jul 18 06:07:53 2023
================ Log #2 start time Tue Jul 18 06:08:13 2023
BIOS Version: 11.

  • Partner node reports missing disks:

Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.0P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.2P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.4P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: params: {'disk_name': '0b.00.6P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', 'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', 'handle': '0x0', 'flag': '0x16', 'io_state': '1'}
Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_instanceWorker: cf.multidisk.fatalProblem:debug]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
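
When many such entries are present, it can help to list which mailbox disks reported the pending I/O. The sketch below assumes the EMS lines have the same shape as the excerpt above, with a `params:` payload that is a Python-style dictionary literal. The function name `mailbox_io_failures` and the regular expression are illustrative and not part of ONTAP.

```python
import ast
import re

# Matches "[node: source: event:severity]: params: {...}" as seen in the excerpt.
EVENT_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<source>[^:]+): (?P<event>[^:]+):(?P<severity>\w+)\]: "
    r"params: (?P<params>\{.*\})\s*$"
)

# Sample lines taken from the excerpt (UUIDs are redacted in the source).
SAMPLE_LINES = [
    "Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: "
    "params: {'disk_name': '0b.00.0P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', "
    "'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', "
    "'handle': '0x0', 'flag': '0x16', 'io_state': '1'}",
    "Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_diskOpsManager: fmmb_disk_io_replyFailed_1:debug]: "
    "params: {'disk_name': '0b.00.2P2', 'uuid': '6XXXXXXX:9XXXXXXX:5XXXXXXX:0XXXXXXX:', "
    "'status': 'I/O to disk is pending', 'operation': 'MBX_DISKIO_WRITE_1A', 'side': 'Local', "
    "'handle': '0x0', 'flag': '0x16', 'io_state': '1'}",
    "Thu Jul 27 03:17:36 +0800 [Node-01: fmmbx_instanceWorker: cf.multidisk.fatalProblem:debug]: "
    "Node encountered a multidisk error or other fatal error while waiting to be taken over. "
    "Permanent errors on all HA mailbox disks (while marshalling header).",
]


def mailbox_io_failures(lines):
    """Return (node, disk_name, status, operation) for each mailbox I/O failure line."""
    failures = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None or not match.group("event").startswith("fmmb_disk_io_replyFailed"):
            continue  # not a mailbox disk I/O failure entry
        params = ast.literal_eval(match.group("params"))  # safe parse of the dict literal
        failures.append(
            (match.group("node"), params["disk_name"], params["status"], params["operation"])
        )
    return failures
```

Calling `mailbox_io_failures(SAMPLE_LINES)` yields one tuple per failed mailbox disk and skips the `cf.multidisk.fatalProblem` line, which carries no `params` payload.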

  • No panic strings are present during the takeover and giveback operations
  • SAS adapter reset is detected, leading to the "missing" shelves and disks:

[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: Resetting SAS adapter 0a.
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0a', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0b', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0c', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0d', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 0: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 1: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 2: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 3: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).

  • No errors observed or reported on the SAS adapter prior to the reset
  • NFS requests are not served correctly prior to the panic and failover
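
The pattern described above, a `sas.adapter.reset` event followed by `cf.multidisk.fatalProblem`, can be searched for in a saved EMS log. This is a minimal sketch that relies only on line order, since the excerpt shows no timestamps on these lines. The function name `resets_before_fatal` is hypothetical, and the regular expression assumes the message text is exactly as shown in the excerpt.

```python
import re

# Message text as it appears in the excerpt: "Resetting SAS adapter 0a."
RESET_RE = re.compile(r"sas\.adapter\.reset:\w+\]: Resetting SAS adapter (?P<adapter>\w+)\.")
FATAL_EVENT = "cf.multidisk.fatalProblem"

SAMPLE_LOG = [
    "[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: Resetting SAS adapter 0a.",
    "[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: "
    "{'debug_string': 'PORT UP -- 0a', 'adapterName': '0a'}",
    "[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a "
    "multidisk error or other fatal error while waiting to be taken over. "
    "Permanent errors on all HA mailbox disks (while marshalling header).",
]


def resets_before_fatal(lines):
    """For each fatal mailbox event, list the adapters reset since the previous one."""
    pending_resets = []
    findings = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            pending_resets.append(match.group("adapter"))
        elif FATAL_EVENT in line and pending_resets:
            findings.append(list(pending_resets))
            pending_resets.clear()  # start fresh for any later fatal event
    return findings
```

With the sample lines, `resets_before_fatal(SAMPLE_LOG)` returns `[["0a"]]`. A fatal event with no preceding reset is not reported, so an empty result suggests the log does not match this article's signature.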

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.