What is a Super Logical Unit Reset (SLUR) and how to recover from a hung condition?
Applies to
- ONTAP 9.x
Answer
In clustered Data ONTAP, a Logical Unit Number (LUN) is a distributed object that spans one or more nodes in a cluster. A Super Logical Unit Reset (SLUR) is an internal LUN reset mechanism triggered by the clustered Data ONTAP SCSI target (SCSIT). A SLUR is initiated internally within ONTAP in the rare event that a prior distributed operation times out, and it re-initializes the LUN to a consistent state.
What is a SLUR?
- Super Logical Unit Reset (SLUR)
- Self-triggered within Data ONTAP
- Triggered by SCSIT when an inconsistency is detected, in order to re-initialize the LUN to a consistent state
- A distributed operation that can itself time out
What happens during a SLUR?
- No new members are allowed to join the LUN group
- Existing members can leave the LUN group
- Terminates all in-flight and new commands (until it completes)
- Logical Unit Cleanup
SLURs can be triggered for multiple reasons. The following EMS messages indicate each state and can be used to determine that a SLUR has occurred:
Start of SLUR:
Start of SLUR is denoted by scsiblade.lu.int.rst.start
EMS string
Wed May 27 2015 14:32:11 GMT [node-1: scsit_lu: scsiblade.lu.int.rst.start:DEBUG]: Internal reset started on LUN AvV7z?Cl-tME for reason: initiated by peer
End of SLUR:
End of SLUR is denoted by scsiblade.lu.int.loc.rst.end
EMS string
Wed May 27 2015 14:36:53 GMT [node-1: scsit_lu: scsiblade.lu.int.loc.rst.end:DEBUG]: Internal reset of LUN AvV7z?Cl-tME was completed on node node-1
SLUR Completion:
For a SLUR to be complete cluster-wide, it should be complete on all nodes. Cluster-wide SLUR completion is denoted by the scsiblade.lu.int.rst.end EMS string.
Wed May 27 2015 14:36:53 GMT [node-1: scsit_lu: scsiblade.lu.int.rst.end:DEBUG]: Internal reset of LUN AvV7z?Cl-tME was completed cluster-wide.
Stuck SLUR:
When a SLUR operation does not complete, the logical unit enters a hung state, denoted by the scsiblade.lu.int.rst.hung EMS string.
Wed May 27 2015 14:32:41 GMT [node-1: scsit_lu: scsiblade.lu.int.rst.hung:ALERT]: Access to LUN AvV7z?Cl-tME is restricted because an internal reset of the LUN was not completed in 30 seconds. Perform a takeover followed by a giveback for the following nodes: node-1
Each node in the cluster will emit an EMS message for the SLUR start. The message string contains a reason field. The node NOT performing the SLUR will report initiated by peer.
Example: [scsit_lu: scsiblade.lu.int.rst.start:debug]: Internal reset started on LUN D1dyo]E5t9p5 for reason: initiated by peer.
The node performing the SLUR, and the one that needs to be rebooted, will state one of several reasons.
Example: [scsit_lu: scsiblade.lu.int.rst.start:debug]: Internal reset started on LUN D1dyo]E5t9p0 for reason: PR OUT bb owner died.
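To determine which node is actually performing the SLUR, list the start events and check the reason text reported by each node; the node whose reason is anything other than initiated by peer is the one performing the reset. As a minimal sketch, the same event log show -messagename filter used later in this article can be narrowed to the start message (the narrower wildcard pattern is an assumption for illustration):
cluster1::> event log show -messagename *scsiblade.lu.int.rst.start*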
How to validate whether or not a hung SLUR exists:
You can check whether or not a SLUR is stuck from the command line with the following command. If the response is empty, then there are no hung SLURs:
cluster1::> event log show -messagename *scsiblade.lu*
There are no entries matching your query.
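If the cluster logs many scsiblade.lu events, the query can be narrowed to the hung message only. This is a sketch reusing the same -messagename wildcard filter shown above; an empty result means no hung SLUR has been reported:
cluster1::> event log show -messagename *scsiblade.lu.int.rst.hung*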
In the following example, you can see multiple messages for a single LUN. In some cases, a SLUR can complete later, after a hung event. If this is the case, and no LUN access issues are present, there is no need to perform a takeover/giveback (TO/GB).
cluster1::> event log show -messagename *scsiblade.lu*
Wed Jun 15 20:20:13 PDT [node-1: scsit_lu_1: scsiblade.lu.int.rst.start:debug]: Internal reset started on LUN AvV7z?Cl-tME for reason: tmr deadman timer expired.
Wed Jun 15 20:21:43 PDT [node-1: scsit_lu_0: scsiblade.lu.int.rst.hung:alert]: Access to LUN AvV7z?Cl-tME is restricted because an internal reset of the LUN was not
Wed Jun 15 20:22:39 PDT [node-1: scsit_lu_0: scsiblade.lu.int.loc.rst.end:debug]: Internal reset of LUN AvV7z?Cl-tME was completed on node node-1.
Wed Jun 15 20:22:39 PDT [node-1: scsit_lu_0: scsiblade.lu.int.rst.end:debug]: Internal reset of LUN AvV7z?Cl-tME was completed cluster-wide.
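To confirm whether a cluster-wide completion was logged after a hung event, the query can also be narrowed to the end message and its timestamp compared against the hung event. Again, this is a sketch reusing the -messagename filter from the examples above:
cluster1::> event log show -messagename *scsiblade.lu.int.rst.end*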
Recovering from a hung SLUR:
Note: Do not perform a takeover/giveback if a SLUR completion occurs after a hung event and no access issues are currently present with the LUN.
Note: Do not perform a takeover/giveback if it is known, or it is possible, that stuck groups also exist within the cluster. If it is unclear whether or not this might be the case, open a support case to validate whether or not it is safe to perform a takeover/giveback.
If a SLUR operation becomes unresponsive, an EMS message will indicate which node should be rebooted to clear the stuck SLUR. In all cases seen to date, a single-node reboot has been sufficient to clear a hung SLUR. The EMS message clearly states the remedial actions to recover from the hung SLUR condition. In the example stuck SLUR EMS above, the message identifies which node needs to be rebooted to complete the stuck SLUR operation.
Note: If a root cause analysis (RCA) is required, follow the rastrace data collection in KB: How to collect data for an RCA of a SAN event that occurred in the past before continuing with recovery. See Step 1 under the Data ONTAP section.
To resolve the issue, perform a takeover followed by a giveback of the node(s) listed in the scsiblade.lu.int.rst.hung:ALERT EMS event for the affected LUN. In the example above, perform a takeover/giveback of node-1.
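As an illustrative sketch only (after confirming the notes above, and opening a support case if there is any doubt about stuck groups), the takeover/giveback of node-1 in this example would typically use the standard storage failover commands. Verify that the HA pair is healthy before proceeding, and wait for the takeover to complete before issuing the giveback:
cluster1::> storage failover show
cluster1::> storage failover takeover -ofnode node-1
cluster1::> storage failover giveback -ofnode node-1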
Additional Information
N/A