Why is high write latency observed on SolidFire volumes accessed through the Fibre Channel (SFFC) target?
SolidFire Fibre Channel
What is the problem?
It has been observed that certain circumstances in the Fibre Channel switch fabric between FC initiators and SolidFire Fibre Channel nodes can cause a condition that results in higher than expected average write I/O latency.
Who is exposed?
For a volume to be exposed to this issue, ALL of the following conditions must be true:
- The volume must be accessed through the SolidFire Fibre Channel target. iSCSI targets are not impacted by this issue.
- The cluster must be running an Element OS release that predates the fix for this defect (see the root cause section below; the fix first shipped in Element 12.0).
Exposure to the issue does not confirm that the issue has been encountered; exposure is only a prerequisite condition.
What are the symptoms?
Exposed volumes may suddenly exhibit a significant decline in write performance and a sudden increase in average write latency. The latency increase is sustained and may become progressively worse until mitigation actions are taken.
Read latency is not impacted.
How can I recover write performance if I observe symptoms of this issue?
In order from least impactful to most impactful, the mitigation actions are:
- You may “flap” (link down/link up) switch ports connected to FC nodes one at a time across the entire cluster to mitigate the issue.
Example: in a two-node cluster, a total of 8 FC switch ports attached to FC nodes must be brought down and back up. Shut down each switch port, wait a few seconds, then bring it back online, repeating one port at a time until all 8 ports have been reset. Alternatively, you may unplug and re-plug the cables attached to the FC nodes, but this is generally harder to execute in practice.
- A NetApp support engineer may use the customer support tunnel in conjunction with an internal mitigation script to reset FC links from the node. Mitigating the issue in this way also provides NetApp support engineers additional logging that can be used to positively confirm that the root cause of the latency issue is in fact the issue discussed in this KB.
- Rebooting FC nodes one at a time will also mitigate the issue. If this option is selected, one must take special care to make sure that any rebooted node is fully online and operational again, and that host multipath drivers have reestablished connectivity to paths through the rebooted node BEFORE any subsequent node reboots.
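As an illustration only, on a Cisco MDS fabric the port flap described in the first mitigation above can be performed from the switch CLI. The interface name fc1/1 is a placeholder; repeat the sequence for each switch port attached to an FC node, one port at a time:

```
switch# configure terminal
switch(config)# interface fc1/1
switch(config-if)# shutdown
! wait a few seconds for the link to go down, then bring it back up
switch(config-if)# no shutdown
switch(config-if)# end
```

Verify that the port returns to an "up" state and that host multipath drivers have recovered paths through it before moving to the next port.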
Note: It is important to understand that once the issue has been mitigated for a cluster, that does not mean exposure to the issue has been eliminated. Symptoms may reoccur and additional mitigations will be necessary until the cluster is upgraded to an Element software version that is not impacted by this issue.
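If the node-reboot mitigation is used, one hedged way to verify path recovery from a Linux host before rebooting the next node is to count the paths that dm-multipath reports as "active ready running" and compare against the count recorded before the reboot. The device name mpatha and the sample output embedded below are illustrative placeholders; on a real host you would capture the state with `multipath -ll` directly:

```shell
#!/bin/sh
# Illustrative sketch: on a real host, replace the heredoc with
#   state=$(multipath -ll mpatha)
# "mpatha" and the sample output below are placeholders, not from a real system.
state=$(cat <<'EOF'
mpatha (36f47acc100000000303030303030303a) dm-0 SolidFir,SSD SAN
size=1.0T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1 sdb 8:16 active ready running
  `- 2:0:0:1 sdc 8:32 active ready running
EOF
)
# Count healthy paths; this should equal the count recorded before the
# reboot. If it is lower, wait before rebooting the next node.
echo "$state" | grep -c "active ready running"   # prints 2 for this sample
```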
What is the root cause of the issue?
It has been determined that the qla2xxx target driver version v10.00.00.01-k, shipped with the 4.14 Linux kernel on which the SolidFire FC implementation is based, has a defect that causes SCSI TASK ABORT commands to be completed incorrectly under some circumstances. These malformed completions cause a resource leak in the firmware of the QLogic QLE2672 Fibre Channel adapter. Over time, resources that the adapter requires to operate become scarce or exhausted, making allocation of resources to service SCSI WRITE commands inefficient or impossible. The result is high latency for some write commands and a significant impact on average write performance.
Is it possible to avoid this problem without upgrading to a patched Element release?
This defect has been eliminated in Element versions starting with Magnesium-12.0.
There is no way to guarantee that the issue will not recur without upgrading to an Element release that patches the improper abort completion bug. However, some sensible actions can reduce abort activity from initiators and thus delay the onset of the issue caused by leaked resources.
Ways to ensure minimal SCSI Task Aborts:
- Always make sure to follow ALL best practices recommended by NetApp support
- Disable ESXi/VMware “smartd” polling.
- Ensure that throughout the FibreChannel fabric, there are no incrementing signal or protocol errors. Any switchports which show symptoms should be diagnosed and corrected by replacing SFP’s and/or cables or deploying other mitigations that correct the incrementing error conditions.
- For Cisco switches, counters from both “show interface” and “show hardware internal fc-mac” should be considered. Cisco documents these counters here: Troubleshooting SAN Switching Issues.
- For Brocade switches, see Brocade SAN switch Porterrshow counters explanation.
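For example, on a Cisco MDS switch the counters mentioned above can be inspected per port. The interface and port numbers below are placeholders, and the exact arguments for the internal fc-mac command vary by platform and NX-OS release; consult Cisco's troubleshooting documentation for your model:

```
switch# show interface fc1/1
switch# show interface fc1/1 counters
switch# show hardware internal fc-mac port 1 statistics
```

Any counters that increment over time (CRC errors, invalid transmission words, link failures, signal losses) indicate a port that should be investigated.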