NetApp Knowledge Base

What is the difference between SP-Heartbeat and hwassist Keep Alive?

Applies to

  • AFF systems
  • ASA systems
  • FAS systems
  • BMC (Baseboard Management Controller)
  • SP (Service Processor)
  • ONTAP 9

Answer

SP-Heartbeat

  • The SP-Heartbeat is a mechanism that monitors the health and status of the SP/BMC in NetApp storage controllers.
  • This functionality is designed to verify the availability and responsiveness of the SP/BMC.
  • The SP/BMC plays a crucial role in managing and monitoring the hardware.
    Temperature and other environmental sensor data are captured within the SP/BMC and relayed to the ONTAP operating system.
  • The SP-Heartbeat uses dedicated “internal” connectivity between the local ONTAP and the local SP/BMC.
  • In the event that no heartbeat signal is received for a continuous 10-minute period, ONTAP will trigger a controlled shutdown of the local node. 
  • This shutdown is deliberate and serves to safeguard the NetApp controller from potential damage.
  • If such an event occurs, the local Node will initiate an “environmental” auto-shutdown and the HA-Partner Node will subsequently initiate an automatic takeover.

Example:

Sat Apr 19 14:52:02 +0200 [cluster-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT  MISSED

Sat Apr 19 15:00:00 +0200 [cluster-01: statd: kern_uptime_filer_1:notice]: params: {'msg': '  3:00pm up 358 days

Sat Apr 19 15:02:23 +0200 [cluster-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT  STOPPED

Sat Apr 19 15:04:47 +0200 [cluster-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes

Sat Apr 19 15:14:47 +0200 [cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)

Sat Apr 19 15:14:47 +0200 [cluster-01: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
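The timeline in such an excerpt can be checked with a short script. The following is a minimal Python sketch, not a NetApp tool: the regular expression is an assumption derived only from the line layout shown above, the function names `parse_line` and `seconds_between` are hypothetical, and `ASSUMED_YEAR` is a placeholder because the log lines carry no year.

```python
import re
from datetime import datetime

# Placeholder year (assumption): the log lines carry no year, and only
# differences between timestamps matter in this sketch.
ASSUMED_YEAR = 2024

# Layout assumed from the excerpt: "<day> <mon> <dd> <time> <tz> [node: source: event:severity]: message"
LINE_RE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:]+): (?P<source>[^:]+): (?P<event>[^:\]]+):(?P<severity>[^\]]+)\]: "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return (timestamp, event name, severity), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(f"{ASSUMED_YEAR} {m['stamp']}", "%Y %b %d %H:%M:%S %z")
    return when, m["event"], m["severity"]


def seconds_between(lines, first_event, second_event):
    """Seconds elapsed between the first occurrences of two event names."""
    times = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            when, event, _severity = parsed
            times.setdefault(event, when)  # keep the first occurrence only
    return (times[second_event] - times[first_event]).total_seconds()


SAMPLE = """\
Sat Apr 19 14:52:02 +0200 [cluster-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT  MISSED
Sat Apr 19 15:02:23 +0200 [cluster-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT  STOPPED
Sat Apr 19 15:04:47 +0200 [cluster-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes
Sat Apr 19 15:14:47 +0200 [cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
"""

if __name__ == "__main__":
    delay = seconds_between(
        SAMPLE.splitlines(), "sp.ipmi.lost.shutdown", "monitor.shutdown.emergency"
    )
    print(f"Shutdown followed the warning after {delay:.0f} seconds")
```

For the excerpt above, the gap between `sp.ipmi.lost.shutdown` and `monitor.shutdown.emergency` is 600 seconds, which matches the 10-minute countdown announced in the EMERGENCY message.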

 
 
hwassist Keep-Alive

  • The hwassist Keep-Alive (hardware-assisted Keep-Alive) is an external communication that runs over the customer's Ethernet network.
  • The communication is established between the cluster node's management port and the HA-Partner's SP/BMC.
  • Typically, this communication uses TCP port 4444 and serves as a vital mechanism for ensuring cluster availability.
  • In a high-availability configuration, a NetApp storage controller uses regular status checks to monitor the well-being of its HA-Partner.
  • If the hwassist takeover feature is not enabled and a failure arises on a NetApp storage controller, the HA-Partner Node will rely on mailbox disk communication only.
  • The HA-Partner Node confirms the missing responses and then initiates the takeover.
  • By default, initiating such a takeover might take up to 15 seconds after the failure has occurred.
  • The hwassist takeover feature enhances the process by using a Node's SP/BMC to detect failures and start the takeover process more quickly.
  • If the hwassist takeover feature is enabled, the failover detection time is less than a second.
  • The SP/BMC monitors the local system for various failures, such as Power Loss, Power Cycle, L2 Watchdog reset, POST Error, and Node shutdown.
  • If a failure is detected, the SP/BMC immediately sends an alert to the HA-Partner Node in the form of an SNMP trap.
  • Upon receiving the SNMP trap, the HA-Partner Node extracts the alert message from the trap and performs the appropriate action, such as initiating a takeover.
  • hwassist takeover is enabled by default on systems that use remote management (SP/BMC).
  • Users can view the current status and configuration of this feature with the following command: ::> storage failover hwassist show
  • If this feature is disabled, a takeover triggered by an unresponsive HA-Partner Node takes more time but still functions.
  • The hwassist Keep-Alive feature is not mandatory, but enabling it is highly recommended.
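The two detection paths described above can be summarized in a small model. This is an illustrative Python sketch only, not ONTAP logic: the names `Detection` and `detection_for` are hypothetical, the two bounds are simply the figures quoted in the list above, and the fallback to mailbox disk communication when an alert does not arrive is an assumption made for the sketch.

```python
from dataclasses import dataclass

# Upper bounds quoted in the text above; descriptive figures, not measurements.
HWASSIST_BOUND_S = 1.0   # "less than a second" with hwassist enabled
MAILBOX_BOUND_S = 15.0   # "up to 15 seconds" by default without it


@dataclass(frozen=True)
class Detection:
    path: str             # which mechanism the partner relies on
    upper_bound_s: float  # rough upper bound on failure detection time


def detection_for(hwassist_enabled: bool, alert_received: bool) -> Detection:
    """Hypothetical model of how the HA partner learns about a failure."""
    if hwassist_enabled and alert_received:
        # The failing node's SP/BMC alerted the partner directly.
        return Detection("hwassist alert from partner SP/BMC", HWASSIST_BOUND_S)
    # Feature disabled, or (assumption) the alert never arrived:
    # takeover still functions, but detection relies on the mailbox disks.
    return Detection("mailbox disk communication", MAILBOX_BOUND_S)


if __name__ == "__main__":
    for enabled in (True, False):
        result = detection_for(hwassist_enabled=enabled, alert_received=enabled)
        print(f"hwassist enabled={enabled}: {result.path}, within about {result.upper_bound_s:g} s")
```

The model makes the trade-off explicit: disabling hwassist does not prevent a takeover, it only moves failure detection onto the slower mailbox disk path.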


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.