What is the difference between SP-Heartbeat and hwassist Keep Alive?
Applies to
- AFF systems
- ASA systems
- FAS systems
- BMC (Baseboard Management Controller)
- SP (Service Processor)
- ONTAP 9
Answer
SP-Heartbeat
- The SP-Heartbeat is a mechanism used to monitor the health and status of the SP/BMC in NetApp storage controllers.
- This functionality is designed to verify the availability and responsiveness of the SP/BMC.
- The SP/BMC plays a crucial role in managing and monitoring the hardware. Temperature and other environmental sensor data are captured within the SP/BMC and relayed to the ONTAP operating system.
- The SP-Heartbeat uses dedicated “internal” connectivity between the local ONTAP and the local SP/BMC.
- In the event that no heartbeat signal is received for a continuous 10-minute period, ONTAP will trigger a controlled shutdown of the local node.
- This shutdown is deliberate and serves to safeguard the NetApp controller from potential damage.
- If such an event occurs, the local Node will initiate an “environmental” auto-shutdown and the HA-Partner Node will subsequently initiate an automatic takeover.
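The timeout rule described above can be pictured as a simple watchdog. The following is a minimal sketch only, not ONTAP code: the class name `HeartbeatWatchdog`, its methods, and the use of plain seconds as timestamps are all hypothetical, and the only value taken from the text is the 10-minute period.

```python
# Illustrative model of the 10-minute SP-Heartbeat timeout (not ONTAP code).
HEARTBEAT_TIMEOUT_S = 10 * 60  # "continuous 10-minute period" from the text


class HeartbeatWatchdog:
    """Tracks the last heartbeat and reports when the timeout has elapsed."""

    def __init__(self, now, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.last_beat = now
        self.timeout_s = timeout_s

    def beat(self, now):
        # A heartbeat arrived: restart the timeout window.
        self.last_beat = now

    def expired(self, now):
        # True once no heartbeat has been seen for the full timeout.
        return now - self.last_beat >= self.timeout_s
```

In this model, a caller would start the controlled shutdown once `expired()` returns true; any heartbeat received before that point resets the window.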
Example:
Sat Apr 19 14:52:02 +0200 [cluster-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
Sat Apr 19 15:00:00 +0200 [cluster-01: statd: kern_uptime_filer_1:notice]: params: {'msg': ' 3:00pm up 358 days
Sat Apr 19 15:02:23 +0200 [cluster-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
Sat Apr 19 15:04:47 +0200 [cluster-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes
Sat Apr 19 15:14:47 +0200 [cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
Sat Apr 19 15:14:47 +0200 [cluster-01: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
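When reviewing such a sequence, it can help to extract the event names and measure the time between them. The sketch below is one possible way to do this in Python, assuming the line layout shown in the example above. The function names and the regular expression are hypothetical, and because the log lines carry no year, an arbitrary placeholder year is used, so only time differences are meaningful.

```python
import re
from datetime import datetime

# Assumed layout: "<weekday> <month> <day> <time> <tz> [node: source: event:severity]: message"
EMS_LINE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:]+): (?P<source>[^:]+): (?P<event>[^:]+):(?P<severity>[^\]]+)\]: "
    r"(?P<message>.*)$"
)
PLACEHOLDER_YEAR = 2000  # assumption: the log omits the year


def parse_ems_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(
        f"{PLACEHOLDER_YEAR} {m['stamp']}", "%Y %b %d %H:%M:%S %z"
    )
    return {"time": when, "node": m["node"], "event": m["event"],
            "severity": m["severity"].lower(), "message": m["message"]}


def minutes_between(lines, first_event, second_event):
    """Minutes from the first occurrence of one event to that of another."""
    first_seen = {}
    for line in lines:
        record = parse_ems_line(line)
        if record is not None:
            first_seen.setdefault(record["event"], record["time"])
    delta = first_seen[second_event] - first_seen[first_event]
    return delta.total_seconds() / 60


SAMPLE = [
    "Sat Apr 19 14:52:02 +0200 [cluster-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED",
    "Sat Apr 19 15:02:23 +0200 [cluster-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED",
    "Sat Apr 19 15:04:47 +0200 [cluster-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes",
    "Sat Apr 19 15:14:47 +0200 [cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)",
]

print(minutes_between(SAMPLE, "sp.ipmi.lost.shutdown", "monitor.shutdown.emergency"))
```

For the sample lines, the interval between `sp.ipmi.lost.shutdown` and `monitor.shutdown.emergency` is 10 minutes, which matches the delay announced in the EMERGENCY message.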
hwassist Keep Alive
- The hwassist Keep-Alive (Hardware-assisted Keep-Alive) is an external communication over the customer's Ethernet network.
- The communication is established between the cluster node's Mgmt-Port and the HA-Partner's SP/BMC.
- Typically, this communication utilizes TCP Port 4444 and serves as a vital mechanism for ensuring cluster availability.
- In a high-availability configuration, a NetApp storage controller uses regular status checks to monitor the well-being of its HA-Partner.
- If the hwassist takeover feature is not enabled and a failure arises on a NetApp storage controller, the HA-Partner Node will rely on mailbox disk communication only.
- The HA-Partner Node confirms the missing responses and then initiates the takeover.
- By default, initiating such a takeover might take up to 15 seconds after the failure has occurred.
- The hwassist takeover feature enhances the process by using a Node’s SP/BMC to detect failures and start a takeover process more quickly.
- If the hwassist takeover feature is enabled, the failover detection time is less than a second.
- The SP/BMC monitors the local system for various failures, such as power loss, power cycle, L2 watchdog reset, POST error, and node shutdown.
- If a failure is detected, the SP/BMC immediately sends an alert to the HA-Partner Node in the form of an SNMP trap.
- Upon receiving the SNMP trap, the HA-Partner extracts the alert message from the SNMP trap message and performs the appropriate action, such as initiating a takeover.
- hwassist takeover is enabled by default on systems that use remote management (SP/BMC).
- Users can view the current status and configuration of this feature with the following command:
::> storage failover hwassist show
- If this feature is disabled, a takeover due to an HA-Partner Node's unresponsiveness will take more time, but will still function.
- Enabling the hwassist Keep-Alive feature is not mandatory, but it is highly recommended.
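The difference between the two detection paths can be summarized in a small model. This is an illustrative sketch only, not how ONTAP implements the feature: the function name `detection_path` and the alert identifiers are hypothetical labels for the failure types listed above, and the two time limits are the approximate figures quoted in the text.

```python
from typing import Optional, Tuple

# Hypothetical labels for the failure types named in the text (not ONTAP identifiers).
FAILURE_ALERTS = {
    "power_loss", "power_cycle", "l2_watchdog_reset", "post_error", "node_shutdown",
}

MAILBOX_DETECTION_MAX_S = 15.0   # "up to 15 seconds" without hwassist
HWASSIST_DETECTION_MAX_S = 1.0   # "less than a second" with hwassist


def detection_path(hwassist_enabled: bool, alert: Optional[str]) -> Tuple[str, float]:
    """Return (path, upper bound in seconds) for noticing a partner failure."""
    if hwassist_enabled and alert in FAILURE_ALERTS:
        # The partner's SP/BMC reported the failure directly.
        return ("hwassist", HWASSIST_DETECTION_MAX_S)
    # No alert, or the feature is disabled: fall back to mailbox disk communication.
    return ("mailbox", MAILBOX_DETECTION_MAX_S)
```

In both branches a takeover still happens; the model only shows which path notices the failure and roughly how quickly.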
Additional Information
