What are the cf_hwassist_missedKeepAlive timeout and tolerance period?
Applies to
- NetApp AFF and FAS systems
- ONTAP 9
Answer
A cf_hwassist_missedKeepAlive event is recorded in EMS 60 seconds after a hw-assist packet was sent but not received.
The hw-assist packets are sent over UDP every 180 seconds. Because UDP has no retransmission, a packet that is dropped, blocked, wedged, redirected, or otherwise lost is not resent. The node that did not receive it simply waits until the next packet is sent, 180 seconds after the previous one.
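So if a cf_hwassist_recvKeepAlive event appears within 120 seconds (180 minus 60) after a cf_hwassist_missedKeepAlive event, the missed keep-alive can be safely ignored. In the following example, the keep-alive is received 90 seconds after the missed event: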
Sat Nov 04 22:07:44 +0900 [Nodename-02: cf_hwassist: cf.hwassist.missedKeepAlive:debug]: HW-assisted takeover missing keep-alive messages from HA partner (Nodename-01).
Sat Nov 04 22:09:14 +0900 [Nodename-02: cf_hwassist: cf.hwassist.recvKeepAlive:debug]: hw_assist: Received hw_assist KeepAlive alert from partner(Nodename-01).
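The 120-second check described above can be automated against exported EMS lines. The following is a minimal Python sketch, not a NetApp tool: the function names, the regular expression, and the `year` parameter are assumptions made for illustration, and the pattern is written only against the two sample lines shown above. Because the log lines carry no year, the caller has to supply one.

```python
import re
from datetime import datetime

# Tolerance from the article: 180 s send interval minus the 60 s wait
# before the missed event is logged.
TOLERANCE_SECONDS = 120

# Matches EMS lines shaped like the two sample lines above (assumed format).
LINE_RE = re.compile(
    r"^\w{3} (?P<ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:]+): [^:]+: "
    r"cf\.hwassist\.(?P<event>missedKeepAlive|recvKeepAlive):"
)

SAMPLE = [
    "Sat Nov 04 22:07:44 +0900 [Nodename-02: cf_hwassist: "
    "cf.hwassist.missedKeepAlive:debug]: HW-assisted takeover missing "
    "keep-alive messages from HA partner (Nodename-01).",
    "Sat Nov 04 22:09:14 +0900 [Nodename-02: cf_hwassist: "
    "cf.hwassist.recvKeepAlive:debug]: hw_assist: Received hw_assist "
    "KeepAlive alert from partner(Nodename-01).",
]


def parse_line(line, year):
    """Return (node, event, timestamp), or None for unrelated lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The log line has no year, so the caller supplies one (assumption).
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S %z")
    return m["node"], m["event"], ts


def classify_missed_keepalives(lines, year=2023):
    """Pair each missed event with the next received event on the same node.

    Returns a list of (node, gap_seconds, ignorable). gap_seconds is None
    when no recvKeepAlive followed the missed event.
    """
    pending = {}  # node -> timestamp of the unresolved missed event
    results = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        node, event, ts = parsed
        if event == "missedKeepAlive":
            if node in pending:
                # A second miss arrived before any recovery was logged.
                results.append((node, None, False))
            pending[node] = ts
        elif node in pending:
            gap = (ts - pending.pop(node)).total_seconds()
            results.append((node, gap, gap <= TOLERANCE_SECONDS))
    # Missed events that never saw a recvKeepAlive stay non-ignorable.
    results.extend((node, None, False) for node in pending)
    return results


if __name__ == "__main__":
    print(classify_missed_keepalives(SAMPLE))
```

For the two sample lines, the gap is 90 seconds, so the missed event is classified as ignorable. A missed event with no later recvKeepAlive on the same node is reported with a gap of None and should be investigated.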
Additional Information
- Cause of cf_hwassist_missedKeepAlive: hw-assist is configured with an IP address and port (default 4444) on e0M, and its traffic passes through the customer network environment. Nearly every instance of this type of failure is therefore due to packets dropped in the network.
- Check the hwassist-health-check-interval with the following command:
aff200-2n-dal-1::> storage failover show -fields hwassist,hwassist-partner-ip,hwassist-partner-port,hwassist-health-check-interval,hwassist-retry-count,hwassist-status
node          hwassist hwassist-partner-ip hwassist-partner-port hwassist-health-check-interval hwassist-retry-count hwassist-status
------------- -------- ------------------- --------------------- ------------------------------ -------------------- ---------------
aff200-dal-1a true     10.128.227.184      4444                  180                            2                    active
aff200-dal-1b true     10.128.227.183      4444                  180                            2                    active
2 entries were displayed.
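If the command output is captured as text, the interval and status can be checked per node with a short script. The following Python sketch is illustrative only: `parse_failover_fields` and `tolerance_seconds` are hypothetical helper names, and the parsing assumes whitespace-separated columns exactly as in the sample output above. Deriving the tolerance as the interval minus 60 seconds generalizes the 180 and 120 second figures in this article and is an assumption for intervals other than 180.

```python
SAMPLE_OUTPUT = """\
node hwassist hwassist-partner-ip hwassist-partner-port hwassist-health-check-interval hwassist-retry-count hwassist-status
------------- -------- ------------------- --------------------- ------------------------------ -------------------- ---------------
aff200-dal-1a true 10.128.227.184 4444 180 2 active
aff200-dal-1b true 10.128.227.183 4444 180 2 active
2 entries were displayed.
"""

# Delay before the missed event is logged, as stated in this article.
MISSED_EVENT_DELAY = 60


def parse_failover_fields(output):
    """Turn the table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "node":
            header = parts
        elif header is None or parts[0].startswith("---"):
            continue  # prompt line or separator row
        elif len(parts) == len(header):
            rows.append(dict(zip(header, parts)))
        # Anything else (e.g. "2 entries were displayed.") is skipped.
    return rows


def tolerance_seconds(row):
    """Assumed tolerance: health-check interval minus the 60 s delay."""
    return int(row["hwassist-health-check-interval"]) - MISSED_EVENT_DELAY


if __name__ == "__main__":
    for row in parse_failover_fields(SAMPLE_OUTPUT):
        print(row["node"], row["hwassist-status"], tolerance_seconds(row))
```

With the sample output, both nodes report an interval of 180 seconds and a status of active, which gives the 120-second tolerance used in the Answer section.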