NetApp Knowledge Base

Upgrade was paused due to Node not in connected state after giveback

Category:
ontap-9
Specialty:
san
Applies to

  • ONTAP 9
  • ANDU
  • FC Protocol

Issue

  • Upgrade was paused because Node2 was not in the "connected" state after giveback.
  • This was an ANDU upgrade from ONTAP 9.9.1P15 to 9.13.1P9.
  • The first node to complete the upgrade and boot back online was Node2.
Thu Aug 22 18:27:42 +1000 [Node1: cf_giveback: cf.fm.givebackComplete:notice]: Failover monitor: giveback completed (Giveback of node 622)
Thu Aug 22 18:27:44 +1000 [Node2: cf_fastTimeout: cf.fm.localFwTransition:debug]: params: {'prevstate': 'SF_MBWAIT', 'newstate': 'SF_CLUSTERWAIT', 'progresscounter': '0'}
 
  • Once the CFO giveback was completed, the expectation is for the node to stabilize within 15 minutes.
  • However, the node failed to stabilize within the 15-minute timer, which paused the upgrade and required manual intervention to verify the status.
  • The logs show that the node remained in the "SF_CLUSTERWAIT" state for more than 15 minutes:
 
Thu Aug 22 18:53:26 +1000 [Node2: cf_fastTimeout: cf.fm.localFwTransition:debug]: params: {'prevstate': 'SF_CLUSTERWAIT', 'newstate': 'SF_UP', 'progresscounter': '0'}
 
  • Hence, as soon as the 15-minute timer expired, the upgrade was paused, stating that Node2 is not in the "connected" state, as seen below:
Thu Aug 22 18:42:39 +1000 [Node1: upgrademgr: upgrademgr.update.pausedErr:error]: The automated update of the cluster has been paused due to the following reason:  Node "Node1": Error: {Node "Node2" is not in "connected" state after giveback.}, Action: {Use the "storage failover show" command to verify that node "Node2" is in the "connected" state.}.
Thu Aug 22 18:42:39 +1000 [Node1: notifyd: callhome.andu.pausederr:alert]: params: {'subject': 'AUTOMATED NDU PAUSED ON NODE: Node1', 'epoch': 'c364126a-55b4-4545-aaff-d326556e1961'}
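As suggested by the corrective action in the pause message, the failover and giveback state can be reviewed from the clustershell while the upgrade is paused. A minimal check, using standard ONTAP 9 commands and the node names from this article:

Cluster::> storage failover show
Cluster::> storage failover show-giveback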
 
  • The same can be seen from the CLI session output as well
Cluster::> cluster image show-update-progress
 
                                             Estimated         Elapsed
Update Phase         Status                   Duration        Duration
-------------------- ----------------- --------------- ---------------
Pre-update checks    completed                00:10:00        00:01:20
ONTAP updates        paused-on-error          03:36:00        01:18:37
 
Details:
 
Node name            Status            Status Description
-------------------- ----------------- --------------------------------------
Node1       waiting
Node2       failed            Error: Node "Node2" is not
                                       in "connected" state after giveback.
                                       Action: Use the "storage failover
                                       show" command to verify that node
                                       "Node2" is in the
                                       "connected" state.
 
 
 
Status: Paused - An error occurred in "ONTAP updates" phase. The update cannot continue until the error has been resolved. Resolve all issues, then use the "cluster image resume-update" command to resume the update.
 
  • Because the ANDU was paused after the 15-minute timer expired and the ONTAP versions differed within the HA pair, takeover was disabled due to the version mismatch and manual intervention was required.
 
Cluster:*> storage failover show
                                                 Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
Node1          Node2          false    Connected to Node2,
                                       Takeover is not possible: The
                                       version of software running on each
                                       node of the SFO pair is incompatible,
                                       NVRAM log not synchronized
Node2          Node1          false    Connected to Node1,
                                       Takeover is not possible: The
                                       version of software running on each
                                       node of the SFO pair is incompatible,
                                       NVRAM log not synchronized
 
  • At this point, Node1 was taken over by Node2 with bypass set to true to clear the version mismatch issue (a hedged command sketch follows the takeover event below).
Thu Aug 22 19:13:37 +1000 [Node2: cf_takeover: cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed
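A hedged sketch of the manual recovery step described above, assuming the node names used in this article. The exact way to bypass the version check can vary by ONTAP release; the allow-version-mismatch takeover option is shown here as one form of that bypass:

Cluster::> storage failover takeover -ofnode Node1 -option allow-version-mismatch
Cluster::> storage failover show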
 
  • Node1 then booted with the 9.13.1 image, after which the upgrade was resumed and completed smoothly on the remaining nodes.
 
00000019.00006cb4 172c9fc3 Thu Aug 22 2024 19:34:03 +10:00 [kern_audit:info:3110] 8003e80000000813:8003e800000008d8 :: undcdsa-fas600:ssh :: 10.140.6.65:48572 :: cluster:admin :: cluster image resume-update :: Pending
00000019.00006cbe 172c9fc3 Thu Aug 22 2024 19:34:03 +10:00 [kern_audit:info:3110] 8003e80000000813:8003e800000008d8 :: undcdsa-fas600:ssh :: 10.140.6.65:48572 :: cluster:admin :: cluster image resume-update :: Success
 
  • The SCSI blade took more than 15 minutes to come up, which is why the upgrade was paused and manual intervention was needed during the ANDU.
  • SLUR, CTRAN, and SCSI blade proposal-stuck events were reported in the EMS log.
  • Proposals were retired because no response was received from the GC (Group Coordinator), and vifmgr reported large packet loss involving cluster LIF Node2_clus2 (a connectivity check is sketched after the event excerpts below):

 

Thu Aug 22 18:38:53 +1000 [Node2: ctran_gm_1: ctran.gm.proposal.retired:debug]: Cluster group member (name=b200b270-21b6-11e8-be38-00a098b5d127, id=200) has retired proposal trans_id=2033390325#1.
Thu Aug 22 18:39:05 +1000 [Node2: vifmgr: vifmgr.cluscheck.droppedlarge:alert]: Partial packet loss when pinging from cluster lif Node2_clus2 (node Node2) to cluster lif Node2_clus1 (node Node2).
Thu Aug 22 18:39:18 +1000 [Node2: vifmgr: vifmgr.cluscheck.droppedlarge:alert]: Partial packet loss when pinging from cluster lif Node2_clus2 (node Node2) to cluster lif Node2_clus2 (node Node2).
Thu Aug 22 18:39:31 +1000 [Node2: vifmgr: vifmgr.cluscheck.droppedlarge:alert]: Partial packet loss when pinging from cluster lif Node2_clus2 (node Node2) to cluster lif Node1_clus1 (node Node1).
Thu Aug 22 18:39:32 +1000 [Node2: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module ICS_SCSIT instance 0 was stored in /etc/log/rastrace/ICS_SCSIT_0_20240822_18:39:32:772626.dmp.
Thu Aug 22 18:39:44 +1000 [Node2: vifmgr: vifmgr.cluscheck.droppedlarge:alert]: Partial packet loss when pinging from cluster lif Node2_clus2 (node Node2) to cluster lif Node1_clus2 (node Node1).
Thu Aug 22 18:39:46 +1000 [Node2: scsit_lu_0: scsiblade.lu.resync.start:notice]: Resynchronization of LUN 80BTh?Lxxxxx has started. The LUN is not accessible on LIFs on this node until resynchronization is complete and the LUN recovers automatically.
Thu Aug 22 18:39:52 +1000 [Node2: ctran_gm_2: ctran.gm.proposal.retired:debug]: Cluster group member (name=24d4f96b-6a06-11e8-a1a3-00a098c9f6ed, id=113) has retired proposal trans_id=2033390386#1.
Thu Aug 22 18:39:52 +1000 [Node2: ctran_gm_2: ctran.gm.proposal.retired:debug]: Cluster group member (name=000000070000000080325f2000000007, id=127) has retired proposal trans_id=2033390391#0.
Thu Aug 22 18:39:55 +1000 [Node2: vifmgr: vifmgr.cluscheck.droppedlarge:alert]: Partial packet loss when pinging from cluster lif Node2_clus2 (node Node2) to cluster lif Node2_clus1 (node Node2).
Thu Aug 22 18:39:57 +1000 [Node2: scsit_lu_0: scsiblade.lu.resync.timeout:alert]: Resynchronization of LUN 80BTh?Lvxxxx on node Node2 was not completed in 90 seconds. LUN is not accessible on LIFs on this node. Perform a takeover followed by a giveback of this node.
Thu Aug 22 18:39:57 +1000 [Node2: scsit_lu_0: callhome.sblade.lu.resync.to:EMERGENCY]: Call home for SCSIBLADE LU RESYNC TIMEOUT
Thu Aug 22 19:10:42 +1000 [Node2: scsit_lu_0: scsiblade.lu.int.rst.start:debug]: Internal reset started on LUN 80BTh?Lvxxxx for reason: metadata read in progress, internal or PR IN/OUT metadata operation timed out.
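Cluster LIF reachability of the kind reported by vifmgr above can be verified from advanced privilege. A minimal sketch, assuming Node2 is the node under investigation:

Cluster::> set -privilege advanced
Cluster::*> cluster ping-cluster -node Node2
Cluster::*> set -privilege admin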

 

 

  • After the takeover of Node1, all the SLUR events stopped, and the upgrade completed successfully on Node1 and Node2.
  • The LIF Node2_clus2 is hosted on port e8a of Node2, where a large number of CRC errors were reported within 12 minutes of the node coming back online after the upgrade (a way to collect these counters is sketched after the statistics below).

-- interface  e8a  (0 hours, 12 minutes, 8 seconds) --
RECEIVE
Total frames:      150k | Frames/second:     207  | Total bytes:     16459k
Bytes/second:    22610  | Total errors:    27703  | Errors/minute:    2283
 Total discards:      0  | Discards/minute:     0  | Multi/broadcast:   284
 Non-primary u/c:     0  | CRC errors:      26226  | Runt frames:         0
 Fragment:            0  | Long frames:         0  | Jabber:              0
 Length errors:       0  | No buffer:           0  | Xon:                 0
 Xoff:                0  | Pause:               0  | Jumbo:               8
 Noproto:             0  | Error symbol:        0  | Illegal symbol:   1477
 Bus overruns:        0  | Queue drops:         0  | LRO segments:      119k
LRO bytes:       12857k | LRO6 segments:       0  | LRO6 bytes:          0
 Bad UDP cksum:       0  | Bad UDP6 cksum:      0  | Bad TCP cksum:       0
 Bad TCP6 cksum:      0  | Mcast v6 solicit:    0  | Lagg errors:         0
 Lacp errors:         0  | Lacp PDU errors:     0
TRANSMIT
Total frames:      134k | Frames/second:     185  | Total bytes:     23398k
Bytes/second:    32140  | Total errors:        0  | Errors/minute:       0
 Total discards:      0  | Queue overflow:      0  | Multi/broadcast:    87
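The counters above come from the nodeshell ifstat output for port e8a. A hedged example of collecting the statistics again, and zeroing them to confirm whether the CRC errors are still incrementing, assuming Node2 hosts the port:

Cluster::> node run -node Node2 ifstat e8a
Cluster::> node run -node Node2 ifstat -z e8a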

 

  • 'hm.alert.raised:alert' and 'vifmgr.cluscheck.hwerrors:alert' EMS events were reported as well, indicating that the port had become degraded due to hardware errors.

Thu Aug 22 18:40:25 +1000 [NetApp: vifmgr: vifmgr.cluscheck.hwerrors:alert]: Port e8a on node NetApp is reporting a high number (at least 1 per 1000 packets) of observed hardware errors (CRC, length, alignment, dropped).

Thu Aug 22 18:37:38 +1000 [NetApp: nphmd: hm.alert.raised:alert]: Alert Id = NodeIfInErrorsWarnAlert , Alerting Resource = NetApp/e8a raised by monitor controller
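A hedged way to review these alerts and the current health of the port from the clustershell, assuming the message names and port reported above:

Cluster::> event log show -message-name vifmgr.cluscheck.hwerrors
Cluster::> event log show -message-name hm.alert.raised
Cluster::> network port show -node Node2 -port e8a -fields health-status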

 

  • CTRAN uses cluster ports to communicate with all the nodes in the cluster.
  • If there are any issues on those ports, CTRAN is also affected, which is what was seen here (a ring-health check is sketched below).
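Because CTRAN underpins the cluster's replicated database rings, instability on the cluster ports typically shows up in ring status as well. A minimal sketch at advanced privilege:

Cluster::> set -privilege advanced
Cluster::*> cluster ring show
Cluster::*> set -privilege admin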

 

  • On the VMware side, an APD (All Paths Down) issue was reported.
  • VMs rebooted, and the ESXi host generated the error messages below, after which the host went into an unmanaged/frozen state (an ATS lock-mode check is sketched after the messages).

2024-08-23T01:01:41.184Z cpu20:2099778)WARNING: FSAts: 1593: Denying reservation access on an ATS-only vol 'FAS_PRD'
2024-08-23T01:01:41.184Z cpu20:2099778)WARNING: HBX: 2420: ATS-Only VMFS volume 'FAS_PRD' is not mounted. This host does not support ATS, or ATS initialization failed.
2024-08-23T01:01:41.184Z cpu20:2099778)WARNING: HBX: 2440: Failed to initialize VMFS distributed locking on volume 5f14c52d-bffaf06d-6cb3-0a8c00001xxx: Not supported
2024-08-23T01:01:41.184Z cpu20:2099778)WARNING: Fil3: 1539: Failed to reserve volume xxx 28 1 5f14c52d bffaf06d 8c0a6cb3 7b150000 0 0 0 0 0 0 0
2024-08-23T01:01:41.205Z cpu0:2099778)WARNING: FSAts: 1593: Denying reservation access on an ATS-only vol 'FAS_PRD'
2024-08-23T01:01:41.205Z cpu0:2099778)WARNING: HBX: 2420: ATS-Only VMFS volume 'FAS_PRD' is not mounted. This host does not support ATS, or ATS initialization failed.
2024-08-23T01:01:41.244Z cpu0:2099778)WARNING: HBX: 2440: Failed to initialize VMFS distributed locking on volume 5f14c52d-bffaf06d-6cb3-0a8c00001xxx: Not supported
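On the ESXi host, the ATS-only locking state of the affected datastore can be confirmed from the host shell. A hedged example, assuming the datastore name FAS_PRD from the messages above:

esxcli storage vmfs lockmode list
vmkfstools -Ph /vmfs/volumes/FAS_PRD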
 
 

