Brocade switch reboots or hangs with no reason listed
Applies to
Issue
- Brocade switch reboots and lists no reason in the output of `errdump -a`.
2021/01/13-22:27:54, [TS-1008], 166442, FID 128, WARNING, NetApp01, 10.237.4.57 Clock Server used instead of 10.237.4.58.
2021/01/14-10:04:27, [HAM-1004], 166443, CHASSIS, INFO, NetApp01_6510, Processor rebooted - Reset.
2021/01/14-10:04:41, [FV-1001], 166444, CHASSIS, INFO, NetApp01, Flow Vision daemon initialized.
2021/01/14-10:05:09, [FCR-1069], 166445, FID 128, INFO, NetApp01, The FC Routing service is enabled.
2021/11/18-11:30:24:252641, [EM-1034], 130201/128550, CHASSIS, ERROR, Brocade6510, PS 2 set to faulty, rc=2000e., OID:0x44200000, power.c, line: 759, comp:emd, ltime:2021/11/18-11:30:24:244273
2021/11/18-11:30:24:252850, [MAPS-1003], 130202/128551, FID 128, WARNING, CDCSTSPRD1-A, Power Supply 2, Condition=ALL_PS(PS_STATE==FAULTY), Current Value:[PS_STATE, FAULTY], RuleName=defALL_PSPS_STATE_FAULTY, Dashboard Category=Fru Health., raslogAction.c, line: 101, comp:md, ltime:2021/11/18-11:30:24:251077
2021/11/18-11:30:37:538708, [MAPS-1021], 130203/128552, FID 128, WARNING, CDCSTSPRD1-A, RuleName=defCHASSISBAD_PWR_CRIT, Condition=CHASSIS(BAD_PWR>=1), Obj:Chassis [ BAD_PWR,1] has contributed to switch status CRITICAL., switchStatusPol, line: 233, comp:md, ltime:2021/11/18-11:30:37:536626
2021/11/18-11:30:37:538962, [MAPS-1020], 130204/128553, FID 128, WARNING, CDCSTSPRD1-A, Switch wide status has changed from HEALTHY to CRITICAL., switchStatusPol, line: 273, comp:md, ltime:2021/11/18-11:30:37:537818
2021/11/18-11:30:38:013143, [EM-1037], 130205/128554, CHASSIS, INFO, Brocade6510, PS 2 is no longer faulted., OID:0x44200000, poll.c, line: 1212, comp:emd, ltime:2021/11/18-11:30:38:012512
2021/11/18-11:30:39:564773, [LOG-1000], 130207/128556, CHASSIS, INFO, Brocade6510, Previous message repeated 2 time(s)., OID:0x44200000, poll.c, line: 1212, comp:emd, ltime:2021/11/18-11:30:39:032329
2021/11/18-11:30:39:564992, [MAPS-1003], 130208/128557, FID 128, WARNING, CDCSTSPRD1-A, Power Supply 2, Condition=ALL_PS(PS_STATE==ON), Current Value:[PS_STATE, ON], RuleName=defALL_PSPS_STATE_ON, Dashboard Category=Fru Health., raslogAction.c, line: 101, comp:md, ltime:2021/11/18-11:30:39:563035
- The switch `errdump` log shows that the reboot was caused by a power failure.
- The log shows the various slots being powered up, which indicates that the unit lost power completely.
2023/02/23-01:41:31:690735, [HAM-1004], 1443417/146856, SLOT 1 | CHASSIS, INFO, switch, Processor rebooted - Reset., reboot.c, line: 117, comp:hamd, ltime:2023/02/23-01:41:22:642968
2023/02/23-01:41:34:983319, [PLAT-5093], 1443441/0, SLOT 1 | CHASSIS, INFO, switch, Core Blade Power fail Slot 7 pwr_fail 0x30 pwr_fail_msk 0x33 , modular_ctrl.c, line: 3132, comp:emd, ltime:2023/02/23-01:41:34:983148
2023/02/23-01:41:35:162737, [PLAT-5093], 1443442/0, SLOT 1 | CHASSIS, INFO, switch, Core Blade Power fail Slot 8 pwr_fail 0x30 pwr_fail_msk 0x33 , modular_ctrl.c, line: 3132, comp:emd, ltime:2023/02/23-01:41:35:162583
2023/02/23-01:41:36:170508, [PLAT-8056], 1443443/0, SLOT 1 | CHASSIS, INFO, switch, PS Current Status: 0x2/0x2/0x2/0x2, unit = 1, allegiance.c, line: 1552, comp:emd, ltime:2023/02/23-01:41:36:170402
- No error messages were logged before the PSUs went down; the units simply halted, and only the reboot messages appear as they come back online.
- There are no core files pointing to a software problem, nor any other clues pointing to a hardware issue on either unit. The events happened at roughly the same time on both units, which suggests a power event.
- Both units resumed power at close to the same time, which indicates that the issue was caused by an external event such as a power loss.
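
The checks described above (no alerts logged before the reboot, and reboots on both units at nearly the same time) can be sketched in Python against saved `errdump -a` output. This is a minimal sketch, not a supported tool: the function names, the 10-minute look-back window, and the 5-minute correlation window are assumptions chosen for illustration, and the field positions are inferred only from the log excerpts shown in this article.

```python
import re
from datetime import datetime, timedelta

# Accepts both timestamp styles seen in the excerpts:
#   2021/01/14-10:04:27  and  2023/02/23-01:41:31:690735
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2})(?::\d+)?$")


def parse_errdump(text):
    """Parse saved errdump text into a list of dicts (assumed field order:
    timestamp, message ID, sequence, scope, severity, switch name, message)."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",", 6)]
        if len(fields) < 7:
            continue  # not a RASlog entry
        m = TS_RE.match(fields[0])
        if not m:
            continue  # unparseable timestamp, skip the line
        entries.append({
            "time": datetime.strptime(m.group(1), "%Y/%m/%d-%H:%M:%S"),
            "msg_id": fields[1].strip("[]"),
            "severity": fields[4],
            "text": fields[6],
        })
    return entries


def reboot_times(entries):
    """Timestamps of every HAM-1004 (Processor rebooted) entry."""
    return [e["time"] for e in entries if e["msg_id"] == "HAM-1004"]


def alerts_before(entries, reboot, window=timedelta(minutes=10)):
    """WARNING/ERROR/CRITICAL entries logged within `window` before `reboot`.
    The window length is an arbitrary assumption."""
    return [
        e for e in entries
        if e["severity"] in {"WARNING", "ERROR", "CRITICAL"}
        and timedelta(0) <= reboot - e["time"] <= window
    ]


def rebooted_together(times_a, times_b, window=timedelta(minutes=5)):
    """Pairs of reboot times from two switches that fall within `window`
    of each other. The window length is an arbitrary assumption."""
    return [(a, b) for a in times_a for b in times_b if abs(a - b) <= window]
```

Running `parse_errdump` on the saved output of each switch, then passing each reboot time to `alerts_before`, shows whether anything was logged in the lead-up to the reboot. An empty result on both units, combined with a non-empty result from `rebooted_together`, is consistent with an external power loss, but it does not prove it; the power feed and PDU should still be checked.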