Cisco UCS

[UCSM] FI6454, IOM2408 thermal problem, fan equipment inoperable

by 네오마드 2024. 2. 29.
Symptoms

 

Error messages


Severity: Critical
Code: F0411
ID: 2405504
Status: None
Description: Thermal condition on chassis 13 is upper-non-recoverable
Affected Object: sys/chassis-13
Name: Equipment Chassis Thermal Threshold Non Recoverable
Cause: Thermal Problem
Type: Environmental
Acknowledged: No
Occurrences: 10
Original Severity: Critical
Previous Severity: Critical
Highest Severity: Critical
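
When the same fault shows up across several chassis, it can help to turn the pasted fault detail into a structure and filter on the code. The following is a minimal sketch that assumes the fault text is copied as plain "Key: Value" lines like the ones above. The function names `parse_fault` and `is_chassis_thermal_fault` are hypothetical helpers written for this post, not part of any Cisco tool or API.

```python
# Fault fields copied from the UCSM fault detail shown above (subset).
FAULT_TEXT = """\
Severity: Critical
Code: F0411
ID: 2405504
Description: Thermal condition on chassis 13 is upper-non-recoverable
Affected Object: sys/chassis-13
Cause: Thermal Problem
Occurrences: 10
"""


def parse_fault(text):
    """Split 'Key: Value' lines into a dict (hypothetical helper)."""
    fault = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fault[key.strip()] = value.strip()
    return fault


def is_chassis_thermal_fault(fault):
    """True when the record is fault code F0411 raised on a chassis object."""
    affected = fault.get("Affected Object", "")
    return fault.get("Code") == "F0411" and affected.startswith("sys/chassis-")


fault = parse_fault(FAULT_TEXT)
# The chassis number is the part after the last dash in 'sys/chassis-13'.
chassis_id = fault["Affected Object"].rsplit("-", 1)[1]
print(chassis_id, is_chassis_thermal_fault(fault))
```
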

 

Problem description
After the firmware upgrade of FI-B (the subordinate) completed, a thermal problem was reported and fans 5, 6, 7, and 8 went to equipment-inoperable. These symptoms match a known issue caused by I2C delay.
The fans run at 100%. When FAN 1-5 is removed and reinstalled, the removal is not detected.
Workaround
I received an update that the fan LED is off even though the fan is working, which sounds like an I2C problem.
There is a similar bug (CSCwb27664), which seems to have been mitigated in UCSM 4.2(3b) and 4.2(2c).
1. Reseat the fan modules one at a time. Wait 30 seconds before reinserting each fan, and make sure the reseated fan is fully operational before moving on to the next one.
2. Reseat the IO Module on the subordinate side first. Reseating an IO module reboots it, which disrupts network traffic on that side. Make sure all the VIFs come back online before reseating the other IO module.
3. Reseat the IO Module on the primary side. This IO module also reboots, which disrupts network traffic on that side. It may come online and disconnect again a couple of times.
4. When reseating an IOM, wait 5 minutes before reinserting it. Wait for the IOM to come online and make sure the servers regain redundant paths before reseating the next IOM in the chassis.
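
The ordering and wait times in the steps above are easy to get wrong on site, so here is a small sketch that prints them as a checklist. It only builds a list of steps and touches no hardware. The fan numbering 1 to 8 is an assumption (the faults above mention fans up to 8), and `Step` and `build_reseat_plan` are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # what to do by hand
    wait_seconds: int  # how long to wait before reinserting
    verify: str        # what to confirm before the next step


def build_reseat_plan(fan_ids, iom_order):
    """Build the checklist: fans one by one, then IOMs in the given order."""
    plan = []
    for fan in fan_ids:
        plan.append(Step(f"Reseat fan {fan}", 30,
                         f"fan {fan} is fully operational"))
    for iom in iom_order:
        plan.append(Step(f"Reseat IOM ({iom} side)", 300,
                         "IOM is online, all VIFs are up, "
                         "servers have redundant paths"))
    return plan


# Assumption: eight fans numbered 1..8; the subordinate IOM goes first.
plan = build_reseat_plan(range(1, 9), ["subordinate", "primary"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step.action}: wait {step.wait_seconds}s, "
          f"then check that {step.verify}")
```

The list is built up front so that the same plan can be printed, reviewed, or ticked off during the maintenance window.
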

After carrying out the action plan above, only FAN 5 returned to normal status. FAN 6 and FAN 7 were still inoperable.

 

I searched for similar issues on community.cisco.com and found reports that the issue disappears after a chassis decommission and a chassis power cycle.
I decommissioned all servers, decommissioned the chassis, powered the chassis down, waited 10 minutes, and reconnected it, but FAN 6 and FAN 7 were still not recognized and the thermal problem was still there. The issue was resolved after swapping FAN 6 and FAN 7 with fans from another chassis. The only way to resolve the issue is to reseat the fans, reload the IOMs, and power cycle the chassis. Other options are a chassis RMA or an FI firmware update.

1. Decommission the chassis.
2. Remove the power cable from the chassis.
3. Wait 10 minutes.
4. Reconnect the power cable to the chassis.
5. Reseat all servers, PSUs, fans, and IOMs.


Reference