I have run into the POWER_SYS_FLT:Sensor Failure error several times on Cisco UCS B200 M5 blades.
Symptoms
1.1 Error messages
Affected object: sys/chassis-x/blade-x/health-led
Description: sys/chassis-x/blade-6/health-led shows error. Reason POWER_SYS_FLT:Sensor Failure Asserted
Affected object: sys/chassis-x/blade-x/board
Description: Motherboard of server x/x (service profile: org-root/ls-xxxx-x-x) power: failed
Cause: power-problem
Affected object: sys/chassis-x/blade-x
Description: Server x/x (service profile: org-root/ls-xxxx-x-x) oper state: inoperable
Cause: power-problem
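When many faults are pasted from UCSM, it can help to pull out only the objects whose cause is power-problem. The following is a minimal sketch, assuming the fault text uses the same "Affected object / Description / Cause" layout shown above; the function names are hypothetical and this is not a UCSM API.

```python
def parse_faults(text):
    """Group pasted fault lines into one dict per 'Affected object' entry."""
    faults = []
    for line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        if key == "affected object":
            faults.append({"object": value})
        elif faults and key in ("description", "cause"):
            faults[-1][key] = value
    return faults


def power_problem_objects(faults):
    """Return the affected objects whose cause is 'power-problem'."""
    return [f["object"] for f in faults if f.get("cause") == "power-problem"]
```

Calling power_problem_objects(parse_faults(pasted_text)) returns the list of affected objects to look at first.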
After decommissioning the blade and reacknowledging the server, the server does not complete discovery.
Progress Status: 7%.
Remote Invocation Result: End Point Protocol Error
Remote Invocation Error Code:1002
Remote Invocation Description: Unable to change server power state-MC Error(-20): Management controller cannot or failed in processing request
Workaround
2.1 Blade power status check
From the UCSM CLI shell, connect to the CIMC of the blade and verify the blade power status with the power command.
- ssh FI-IP-ADDR
- connect cimc x/y (x = chassis number, y = blade number)
- power
Failure Scenario #1
OP:[ status ]
Power-State: [ on ]
VDD-Power-Good: [ inactive ]
Power-On-Fail: [ active ]
Power-Ctrl-Lock: [ unlocked ]
Power-System-Status: [ Good ]
Front-Panel Power Button: [ Enabled ]
Front-Panel Reset Button: [ Enabled ]
OP-CCODE:[ Success ]
Failure Scenario #2
OP:[ status ]
Power-State: [ off ]
VDD-Power-Good: [ inactive ]
Power-On-Fail: [ inactive ]
Power-Ctrl-Lock: [ permanent lock ] <<<----------------
Power-System-Status: [ Bad ] <<<---------------
Front-Panel Power Button: [ Disabled ]
Front-Panel Reset Button: [ Disabled ]
OP-CCODE:[ Success ]
In my case, the output matched failure scenario #2.
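If the power output has been saved to a text file, the comparison against the two scenarios can be automated. This is a minimal sketch, not a Cisco tool: the function names are hypothetical, and the matching rules are my own reading of the two outputs above (scenario #1 = Power-On-Fail active with VDD-Power-Good inactive, scenario #2 = permanent lock with a Bad power system status).

```python
import re

# Matches lines such as "Power-Ctrl-Lock: [ permanent lock ]".
# Anything after the closing bracket (e.g. "<<<---" markers) is ignored.
FIELD_RE = re.compile(r"^\s*([^:\[\]]+?)\s*:\s*\[\s*(.*?)\s*\]")


def parse_power_status(text):
    """Turn the 'Key: [ value ]' lines of the power output into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def classify(fields):
    """Map parsed fields to one of the two failure scenarios (assumed rules)."""
    def value(key):
        return fields.get(key, "").lower()

    if value("Power-Ctrl-Lock") == "permanent lock" and value("Power-System-Status") == "bad":
        return "scenario 2"
    if value("Power-On-Fail") == "active" and value("VDD-Power-Good") == "inactive":
        return "scenario 1"
    return "no match"
```

For example, classify(parse_power_status(saved_output)) can be run against the text captured from the CIMC session.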
2.2 SEL log check
sel.log:
CIMC | Platform alert POWER_ON_FAIL #0xde | Predictive Failure asserted | Asserted
The power-on-fail.hist file is located inside tmp/techsupport_pidXXXX/CIMCX_TechSupport-nvram.tar.gz.
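To search a long SEL log for the entry above, a small filter is enough. This is a rough sketch, assuming the log lines are pipe-separated as in the sample and that the last field holds the assertion state; real SEL lines may carry extra leading columns, which this code tolerates. The function name is hypothetical.

```python
def find_power_on_fail(lines):
    """Return the split fields of every asserted POWER_ON_FAIL entry."""
    hits = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        has_sensor = any("POWER_ON_FAIL" in p for p in parts)
        # The last field is assumed to be the assertion state.
        if has_sensor and parts[-1].lower() == "asserted":
            hits.append(parts)
    return hits
```

Passing the lines of an extracted sel.log file (for example via open(path).read().splitlines()) returns only the matching entries.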
2.3 Reset slot
2.3.1 Navigate to Equipment > Chassis X > Server Y > General > Server Maintenance > Decommission > OK.
2.3.2 Run reset slot from the fabric interconnect CLI: FI-A/B# reset slot x/y
For example, if server 1 in chassis 2 is impacted:
FI-A# reset slot 2/1
Wait 30-40 seconds after running the above command.
2.3.3 Reacknowledge the server.
We tried resetting the slot, but it did not resolve the issue.
2.4 RMA motherboard & CPU2
Cisco TAC recommended replacing the motherboard for this issue.
The symptoms persisted even after the motherboard was replaced.
Cisco TAC then suggested a minimum configuration test.
1. Minimum configuration: CPU1 + memory A1 + VIC
Discovery completed normally with the minimum configuration.
This result pointed to CPU2 as the faulty component.
2. Replacing CPU2 resolved the issue.