
[UCS B200M5] POWER_SYS_FLT:Sensor Failure

by 네오마드 2024. 2. 28.

I have run into the POWER_SYS_FLT:Sensor Failure error several times on Cisco UCS B200 M5 blades.

Symptoms

1.1 Error messages

Affected object: sys/chassis-x/blade-x/health-led

Description: sys/chassis-x/blade-6/health-led shows error. Reason POWER_SYS_FLT:Sensor Failure Asserted

 

Affected object: sys/chassis-x/blade-x/board

Description: Motherboard of server x/x(service profile:org-root/ls-xxxx-x-x) power: failed

Cause: power-problem

 

Affected object: sys/chassis-x/blade-x

Description: Server x/x(service profile:org-root/ls-xxxx-x-x) oper state:inoperable

Cause: power-problem

 

After decommissioning the blade and reacknowledging the server, the server does not complete discovery.

Progress Status: 7%. 

Remote Invocation Result: End Point Protocol Error

Remote Invocation Error Code:1002

Remote Invocation Description: Unable to change server power state-MC Error(-20): Management controller cannot or failed in processing request 

 

Workaround

2.1 Blade power status check

From the UCSM CLI, connect to the CIMC of the blade and check the blade's power status with the power command:

  • ssh FI-IP-ADDR
  • connect cimc X
  • power
Failure Scenario #1
OP:[ status ]
Power-State:              [ on ]
VDD-Power-Good:           [ inactive ]  
Power-On-Fail:            [ active ]       
Power-Ctrl-Lock:          [ unlocked ]
Power-System-Status:      [ Good ]
Front-Panel Power Button: [ Enabled ]
Front-Panel Reset Button: [ Enabled ]
OP-CCODE:[ Success ]
Failure Scenario #2 
OP:[ status ]
Power-State:              [ off ]
VDD-Power-Good:           [ inactive ]
Power-On-Fail:            [ inactive ]
Power-Ctrl-Lock:          [ permanent lock ]  <<<----------------
Power-System-Status:      [ Bad ]                <<<---------------
Front-Panel Power Button: [ Disabled ]
Front-Panel Reset Button: [ Disabled ]
OP-CCODE:[ Success ]

My blade matched failure scenario #2.
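When several blades need checking, it can help to parse the power output instead of reading it by eye. The following is a minimal Python sketch, not a Cisco tool: the function names are hypothetical, and the matching rules are my own assumption based only on the two outputs shown above (Power-On-Fail active for scenario #1, permanent lock plus Bad status for scenario #2). It works on text you have already captured from the CIMC session.

```python
import re

# Sample copied from failure scenario #2 above.
SAMPLE = """\
OP:[ status ]
Power-State:              [ off ]
VDD-Power-Good:           [ inactive ]
Power-On-Fail:            [ inactive ]
Power-Ctrl-Lock:          [ permanent lock ]  <<<----------------
Power-System-Status:      [ Bad ]                <<<---------------
Front-Panel Power Button: [ Disabled ]
Front-Panel Reset Button: [ Disabled ]
OP-CCODE:[ Success ]
"""

# Matches "Key: [ value ]" and ignores anything after the closing bracket.
LINE_RE = re.compile(r"^\s*([^:\[\]]+):\s*\[\s*(.*?)\s*\]")


def parse_power_status(text):
    """Turn captured `power` output into a {field: value} dict."""
    status = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            status[match.group(1).strip()] = match.group(2)
    return status


def classify_power_status(status):
    """Label the output using the two scenarios in this post (assumed rules)."""
    if (status.get("Power-Ctrl-Lock") == "permanent lock"
            and status.get("Power-System-Status") == "Bad"):
        return "scenario 2"
    if (status.get("Power-On-Fail") == "active"
            and status.get("VDD-Power-Good") == "inactive"):
        return "scenario 1"
    return "unknown"


if __name__ == "__main__":
    print(classify_power_status(parse_power_status(SAMPLE)))
```

Anything that does not match either pattern is reported as unknown rather than guessed at.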

 

2.2 SEL log check

Sel.log:

CIMC | Platform alert POWER_ON_FAIL #0xde | Predictive Failure asserted | Asserted

The power-on-fail.hist file is inside tmp/techsupport_pidXXXX/CIMCX_TechSupport-nvram.tar.gz.
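To pick POWER_ON_FAIL entries out of a longer log, a small filter can be sketched as below. It assumes each entry has exactly the four pipe-separated fields seen in the line above; the field names (source, sensor, event, state) are labels I made up for illustration, and real SEL exports may carry extra columns that this sketch simply skips.

```python
def parse_sel_line(line):
    """Split one pipe-separated SEL entry into named fields (hypothetical names)."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 4:
        return None  # Not in the four-field form shown in this post.
    source, sensor, event, state = fields
    return {"source": source, "sensor": sensor, "event": event, "state": state}


def find_power_on_fail(lines):
    """Return parsed entries whose sensor field mentions POWER_ON_FAIL."""
    entries = (parse_sel_line(line) for line in lines)
    return [e for e in entries if e and "POWER_ON_FAIL" in e["sensor"]]


if __name__ == "__main__":
    log = [
        "CIMC | Platform alert POWER_ON_FAIL #0xde | Predictive Failure asserted | Asserted",
    ]
    for entry in find_power_on_fail(log):
        print(entry["sensor"], "->", entry["state"])
```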

 

2.3 Reset slot

2.3.1 In UCSM, navigate to Equipment > Chassis X > Server Y > General > Server Maintenance > Decommission > OK.

2.3.2 FI-A/B# reset slot x/y

For example, if Chassis 2, Server 1 is impacted:

FI-A# reset slot 2/1

Wait 30-40 seconds after running the above command.

2.3.3 Reacknowledge the server.

We tried resetting the slot, but it did not resolve the issue.
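The x/y argument in step 2.3.2 can be read off the affected object in the fault. Here is a small Python sketch that builds the command string from a DN such as sys/chassis-2/blade-1/health-led. It assumes the chassis and blade numbers in the DN map directly to x and y, which matches the Chassis 2, Server 1 example above; the function name is hypothetical, and the code only builds the string without running anything on the FI.

```python
import re

# Accepts the blade DN itself or any child object under it (board, health-led, ...).
DN_RE = re.compile(r"sys/chassis-(\d+)/blade-(\d+)(?:/.*)?")


def reset_slot_command(dn):
    """Build the `reset slot x/y` string from a fault's affected-object DN."""
    match = DN_RE.fullmatch(dn)
    if match is None:
        raise ValueError(f"not a blade DN: {dn!r}")
    chassis, blade = match.groups()
    return f"reset slot {chassis}/{blade}"


if __name__ == "__main__":
    print(reset_slot_command("sys/chassis-2/blade-1/health-led"))
```

Anonymized DNs like sys/chassis-x/blade-x raise ValueError, so a placeholder never turns into a command by accident.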

 

2.4 RMA the motherboard and CPU2

Cisco TAC recommended replacing the motherboard for this issue.

The symptoms persisted even after the motherboard was replaced.

Cisco TAC then suggested a minimum configuration test.

1. CPU1 + memory A1 + VIC

Discovery completed normally with this minimum configuration.

Since the blade worked without CPU2 installed, the test pointed to a problem with CPU2.

2. Replacing CPU2 resolved the issue.

 

Reference:

https://www.cisco.com/c/en/us/support/docs/servers-unified-computing/unified-computing-system/214047-troubleshooting-ucs-blade-discovery-issu.html