
[UCSM] error accessing shared-storage

by 네오마드 2024. 2. 27.

 

We are currently having a problem with an annoying, recurring fault.

This is the main symptom on one of our UCS domains used for VDI.

Symptoms:

1.1 Main error message

Affected object: sys/mgmt-entity-A

Description: device chassis num., error accessing shared-storage

Type: management

Cause: device-shared-storage-error

Highest severity: Warning.

The fault appeared and then cleared by itself after a while.

 

1.2 Cluster state output

Checking the cluster state of the UCS domain shows errors for the chassis:

FI-A#sh cluster state

Chassis x, serial:xxxxxxxx, state:active with errors

Fabric A, chassis-seeprom local IO failure:

xxxxxxxx READ_FAILED, error: TIMEOUT, error code:10, error count:4

Warning: there are pending management I/O errors on one or more devices, failover may not complete
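When several chassis are involved, it can help to pull the SEEPROM error lines out of the pasted output automatically. The following is a minimal Python sketch that assumes the error line always has the same shape as the one quoted above. The function name `find_seeprom_errors` and the regular expression are my own illustration, not part of any Cisco tool.

```python
import re

# Pattern for a SEEPROM error line shaped like the one quoted above, e.g.
# "xxxxxxxx READ_FAILED, error: TIMEOUT, error code:10, error count:4"
# (assumption: other firmware versions may format this line differently)
SEEPROM_ERROR = re.compile(
    r"^(?P<serial>\S+)\s+(?P<status>[A-Z_]+),\s*"
    r"error:\s*(?P<error>\w+),\s*"
    r"error code:\s*(?P<code>\d+),\s*"
    r"error count:\s*(?P<count>\d+)"
)


def find_seeprom_errors(cluster_state_text):
    """Return one dict per SEEPROM I/O error line found in the text."""
    errors = []
    for line in cluster_state_text.splitlines():
        match = SEEPROM_ERROR.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["code"] = int(entry["code"])
            entry["count"] = int(entry["count"])
            errors.append(entry)
    return errors


# Sample text copied from the output shown in this post.
sample = """\
Chassis x, serial:xxxxxxxx, state:active with errors
Fabric A, chassis-seeprom local IO failure:
xxxxxxxx READ_FAILED, error: TIMEOUT, error code:10, error count:4
"""

for err in find_seeprom_errors(sample):
    print(err["serial"], err["status"], err["error"], err["count"])
```

A rising error count across repeated checks would suggest the I2C problem is still ongoing rather than a one-off.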

 

 

Workaround:

2.1

The error message indicates that the IOM had a problem accessing the SEEPROM on the chassis backplane.

This comes from a hardware limitation: communication between the IOM and the SEEPROM on the chassis backplane is not fully reliable (I2C delay).

(1) Unplug the IO module.
(2) Reinsert the IO module. Make sure the module is firmly seated against the backplane.
(3) Reboot the IO module.

2.2

A) If the "error accessing shared-storage" fault is currently cleared and is not raised again, do not apply the workaround; no action is needed.

B) If the fault stays raised and never clears, or keeps flapping (raised, then cleared, repeatedly), try the following:

2.2.1 SSH to the fabric interconnect.

2.2.2 FI-A# connect local-mgmt

2.2.3 FI-A(local-mgmt)# connect iom x   (x = chassis number)

2.2.4 fex-1# show platform soft cmc thermal status | grep status

 status:                   #ACTIVE

9(i1), 13(peer_therm_status),1(not_applicable), 0 "PEER_IOM_THERM"

=> If it says PASSIVE, you need to restart this IOM (IOM 1 on fabric A) via UCSM. If it says ACTIVE, it is the other side (IOM 2 on fabric B) that needs to be reset.

In my case the status was ACTIVE, so I needed to reset IOM 2 on fabric B.
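The PASSIVE/ACTIVE rule above is easy to get backwards, so here is a small Python sketch of it. The function `iom_to_reset` is a hypothetical helper of mine. The source only states the rule for a session connected to IOM 1 on fabric A, so the mirrored case (`connected_iom=2`) is an assumption by symmetry.

```python
def iom_to_reset(thermal_status, connected_iom=1):
    """Pick the IOM to reset, following the rule described in this post.

    thermal_status: the word shown after 'status:' in the IOM shell output.
    connected_iom: the IOM the command was run on (1 = fabric A, 2 = fabric B).
                   Only connected_iom=1 is described in the post; 2 is an
                   assumption by symmetry.
    """
    status = thermal_status.strip().lstrip("#").upper()
    peer_iom = 2 if connected_iom == 1 else 1
    if status == "PASSIVE":
        return connected_iom  # restart the IOM you are connected to
    if status == "ACTIVE":
        return peer_iom       # the other side needs the reset
    raise ValueError(f"unexpected thermal status: {thermal_status!r}")


# The output in this post showed "#ACTIVE" on IOM 1.
print(f"Reset IO Module {iom_to_reset('#ACTIVE')}")
```

With the status from this post, the sketch points at IO Module 2, which matches what I ended up resetting.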

 

2.2.5 To reset it in UCSM: Equipment -> IO Modules -> IO Module 2 -> Reset IO Module

2.2.6 If that does not solve the problem, reseat the chassis components one at a time, in the following order:

Remove PSU1, leave it out for 2 minutes, reinsert it, wait 10 seconds, confirm PSU1 has power, then move to PSU2.
Remove PSU2, leave it out for 2 minutes, reinsert it, wait 10 seconds, confirm PSU2 has power, then move to PSU3.
Remove PSU3, leave it out for 2 minutes, reinsert it, wait 10 seconds, confirm PSU3 has power, then move to PSU4.
Remove PSU4, leave it out for 2 minutes, reinsert it, wait 10 seconds, confirm PSU4 has power, then move to Fan1.

Remove Fan1, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan1 has power, then move to Fan2.
Remove Fan2, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan2 has power, then move to Fan3.
Remove Fan3, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan3 has power, then move to Fan4.
Remove Fan4, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan4 has power, then move to Fan5.
Remove Fan5, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan5 has power, then move to Fan6.
Remove Fan6, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan6 has power, then move to Fan7.
Remove Fan7, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan7 has power, then move to Fan8.
Remove Fan8, leave it out for 30 seconds, reinsert it, wait 10 seconds, confirm Fan8 has power, then move to IO Mod 1.

Remove IO Mod 1, leave it out for 5 minutes, then reinsert it. Confirm that IO Mod 1 is up and running before you reseat IO Mod 2.
Once IO Mod 1 is up and running, remove IO Mod 2, leave it out for 5 minutes, then place it back into the chassis.

This is the complete reseat process to clear the i2c bus.
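Because the order and wait times matter, a printed checklist is handy when standing in front of the chassis. The Python sketch below only generates that checklist from the timings listed above. The function `reseat_plan` and the default counts (4 PSUs, 8 fans, 2 IO modules) are assumptions taken from this particular chassis, and the script does not talk to any hardware.

```python
def reseat_plan(psu_count=4, fan_count=8):
    """Build the ordered checklist as (component, seconds_out, settle_seconds).

    Timings follow this post: PSUs out for 2 minutes, fans for 30 seconds,
    IO modules for 5 minutes. PSUs and fans get a 10-second wait after
    reinsertion. IO modules have no fixed wait (settle_seconds = 0): wait
    until the module is confirmed up and running instead.
    """
    plan = [(f"PSU{i}", 120, 10) for i in range(1, psu_count + 1)]
    plan += [(f"Fan{i}", 30, 10) for i in range(1, fan_count + 1)]
    plan += [("IO Mod 1", 300, 0), ("IO Mod 2", 300, 0)]
    return plan


plan = reseat_plan()
for step, (component, seconds_out, settle) in enumerate(plan, start=1):
    print(f"{step:2d}. Remove {component}, leave out {seconds_out} s, "
          f"reinsert, wait {settle} s, confirm it is up")

# Lower bound on the total time, excluding IO module boot time.
minimum_seconds = sum(out + settle for _, out, settle in plan)
print(f"Minimum duration: {minimum_seconds // 60} minutes")
```

The real procedure takes longer than the printed minimum, since each IO module must boot fully before the next step.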