Controller status becomes degraded due to CriticalCECCCountMemErrAlert
Applies to
- FAS9000
- ONTAP 9.1P7
Issue
- The controller shows a degraded status, as shown below:
cluster::> system controller show
Controller Name           System ID     Serial Number     Model    Status
------------------------- ------------- ----------------- -------- -----------
cluster-01                512345785     701234567892      FAS9000  degraded
cluster-02                512345787     701234567893      FAS9000  ok
2 entries were displayed.
- The following health alert is logged:
cluster::> system health alert show
Node: cluster-01
Resource: DIMM-15
Severity: Critical
Indication Time: Thu Aug 26 04:20:47 2021
Suppress: false
Acknowledge: false
Probable Cause: The DIMM has degraded, leading to memory errors.
Possible Effect: Memory issues can lead to a catastrophic system panic,
                 which can lead to data downtime on the node.
Corrective Actions: 1. Contact technical support to obtain a new DIMM of the same specification.
                    2. If possible, perform a takeover of this node and bring the node down for maintenance.
                    3. Refer to the DIMM replacement guide for your given hardware platform to replace the DIMM.
                    4. Bring the storage system online.

cluster::> system health alert show -instance
Node: cluster-01
Monitor: controller
Alert ID: CriticalCECCCountMemErrAlert
Alerting Resource: DIMM-15
Subsystem: Memory
Indication Time: Thu Aug 26 04:20:47 2021
Perceived Severity: Critical
Probable Cause: DIMM_Degraded
Description: The DIMM has degraded, leading to memory errors.
Corrective Actions: 1. Contact technical support to obtain a new DIMM of the same specification.
                    2. If possible, perform a takeover of this node and bring the node down for maintenance.
                    3. Refer to the DIMM replacement guide for your given hardware platform to replace the DIMM.
                    4. Bring the storage system online.
Possible Effect: Memory issues can lead to a catastrophic system panic, which can lead to data downtime on the node.
Acknowledge: false
Suppress: false
Policy: CriticalCECCCountMemErrAlertPolicy
Acknowledger: -
Suppressor: -
Additional Information: Slot Name: DIMM-15
                        CPU Socket: 0
                        Channel: 0
                        Slot number on a channel: 1
                        Correctable ECC error count: 1075
                        Uncorrectable ECC error count: 0
                        Correctable ECC error Limit: 500
                        Node Serial Number: 701234567892
                        Node Model: FAS9000
                        Alerting Resource Name: DIMM-15
                        Additional Alert Tags: device
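In the Additional Information field of the alert above, the correctable ECC error count (1075) is higher than the reported limit (500). The following Python sketch shows one way to extract those two values from saved alert text and compare them. It is an illustration only: the helper names parse_fields and cecc_over_limit are hypothetical, and the count-greater-than-limit comparison is an assumption based on the field names, not a documented description of how ONTAP raises the alert.

```python
# Excerpt copied from the "Additional Information" field of the alert above.
ALERT_INFO = """\
Slot Name: DIMM-15
Correctable ECC error count: 1075
Uncorrectable ECC error count: 0
Correctable ECC error Limit: 500
"""


def parse_fields(text):
    """Split 'Key: value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip().lower()] = value.strip()
    return fields


def cecc_over_limit(fields):
    """Return True when the correctable ECC count exceeds the reported limit.

    Assumption: 'greater than' is the relevant comparison; the source
    output only shows the two numbers side by side.
    """
    count = int(fields["correctable ecc error count"])
    limit = int(fields["correctable ecc error limit"])
    return count > limit


fields = parse_fields(ALERT_INFO)
print(fields["slot name"], cecc_over_limit(fields))
```

A check like this can help when reviewing alert output collected from several nodes, but the corrective actions listed in the alert itself remain the reference for what to do next.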