Multidisk panic due to failed disks caused by faulty SAS port
Applies to
Issue
- During an ONTAP upgrade, Node-02 panics with a fatal multi-disk error and the partner node performs a takeover
[NODE-02: splog_main: mgr.stack.string:notice]: Panic string: aggr aggr1: raid volfsm, fatal multi-disk error.. Raid type - raid_dp Group name plex0/rg0 state RECONS. 12 disks failed in the group. Disk 0a.04.0
[NODE-02: splog_main: mgr.stack.proc:notice]: Panic in process: config_thread
- Disks appear fine on node that performed takeover
- When giveback is performed, Node-02 panics again
- SAS port 0b is unstable: the link is flapping and only one PHY is online
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.03.99 (0xfffff8077b99a040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.02.99 (0xfffff8077b9a4040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.01.99 (0xfffff8077b99e040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: rc: sas.adapter.offlining:info]: Offlining SAS adapter 0b.
[NODE-02: scsi_cmdblk_strthr_admin: scsi.cmd.adapterHardwareErrorEMSOnly:error]: Unknown device 0b.01.99: Adapter detected hardware error: HA status 0x6: cdb 0x12.
- The disks connected to this port show a large number of PHY changes as well as power cycles
- After offlining this port, system stability is restored and node no longer panics
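
To check whether the PHY resets cluster on a single SAS port, the log lines above can be tallied per port. The following is only a minimal sketch: the function name count_phy_resets_by_port and the regular expression are hypothetical helpers, not part of ONTAP, and they assume the EMS messages have already been exported as plain text in the format shown above.

```python
import re
from collections import Counter

# Assumption: PHY resets appear as "Hard resetting PHY: <port>.<shelf>.<id>",
# matching the sas.adapter.debug lines quoted in this article.
PHY_RESET_RE = re.compile(r"Hard resetting PHY: ([0-9a-z]+)\.\d+\.\d+")


def count_phy_resets_by_port(lines):
    """Return a Counter mapping SAS port name to number of PHY hard resets."""
    counts = Counter()
    for line in lines:
        match = PHY_RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the log excerpts in this article.
SAMPLE = [
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.03.99 (0xfffff8077b99a040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.02.99 (0xfffff8077b9a4040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.01.99 (0xfffff8077b99e040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: rc: sas.adapter.offlining:info]: Offlining SAS adapter 0b.",
]

resets = count_phy_resets_by_port(SAMPLE)
```

If nearly all resets land on one port, as they do for 0b in the sample, that supports treating the port rather than the individual disks as the suspect component.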
