AIQ Unified Manager 'Cluster monitoring failed' or 'Cluster not reachable' alert due to network lag with ONTAP cluster
Applies to
- Active IQ Unified Manager 9.6+ (UM)
- OnCommand Unified Manager 6.x/7.x/9.x (UM)
Issue
- A "Cluster monitoring failed" alert is received for random clusters at random times. However, acquisition succeeds most of the time and no gap in performance data is observed.
- "Cluster cannot be reached" email alerts are issued intermittently and become obsolete after approximately 15 minutes.
Errors similar to the following appear in au.log:
2020-03-19 06:35:07,290 ERROR [pool-3-thread-956] c.o.s.a.d.n.NetAppOCIEArchivePerformancePackage (NetAppOCIEArchivePerformancePackage.java:307) - Failed to get archive file names from zapi. java.net.ConnectException: Connection timed out (Connection timed out)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
[...]
... 20 more
Wrapped by: com.onaro.sanscreen.acquisition.framework.datasource.DataSourceErrorException: Communication problem with the cluster: <cluster_ip>
at com.onaro.sanscreen.acquisition.framework.datasource.DataSourceErrorException.createWithEnhanced(DataSourceErrorException.java:73) ~[au-framework.jar:9.6.0-2019.06.J5087]
[...]
Entries similar to the following appear in ocumserver.log:
[com.netapp.ipc.jms.OCIE_Events] OCIE JMS notification message received: {WarningCount=0, DatasourceName=x.x.x.x, DatasourceID=12,
Error0_ClusterManagementIP=x.x.x.x, PackageName=netappfoundation, TotalReportTime=569, PollStartTime=1591613772703, ErrorCount=1,
Success=false, DurationTime=23248, Error0_Message=Failed to connect to the cluster., TotalZAPITime=-1, NotificationType=PACKAGE_COMPLETED, Error0_Type=NETWORK_ACCESS_FAILURE, UpdateTime=1591613796437, Error0_Port=443, MessageType=PACKAGE_NOTIFICATION,
Error0_Zapi=service-processor-get}
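
When reviewing many such entries, it can help to pull the key=value fields out of the OCIE JMS notification and check for the network failure signature shown above. The following is a minimal sketch, not a NetApp-provided tool: the function names `parse_ocie_notification` and `is_network_access_failure` are hypothetical, and it assumes the payload always has the `{Key=value, Key=value, ...}` shape seen in the excerpt.

```python
import re

# Sample payload copied from the ocumserver.log excerpt above.
SAMPLE = (
    "[com.netapp.ipc.jms.OCIE_Events] OCIE JMS notification message received: "
    "{WarningCount=0, DatasourceName=x.x.x.x, DatasourceID=12,\n"
    "Error0_ClusterManagementIP=x.x.x.x, PackageName=netappfoundation, "
    "TotalReportTime=569, PollStartTime=1591613772703, ErrorCount=1,\n"
    "Success=false, DurationTime=23248, "
    "Error0_Message=Failed to connect to the cluster., TotalZAPITime=-1, "
    "NotificationType=PACKAGE_COMPLETED, Error0_Type=NETWORK_ACCESS_FAILURE, "
    "UpdateTime=1591613796437, Error0_Port=443, "
    "MessageType=PACKAGE_NOTIFICATION,\n"
    "Error0_Zapi=service-processor-get}"
)


def parse_ocie_notification(entry: str) -> dict:
    """Return the key=value fields found between the braces of a log entry."""
    match = re.search(r"\{(.*)\}", entry, re.DOTALL)
    if match is None:
        return {}
    fields = {}
    # Split only on commas that are followed by another "Key=" token, so a
    # comma inside a value does not break the parse. Line wraps count as
    # whitespace and are absorbed by \s*.
    for pair in re.split(r",\s*(?=\w+=)", match.group(1).strip()):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_network_access_failure(fields: dict) -> bool:
    """True when the poll failed with the NETWORK_ACCESS_FAILURE error type."""
    return (
        fields.get("Success") == "false"
        and fields.get("Error0_Type") == "NETWORK_ACCESS_FAILURE"
    )


if __name__ == "__main__":
    parsed = parse_ocie_notification(SAMPLE)
    if is_network_access_failure(parsed):
        print(
            parsed.get("DatasourceName"),
            parsed.get("Error0_Port"),
            parsed.get("Error0_Zapi"),
            parsed.get("Error0_Message"),
        )
```

Fields such as `Error0_Port` and `Error0_Zapi` show which port and ZAPI call were in use when the connection failed, and `PollStartTime` can be used to line the event up with the matching timeout in au.log.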