I am receiving occasional Host Login Failed warnings via dfm for the cluster admin account at different times of the day. These clear themselves automatically.
Is there any way to find out where these logins are coming from?
If it were DFM itself trying to access the cluster (a bad password in the config, for example), I wouldn't expect the alarm to clear.
OCUM version 188.8.131.5208
Clustered DataONTAP version 8.1.2P3
An example of the alarm email is below:
A Warning event at 20 Jun 18:53 GMT Daylight Time on Cluster SEN_SAN_CLU01:
Host Login Failed.
Host SEN_SAN_CLU01 user admin login failed
Click below to see the details of this event.
*** Event details follow.***
DataFabric Manager server Serial Number: 1-50-017635 Alarm Identifier: 2
Event Identifier: 9167
Event Name: Host Login Failed
Event Description: Host Login
Event Severity: Warning
Event Timestamp: 20 Jun 18:53
Source of Event
Source Identifier: 132
Source Name: SEN_SAN_CLU01
Source Type: Cluster
Source Status: Warning
My concern is that this is the admin account (cluster administrator). I'd like to identify whether this is a bad password in a configuration somewhere or someone attempting to log in to the cluster.
I would not expect that to be a configuration problem within UM, since the hostPassword is only entered once for the entire cluster, and if it works one time it should continue to work.
This might be indicative of a node configuration problem or network access to a particular node(s).
Does the warning only occur for one node (SEN_SAN_CLU01) or all of them?
I've had these alerts from two clusters in two separate datacentres at various times over the last week. That does sound like a network connectivity issue, but I would've thought there would be a "host down" message in that case, not a failed login?
If there's no method of obtaining more information from the OCUM logs I'll see if I can get more info from the clusters themselves.
Using the default settings of UM 5.1, in order to produce a host down event/alert the host would need to be down for a significant amount of time. This behavior was changed in UM 5.2 under bug 614983 (no public report at this time).
OnCommand Unified Manager Core uses five different methods to identify if a host is down:
The default behavior for a host down monitor run is a ping using ICMP echo and then snmpwalk. UM will retry each method a pre-configured number of times with varying timeouts, as seen below.
While the ICMP retries and timeouts have remained the same over the 5.x code line, the SNMP timeouts were increased in UM 5.1 for 7DOT and even more for 5.1 cDOT installations.
Due to changes under bug 614983, if pingMonTimeout is set to less than or equal to 5 seconds, then the SNMP timeout for host down (pingmon) monitoring will be 5 seconds. If the pingMonTimeout is set to a value greater than 5 seconds, then the pingMonTimeout is used as the SNMP timeout. The global MonSNMPTimeout is used for all other SNMP connections. This applies to both 7DOT and cDOT versions of UM 5.2.
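The UM 5.2 rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not UM's actual code; the function name and the use of seconds as the unit are assumptions made for the example.

```python
def pingmon_snmp_timeout(ping_mon_timeout_s: float) -> float:
    """Return the SNMP timeout (seconds) the host-down monitor would use.

    Hypothetical helper: it mirrors the rule quoted above, where values
    at or below 5 seconds are replaced by 5 seconds and larger values
    are used unchanged.
    """
    floor_s = 5.0  # minimum SNMP timeout for pingmon, per the rule above
    if ping_mon_timeout_s <= floor_s:
        return floor_s
    return float(ping_mon_timeout_s)


if __name__ == "__main__":
    for configured in (3, 5, 30):
        print(configured, "->", pingmon_snmp_timeout(configured))
```

In other words, the effective pingmon SNMP timeout is the larger of 5 seconds and the configured pingMonTimeout.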
UM 5.0.x default values:
pingMonInterval 1 minute
UM 5.1 7DOT default values:
pingMonInterval 1 minute
UM 5.1/5.2 cDOT default values:
pingMonInterval 1 minute
Therefore, if a clustered ONTAP controller is down for less than 5 minutes, UM 5.1 will not report it as down, because the outage will not have exceeded the first timeout value for the host down check. If the ping method is changed to echo or http, the node down event is logged.
Changing the monSNMPTimeout to the 5.0.x default value of 5 seconds allows UM to determine the host down status with the echo_snmp method. However, it is not recommended that this value be adjusted lower than the default for cDOT UM 5.1 servers, as some SNMP transactions can take a few minutes to complete and should not be sent multiple times in under 5 minutes.
Checking the cluster for auth failures suggests this isn't a user getting the password wrong!
SEN_SAN_CLU01::*> event status show -messagename login.auth.loginDenied
Node              Message                      Occurs Drops Last Time
----------------- ---------------------------- ------ ----- -------------------
SEN_SAN_CLU01-02  login.auth.loginDenied            1     0 5/8/2013 13:30:33
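If anyone wants to compare these counters against the DFM alert times across several nodes, a small Python sketch along these lines could pull the rows out of pasted output. The column layout is assumed from the single sample above, and the function name is made up for illustration; other ONTAP versions may print the table differently.

```python
SAMPLE = """\
Node              Message                      Occurs Drops Last Time
----------------- ---------------------------- ------ ----- -------------------
SEN_SAN_CLU01-02  login.auth.loginDenied            1     0 5/8/2013 13:30:33
"""


def parse_event_status(text):
    """Parse pasted 'event status show' output into a list of dicts.

    Assumes each data row has six whitespace-separated fields:
    node, message, occurs, drops, date, time. Header and separator
    lines are skipped because their count fields are not numeric.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[2].isdigit() and parts[3].isdigit():
            rows.append({
                "node": parts[0],
                "message": parts[1],
                "occurs": int(parts[2]),
                "drops": int(parts[3]),
                "last_time": parts[4] + " " + parts[5],
            })
    return rows


if __name__ == "__main__":
    for row in parse_event_status(SAMPLE):
        print(row["node"], row["occurs"], row["last_time"])
```

With the sample above, this yields one row for SEN_SAN_CLU01-02 with a single occurrence, which does not line up with repeated warnings from DFM.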
I'll speak to the networking team!
Thanks for your assistance Kevin.
Just to close this thread off, it turns out this was due to OnCommand using HTTPS to communicate with the clusters.
Changing the host protocol to HTTP / Port 80 via the dfm command line stopped the warnings being generated.
dfm host set <CLUSTERNAME> hostAdminPort=80 hostAdminTransport=http
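For anyone with more than one cluster to change, here is a rough Python sketch that just builds the same command line per cluster so it can be reviewed before running. The helper name is hypothetical and the script deliberately does not execute anything; the only dfm syntax used is the command quoted above.

```python
import shlex


def dfm_http_transport_cmd(cluster_name):
    """Build the argument list for the 'dfm host set' command shown above.

    Hypothetical helper: it only assembles the arguments and does not
    run dfm itself.
    """
    return [
        "dfm", "host", "set", cluster_name,
        "hostAdminPort=80",
        "hostAdminTransport=http",
    ]


if __name__ == "__main__":
    # SEN_SAN_CLU01 is the cluster from this thread; add other names as needed.
    for name in ["SEN_SAN_CLU01"]:
        print(shlex.join(dfm_http_transport_cmd(name)))
```

The printed lines can then be pasted into a shell on the DFM server once they have been checked.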
Thought I would add to this thread as it's one of the few that helped me. I had a call open with NetApp support for 3 months on this error. We had set the transport to HTTP, but that did not resolve the error for us. It was only after adding a third cluster to DFM that we saw the new cluster error once and then produce no further Host Login Failed errors, while both original clusters were erroring 20 to 30 times a day.
Yesterday I removed one of the clusters from DFM, carried out a purge on DFM, and added it back in, and the Host Login Failed errors have stopped for that cluster.
I have informed NetApp and they want me to hold off removing and re-adding the last of the erroring clusters so they can get some information out of the system.
Hopefully this will help others too if they still have issues.