Problem Impact Analysis
Event Occurrence: 10/27/2020 5:10 AM - 10:30 AM
The NetApp Enterprise Storage Service provides storage volumes for many services at the University of Alaska. It is used for Windows file shares, Network File System (NFS) volumes for many services including Banner and Blackboard Learn, and VMFS volumes for VMware workloads in Fairbanks. It also provides a method for snapshots and backups. Domain Name Services (DNS) resolve host names to IP addresses. A fully qualified domain name (FQDN) represents the hostname and domain of a server.
Starting at around 4:50 AM, various servers that relied on NFS volumes configured in the NetApp storage started receiving permission denied errors when attempting to access the NFS volume. The volumes that were inaccessible had only FQDNs listed on the volume’s NFS Export-Policies. Export-Policies with client IPs were unaffected. While troubleshooting, technicians were also unable to log in to ONTAP System Manager with domain credentials. CIFS shares under \\uastora.apps.ad.alaska.edu\ were also not allowing users to access files. All of these services depend upon DNS resolution external to the NetApp servers to map hostnames to IPs for network communication. Authentication services also need DNS. Around 9:30 AM, technicians determined that DNS resolution was not functioning. They found that DNS queries were leaving the NetApp over an unexpected network and took corrective action to restore service, including disabling a network interface and flushing the name services cache. This ultimately corrected the issue around 10:30 AM.
NetApp volumes should be highly available and fault tolerant with no unplanned downtime for the clients utilizing the service.
On 10/26/2020, a new Logical Interface (LIF) was added to the NetApp’s network configuration. The new LIF was assigned to a network that was not yet accessible by the NetApp. The NetApp will use the last LIF to come online for name resolution. Since the new LIF was unable to communicate on the network to which it was assigned, the NetApp lost the ability to resolve names using DNS. The technicians configuring the new LIF did not detect this immediately. The NetApp uses caches to reduce the number of DNS lookups and logic evaluations it needs to make, and these caches allowed it to continue to function normally for a period of time.
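The masking effect of the caches can be illustrated with a minimal sketch. This is not ONTAP's actual cache implementation: the class TtlCache, the 100-second lifetime, the host name app1.example.edu, and the address 10.0.0.5 are all hypothetical, chosen only to show how a cached answer can outlive a broken resolver.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical resolver type: returns an IP string, or None when DNS fails.
Resolver = Callable[[str], Optional[str]]


class TtlCache:
    """Toy name cache: answers are reused until their lifetime runs out."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def lookup(self, name: str, now: float, resolve: Resolver) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: the resolver is never consulted
        ip = resolve(name)  # entry missing or expired: ask DNS again
        if ip is None:
            self._entries.pop(name, None)
            return None
        self._entries[name] = (ip, now + self.ttl)
        return ip


def working_dns(name: str) -> Optional[str]:
    return {"app1.example.edu": "10.0.0.5"}.get(name)


def broken_dns(name: str) -> Optional[str]:
    return None  # stands in for DNS queries sent over an unreachable network


cache = TtlCache(ttl_seconds=100)
print(cache.lookup("app1.example.edu", now=0, resolve=working_dns))   # 10.0.0.5
print(cache.lookup("app1.example.edu", now=50, resolve=broken_dns))   # 10.0.0.5
print(cache.lookup("app1.example.edu", now=150, resolve=broken_dns))  # None
```

In this sketch the failure only becomes visible at the third lookup, after the cached entry has expired, which mirrors the delay between the configuration change and the first reported errors.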
Starting around 4:50 AM, cache entries began expiring. When an NFS client tries to access a share, the NetApp consults an export policy to determine whether to allow access to the client. The NetApp ignored any entry in the export policy that used a server name it could no longer resolve to a network address. Once every entry that granted a client access to a share had been ignored in this way, the NetApp stopped allowing the client to use the NFS share.
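The export policy behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not ONTAP's real rule-matching logic: the function client_allowed, the resolver callables, and the host names and addresses are hypothetical. It shows why policies listing client IPs kept working while policies listing only FQDNs denied access.

```python
import ipaddress
from typing import Callable, Optional, Sequence

# Hypothetical resolver type: returns an IP string, or None when DNS fails.
Resolver = Callable[[str], Optional[str]]


def is_ip(value: str) -> bool:
    """Return True if the rule entry is a literal IP address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def client_allowed(client_ip: str, rules: Sequence[str], resolve: Resolver) -> bool:
    """Allow the client if any rule matches; skip host names that do not resolve."""
    for rule in rules:
        if is_ip(rule):
            if rule == client_ip:
                return True  # IP rules need no DNS lookup
            continue
        resolved = resolve(rule)
        if resolved is None:
            continue  # unresolvable host name: the rule is ignored
        if resolved == client_ip:
            return True
    return False  # no usable rule granted access


def dns_up(name: str) -> Optional[str]:
    return {"app1.example.edu": "10.0.0.5"}.get(name)


def dns_down(name: str) -> Optional[str]:
    return None  # stands in for the failed DNS resolution during the event


print(client_allowed("10.0.0.5", ["app1.example.edu"], dns_up))    # True
print(client_allowed("10.0.0.5", ["app1.example.edu"], dns_down))  # False
print(client_allowed("10.0.0.5", ["10.0.0.5"], dns_down))          # True
```

The last two calls correspond to the two cases observed during the event: an FQDN-only policy denies the client once the name cannot be resolved, while a policy with the client IP is unaffected.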