Blackboard Outage
Incident Report for OIT Services
Postmortem

Problem Impact Analysis

Event Occurrence: 10/27/2020 5:10 AM - 10:30 AM

Background

The NetApp Enterprise Storage Service provides storage volumes for many services at the University of Alaska.  It is used for Windows File Shares, Network File System (NFS) volumes for many services including Banner and Blackboard Learn, and VMFS volumes for VMware workloads in Fairbanks.  It also provides a method for snapshots and backups. Domain Name Services (DNS) resolve host names to IP addresses. A fully qualified domain name (FQDN) combines the hostname and domain of a server.

Break Down of the Problem

Starting at around 4:50 AM, various servers that relied on NFS volumes configured in the NetApp storage started receiving permission denied errors when attempting to access those volumes.  The inaccessible volumes had only FQDNs listed in their NFS Export-Policies. Export-Policies with client IPs were unaffected.  While troubleshooting, technicians were also unable to log in to ONTAP System Manager with domain credentials, and CIFS shares under \\uastora.apps.ad.alaska.edu\ were not allowing users to access files.  All of these services depend upon DNS resolution external to the NetApp servers to map hostnames to IPs for network communication. Authentication services also need DNS. Around 9:30 AM, technicians determined that DNS resolution was not functioning. They identified that DNS attempts were going over an unexpected network on the NetApp and took corrective action to return service, including disabling a network interface and flushing the name services cache. This corrected the issue around 10:30 AM.

Target State / Goal 

NetApp volumes should be highly available and fault tolerant with no unplanned downtime for the clients utilizing the service.

Root Cause Analysis 

On 10/26/2020, a new Logical Interface (LIF) was added to the NetApp’s network configuration.  The new LIF was assigned to a network that was not yet accessible by the NetApp.  The NetApp will use the last LIF to come online for name resolution.  Since the new LIF was unable to communicate on the network to which it was assigned, the NetApp lost the ability to resolve names using DNS.  This was not immediately detected by the technicians configuring the new LIF because the NetApp caches the results of DNS lookups and logic evaluations to reduce how often it must repeat them, and these caches allowed the NetApp to continue to function normally for a period of time.
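The delayed onset described above can be illustrated with a minimal sketch of a time-limited lookup cache. This is not ONTAP's actual caching logic; the class name CachingResolver, the 100-second lifetime, the host name, and the address are all hypothetical and chosen only to show why a broken resolver can go unnoticed until cached entries expire.

```python
from typing import Callable, Dict, Optional, Tuple


class CachingResolver:
    """Hypothetical name cache: answers from memory until an entry expires."""

    def __init__(
        self,
        upstream: Callable[[str], Optional[str]],
        ttl_seconds: float,
        clock: Callable[[], float],
    ) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> Optional[str]:
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            # Entry is still fresh: the upstream resolver is not consulted,
            # so an upstream failure stays invisible.
            return hit[0]
        address = self._upstream(name)
        if address is None:
            # Upstream lookup failed and nothing fresh is cached.
            self._cache.pop(name, None)
            return None
        self._cache[name] = (address, now + self._ttl)
        return address


# Simulated timeline (all values are illustrative assumptions).
now = [0.0]
dns_up = [True]


def upstream(name: str) -> Optional[str]:
    return "10.0.0.5" if dns_up[0] else None


resolver = CachingResolver(upstream, ttl_seconds=100.0, clock=lambda: now[0])

first = resolver.resolve("app1.example.edu")         # populates the cache
dns_up[0] = False                                    # resolver path breaks
now[0] = 50.0
during_cache = resolver.resolve("app1.example.edu")  # still served from cache
now[0] = 150.0
after_expiry = resolver.resolve("app1.example.edu")  # entry expired, lookup fails
```

In this sketch the lookup keeps succeeding after the resolver path breaks and only fails once the cached entry has expired, which mirrors the gap between the 10/26/2020 change and the first errors the following morning.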

Starting around 4:50 AM, cache entries began expiring.  When an NFS client tries to access a share, the NetApp consults an export policy to determine whether to allow access to the client.  The NetApp ignored any export policy entry that used a server name it could no longer resolve to a network address.  Once every entry that granted a client access to a share had been eliminated in this way, the NetApp stopped allowing the client to use the NFS share.
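The evaluation described above can be sketched as follows. This is a simplified model under stated assumptions, not the NetApp's implementation: the function names, host names, and addresses are hypothetical, and real export policies support more match types (subnets, netgroups) than are shown here.

```python
import ipaddress
from typing import Callable, Iterable, Optional

Resolver = Callable[[str], Optional[str]]


def is_ip(entry: str) -> bool:
    """Return True when the policy entry is a literal IP address."""
    try:
        ipaddress.ip_address(entry)
        return True
    except ValueError:
        return False


def client_allowed(
    client_ip: str, policy_entries: Iterable[str], resolve: Resolver
) -> bool:
    """Hypothetical export policy check for a single client address."""
    for entry in policy_entries:
        if is_ip(entry):
            allowed_ip = entry  # literal IPs need no name lookup
        else:
            allowed_ip = resolve(entry)
            if allowed_ip is None:
                continue  # unresolvable name: the entry is ignored
        if allowed_ip == client_ip:
            return True
    return False  # no remaining entry grants access


def working_dns(name: str) -> Optional[str]:
    return {"app1.example.edu": "10.0.0.5"}.get(name)


def broken_dns(name: str) -> Optional[str]:
    return None  # every lookup fails, as during the incident


by_name_ok = client_allowed("10.0.0.5", ["app1.example.edu"], working_dns)
by_name_broken = client_allowed("10.0.0.5", ["app1.example.edu"], broken_dns)
by_ip_broken = client_allowed("10.0.0.5", ["10.0.0.5"], broken_dns)
```

With a failing resolver, the name-only policy denies the client while the IP-based policy still grants access, matching the observation that Export-Policies with client IPs were unaffected.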

Develop Countermeasures

  • Document standard configuration for advertising volumes. 
  • Investigate how we can monitor and alert on the name server resolution status
  • Consult NetApp on best practices for Export Policies related to IP vs. named entries in host file
  • For services that did not alert correctly (Blackboard Learn), also implement a status check for the mount’s ability to read and write data.
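Two of the countermeasures above, the name resolution check and the mount read/write check, could take roughly the following shape. This is a minimal sketch, not the checks that were implemented: the function names are hypothetical, and the name resolution check only tests resolution from the host running it, not the NetApp's own resolver, which would need to be queried through its management interface (not shown).

```python
import os
import socket
import uuid


def mount_is_read_write(mount_path: str) -> bool:
    """Write, read back, and remove a small probe file under mount_path."""
    probe = os.path.join(mount_path, f".healthcheck-{uuid.uuid4().hex}")
    payload = b"healthcheck"
    try:
        with open(probe, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the write through to the server
        with open(probe, "rb") as fh:
            ok = fh.read() == payload
    except OSError:
        # Covers permission denied, missing path, and stale handle errors.
        return False
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass
    return ok


def name_resolves(hostname: str, lookup=socket.getaddrinfo) -> bool:
    """Return True when hostname resolves to at least one address."""
    try:
        return len(lookup(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

A monitoring job could call mount_is_read_write on each NFS mount point and alert when it returns False. Note that a hard-mounted NFS volume whose server is unreachable may block instead of raising an error, so a production check would likely need to run under a timeout.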

Implementation of Countermeasures

  • January 29, 2021 - Document standard configuration for advertising volumes. 
  • November 27, 2020 - Investigate how we can monitor and alert on the name server resolution status
  • November 4, 2020 - Consult NetApp on best practices for Export Policies related to IP vs. named entries in host file
  • November 4, 2020 - Specific to UA Learn, implement a status check for the mount’s ability to read and write data.

Follow Up / Review

  • February 1, 2021 - Document standard configuration for advertising volumes. 
  • November 30, 2020 - Investigate how we can monitor and alert on the name server resolution status
  • November 6, 2020 - Consult NetApp on best practices for Export Policies related to IP vs. named entries in host file
  • November 6, 2020 - Specific to UA Learn, implement a status check for the mount’s ability to read and write data.
Posted Oct 28, 2020 - 10:38 AKDT

Resolved
The NetApps/Blackboard incident has been resolved.
Posted Oct 27, 2020 - 12:01 AKDT
Update
We are continuing to work on a fix for this issue.
Posted Oct 27, 2020 - 10:34 AKDT
Update
Blackboard access has been restored. We are still working on implementing a fix to restore access to NetApps and related services.
Posted Oct 27, 2020 - 10:34 AKDT
Update
The NetApps issue causing the Blackboard outage is also affecting access to shared drives on the uastora server. We are still working on implementing a fix.
Posted Oct 27, 2020 - 08:56 AKDT
Identified
The cause of the Blackboard outage has been identified and a fix is being implemented.
Posted Oct 27, 2020 - 08:13 AKDT
Investigating
Users navigating to the UA Blackboard page (classes.alaska.edu) are receiving an HTTP Status 508 error. We are currently investigating the cause of this issue.
Posted Oct 27, 2020 - 05:00 AKDT
This incident affected: UA Blackboard Learn (UA Blackboard Learn Web Application) and Accounts & Accesses.