UA Authentication/SSO System Outage
Problem Impact Analysis
Event Occurrence: 7/14/2024 11:57 AM - 7:04 PM
UA experienced a major outage lasting approximately 7 hours, during which staff, students, and faculty could not log in to business-critical applications using UA SSO. These applications include UAOnline, Google Workspace, Blackboard Learn, and Canvas. Existing SSO sessions were not impacted, but new sessions could not be created. Fairbanks clients and servers configured to use DHCP or the main DNS servers were also unable to resolve ad.alaska.edu DNS records for approximately 7 hours. This was not a malicious attack by a third-party threat actor. The IT support teams across the MAUs and System Office restored services without the loss of any critical data. They have also identified several areas for improvement in their internal procedures, and they plan to eliminate a single point of failure that exists in the network today and to remove the dependency on physical servers.
The UA Authentication/SSO service should be available 24/7 except during pre-defined maintenance windows. The DNS zones under ad.alaska.edu should be available 24/7 and configured in a resilient manner. Helpdesk staff and on-call staff should have the tools, permissions, and training they need to be successful. Notification and escalation procedures should be defined and automated where possible.
The main UA DNS servers, aduafns (137.229.15.5) and adswfns (137.229.15.9), perform recursive lookups of names in Active Directory’s ad.alaska.edu zone for on-site computers. This functionality is achieved by creating a stub zone on the main UA DNS servers configured with only one authoritative Active Directory DNS server, fbk-adroot01 (137.229.5.193), from which to retrieve records, making fbk-adroot01 a single point of failure.
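The effect of listing a single server in the stub zone can be illustrated with a small simulation. This is a minimal sketch, not the behaviour of the actual DNS software: `resolve_with_failover` and `simulated_query` are hypothetical helpers, and the second server address and the record shown are placeholders from the documentation address range rather than real UA hosts.

```python
from typing import Callable, Optional, Sequence


def resolve_with_failover(
    name: str,
    servers: Sequence[str],
    query: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Ask each configured server in turn; return the first answer, or None."""
    for server in servers:
        answer = query(server, name)
        if answer is not None:
            return answer
    return None


FBK_ADROOT01 = "137.229.5.193"   # the only server listed in the stub zone
SECOND_DC = "192.0.2.10"         # hypothetical second authoritative server
RECORDS = {"dc1.ad.alaska.edu": "192.0.2.50"}  # hypothetical record


def simulated_query(server: str, name: str) -> Optional[str]:
    """Simulate the outage: fbk-adroot01 never answers, other servers do."""
    if server == FBK_ADROOT01:
        return None
    return RECORDS.get(name)


# One configured server: the lookup fails while fbk-adroot01 is down.
single = resolve_with_failover("dc1.ad.alaska.edu", [FBK_ADROOT01], simulated_query)
# Two configured servers: the lookup falls through to the second server.
redundant = resolve_with_failover(
    "dc1.ad.alaska.edu", [FBK_ADROOT01, SECOND_DC], simulated_query
)
print(single)     # None
print(redundant)  # 192.0.2.50
```

With only fbk-adroot01 configured, every lookup depends on that one host; adding any second reachable server is enough for the simulated lookup to succeed.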
Sometime before 11:57 AM, fbk-adroot01 stopped responding to DNS requests for the ad.alaska.edu zone. Specifically, a faulty retaining clip on the Ethernet cable connected to fbk-adroot01 allowed the cable to pull loose, disconnecting fbk-adroot01 from the network.
For SSO, this resulted in the Shibboleth service failing to find the IPs for the domain controllers it was configured to use for authentication. Other services that were relying on the main UA DNS servers for resolving ad.alaska.edu records were also unable to resolve AD hostnames.
AD DNS remained available via other domain controllers; however, most Fairbanks clients and servers are configured to query only fbk-adroot01, either through the main UA DNS servers or through a name server configuration provided by DHCP.
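One way to notice this kind of failure is to probe each AD DNS server directly rather than only through the main DNS servers. The following is a minimal sketch using only the Python standard library; `build_query`, `server_answers`, and `check_servers` are hypothetical names, and the choice of an SOA query with a 2-second timeout is an assumption, not part of any existing UA monitoring.

```python
import socket
import struct
from typing import Callable, Dict, Iterable

QTYPE_SOA = 6  # SOA record type (RFC 1035)


def build_query(name: str, qtype: int = QTYPE_SOA, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for `name`."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS=IN


def server_answers(server_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if the server replies over UDP with RCODE 0 (no error)."""
    query_id = 0x1234
    query = build_query(name, query_id=query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == query_id and is_response and rcode == 0


def check_servers(
    servers: Iterable[str],
    zone: str,
    probe: Callable[[str, str], bool] = server_answers,
) -> Dict[str, bool]:
    """Probe every server for the zone and report which ones answered."""
    return {server: probe(server, zone) for server in servers}
```

A call such as `check_servers(["137.229.5.193"], "ad.alaska.edu")` would report whether fbk-adroot01 is answering for the zone; the `probe` argument can be replaced with a stub so the logic can be exercised without network access.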
Jul 15 - Inspect Ethernet cable and connection. Ensure the iDRAC connection for fbk-adroot01 is accessible.
Jul 15-23 - Research and collaborate with the Networking team regarding configuration options for the ad.alaska.edu zone on the blue cats. Related documents: Current State Diagram, Option 1, Option 2.
Jul 24 - Submit proposed config change to CAB and implement changes
Jul 25 - Update documentation related to DNS <-> AD DNS and train team members.
Jul 26 - Remove /etc/hosts entries
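The /etc/hosts cleanup could be supported by a small scan for leftover entries. This is a sketch under the assumption that the entries in question are lines naming hosts under ad.alaska.edu; `find_zone_entries` is a hypothetical helper, and the sample file content uses placeholder addresses and host names.

```python
from typing import List, Tuple


def find_zone_entries(hosts_text: str, zone: str = "ad.alaska.edu") -> List[Tuple[int, str]]:
    """Return (line number, line) for hosts-file entries naming hosts in `zone`."""
    zone = zone.lower()
    suffix = "." + zone
    matches = []
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop comments, then split into the address followed by host names.
        fields = line.split("#", 1)[0].split()
        names = [name.lower().rstrip(".") for name in fields[1:]]
        if any(name == zone or name.endswith(suffix) for name in names):
            matches.append((lineno, line.strip()))
    return matches


# Hypothetical file content for illustration only.
sample = """127.0.0.1 localhost
192.0.2.50 dc1.ad.alaska.edu dc1
# 192.0.2.51 dc2.ad.alaska.edu (commented out)
192.0.2.60 www.example.org
"""
print(find_zone_entries(sample))  # [(2, '192.0.2.50 dc1.ad.alaska.edu dc1')]
```

Reading the real file with `open("/etc/hosts").read()` and passing the text to `find_zone_entries` would list the lines to review before removal.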
Jul 15 - Ethernet cable and iDRAC connection: the cable in use had a damaged retaining clip. It has been replaced with a prefab cable with a working retaining clip.
Jul 15-16 - Initial discussions with team members regarding current configuration and potential solutions. Determined that more testing is needed for a bigger change.
Jul 17 - Met with the ESC group and raised the AD DNS configuration issues discovered: there are too many authoritative name servers in AD DNS to fit in a DNS response, not all of them are reachable, and standard zone forwarding cannot be used with this configuration.
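The response-size problem can be illustrated with a rough estimate. This sketch assumes the classic 512-byte limit for DNS responses over UDP without EDNS0 (RFC 1035), name compression on every record, one glue A record per name server, and name servers named directly under the zone. The host labels are hypothetical, and the report does not state which limit or server count actually applied.

```python
from typing import List

UDP_LIMIT = 512  # classic DNS-over-UDP message limit without EDNS0


def encoded_name_len(name: str) -> int:
    """Uncompressed wire length: one length byte per label plus the root byte."""
    labels = name.rstrip(".").split(".")
    return sum(1 + len(label) for label in labels) + 1


def ns_response_size(zone: str, ns_labels: List[str], with_glue: bool = True) -> int:
    """Estimate the size in bytes of an NS response for `zone`."""
    size = 12                             # fixed DNS header
    size += encoded_name_len(zone) + 4    # question: name + QTYPE + QCLASS
    for label in ns_labels:
        # NS record: 2-byte pointer to the zone name, then TYPE, CLASS,
        # TTL, and RDLENGTH (10 bytes), then the host label followed by
        # a 2-byte pointer back to the zone name.
        size += 2 + 10 + (1 + len(label) + 2)
        if with_glue:
            # Glue A record: 2-byte pointer, 10 fixed bytes, 4-byte address.
            size += 2 + 10 + 4
    return size


# Hypothetical name servers dc01 .. dc14 under ad.alaska.edu.
hosts = [f"dc{i:02d}" for i in range(1, 15)]
print(ns_response_size("ad.alaska.edu", hosts[:13]))  # 486, fits in 512 bytes
print(ns_response_size("ad.alaska.edu", hosts))       # 521, exceeds 512 bytes
```

Under these assumptions the limit is crossed at 14 name servers. Longer host names or additional record types would lower that number, which is consistent with the plan to reduce the number of authoritative name servers.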
Jul 22 - Meeting with AD Admins to discuss AD DNS configuration and to reduce the number of authoritative name servers by removing those at remote campuses and ensuring the remaining ones are all reachable. Doug Knight (PAWS) and Frank Forque (UAA) will work on it throughout the week.