UA Authentication and Hosted Services Outage
Incident Report for OIT Services
Postmortem

UA Authentication/SSO System Outage

Problem Impact Analysis

Event Occurrence: 7/13/2024 11:57 AM - 7:04 PM

Background

UA experienced a major outage lasting approximately 7 hours, during which staff, students, and faculty could not log in to business-critical applications using UA SSO. These applications include UAOnline, Google Workspace, Blackboard Learn, and Canvas. Existing SSO sessions were not impacted, but new sessions could not be created. Fairbanks clients and servers configured to use DHCP or the main DNS servers were also unable to resolve ad.alaska.edu DNS records for approximately 7 hours. This was not a malicious attack by a third-party threat actor. The IT support teams across the MAUs and System Office restored services without the loss of any critical data. In addition, they have identified several areas for improvement within their own internal procedures, and they plan to eliminate a single point of failure that exists in our network today and to remove the dependency on physical servers.

Breakdown of the Problem

Target State / Goal 

The UA Authentication/SSO service should be available 24/7 except during pre-defined maintenance windows. The DNS zones under ad.alaska.edu should be available 24/7 and configured in a resilient manner. Helpdesk staff and on-call staff should have the tools, permissions, and training they need to be successful. Notification and escalation procedures should be defined and automated where possible.

Root Cause Analysis 

The main UA DNS servers, aduafns (137.229.15.5) and adswfns (137.229.15.9), perform recursive lookups of names in Active Directory’s ad.alaska.edu zone for on-site computers.  This functionality is achieved by creating a stub zone on the main UA DNS servers configured with only one authoritative Active Directory DNS server, fbk-adroot01 (137.229.5.193), from which to retrieve records, making fbk-adroot01 a single point of failure.
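To illustrate why a stub zone with a single authoritative server is a single point of failure, the following is a minimal sketch in Python, not the actual DNS server configuration. The address 137.229.5.193 (fbk-adroot01) comes from this report; the second server 192.0.2.10, the dictionary layout, and all function names are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Optional

# Stub-zone configuration modeled as zone -> authoritative servers to query.
# 137.229.5.193 (fbk-adroot01) comes from the report; 192.0.2.10 is a
# hypothetical second domain controller (documentation-range address).
CURRENT_STUB_ZONES: Dict[str, List[str]] = {"ad.alaska.edu": ["137.229.5.193"]}
PROPOSED_STUB_ZONES: Dict[str, List[str]] = {
    "ad.alaska.edu": ["137.229.5.193", "192.0.2.10"],
}


def single_points_of_failure(zones: Dict[str, List[str]]) -> List[str]:
    """Return zones that depend on fewer than two distinct authoritative servers."""
    return sorted(zone for zone, servers in zones.items() if len(set(servers)) < 2)


def resolve_with_failover(
    zone: str,
    zones: Dict[str, List[str]],
    query: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try each configured server in order; return the first answer, else None."""
    for server in zones.get(zone, []):
        answer = query(server)
        if answer is not None:
            return answer
    return None


def fake_query_with_adroot01_down(server: str) -> Optional[str]:
    """Simulated lookup: fbk-adroot01 is unreachable, any other server answers."""
    if server == "137.229.5.193":
        return None
    return "answer from " + server
```

With the simulated outage, resolve_with_failover("ad.alaska.edu", CURRENT_STUB_ZONES, fake_query_with_adroot01_down) returns None, while the same call against PROPOSED_STUB_ZONES returns an answer from the second server. A check such as single_points_of_failure could be run against an exported zone configuration to flag zones that list only one server.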

Sometime before 11:57 AM, fbk-adroot01 stopped responding to DNS requests for the ad.alaska.edu zone.  Specifically, a faulty retaining clip on the ethernet cable connected to fbk-adroot01 allowed the cable to pull loose, disconnecting fbk-adroot01 from the network.

For SSO, this resulted in the Shibboleth service failing to find the IPs for the domain controllers it was configured to use for authentication. Other services that were relying on the main UA DNS servers for resolving ad.alaska.edu records were also unable to resolve AD hostnames.

AD DNS remained available via other domain controllers. However, most Fairbanks clients and servers are configured to query only fbk-adroot01, either through the main UA DNS servers or through a name server configuration provided by DHCP.

  

Develop Countermeasures 

  • Inspect and possibly replace the ethernet cable that became disconnected.
  • Evaluate the configurations that the alaska.edu DNS servers at UAS, UAF, UAA, and SO currently use to advertise the ad.alaska.edu namespace, since UAA and UAS were not as severely impacted. Identify any other single points of failure that should be addressed. Review technical capabilities and update the configuration so that it does not depend on a single domain controller.
  • Document and train all team members (UAA, OIT, UAS, NTS) on the new DNS zone delegation configuration for ad.alaska.edu.
  • Remove /etc/hosts entries that hardcode the IPs for the Domain Controllers on the SSO servers.
  • Review and improve outage escalation procedures for major service outages.
  • Monitor SSO logs and alert on-call technicians when a spike in errors occurs.
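The last countermeasure, alerting on a spike in SSO errors, could be approached with a sliding-window counter. The sketch below is only one possible approach and does not describe any monitoring tool in use at UA. The function name, the (timestamp, is_error) event format, and the default window and threshold values are assumptions; real SSO logs would first need to be parsed into this form.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def find_error_spikes(
    events: Iterable[Tuple[float, bool]],
    window_seconds: float = 300.0,
    threshold: int = 50,
) -> List[float]:
    """Return timestamps at which errors in the trailing window first reach threshold.

    events: (unix_timestamp, is_error) pairs in chronological order.
    One alert is recorded per spike; a new alert is only possible after the
    error count has dropped back below the threshold.
    """
    recent_errors: Deque[float] = deque()
    alerts: List[float] = []
    alerting = False
    for timestamp, is_error in events:
        # Drop errors that have aged out of the trailing window.
        while recent_errors and recent_errors[0] <= timestamp - window_seconds:
            recent_errors.popleft()
        if is_error:
            recent_errors.append(timestamp)
        if len(recent_errors) >= threshold:
            if not alerting:
                alerts.append(timestamp)
                alerting = True
        else:
            alerting = False
    return alerts
```

Recording only the first crossing of the threshold keeps a sustained outage from paging the on-call technician repeatedly. Suitable window and threshold values would have to be tuned against normal login failure rates.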

Implementation of Countermeasures

Jul 15 - Inspect ethernet cable and connection. Ensure iDRAC connection for fbk-adroot01 is accessible.

Jul 15-23 - Research and collaborate with the Networking team regarding configuration options for the ad.alaska.edu zone on the blue cats (Current State Diagram, Option 1, Option 2).

Jul 24 - Submit proposed config change to CAB and implement changes

Jul 25 - Update documentation related to DNS <-> AD DNS and train team members.

Jul 26 - Remove /etc/hosts entries
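Before removing the hardcoded entries, it may help to list which hosts-file lines pin names under ad.alaska.edu. The following is a minimal sketch that scans hosts-file text for such entries. The function name and the sample hostnames and addresses are hypothetical; the actual entries on the SSO servers are not given in this report.

```python
from typing import List, Tuple


def find_hardcoded_hosts(hosts_text: str, domain_suffix: str) -> List[Tuple[str, str]]:
    """Return (ip, hostname) pairs in hosts-file text whose name falls under domain_suffix."""
    suffix = domain_suffix.lower().rstrip(".")
    matches: List[Tuple[str, str]] = []
    for line in hosts_text.splitlines():
        # Strip trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        fields = entry.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            lowered = name.lower().rstrip(".")
            if lowered == suffix or lowered.endswith("." + suffix):
                matches.append((ip, name))
    return matches


# Hypothetical sample content; real entries would be read from /etc/hosts.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
# pinned domain controller entries (illustrative only)
192.0.2.10  dc-example01.ad.alaska.edu dc-example01
192.0.2.11  dc-example02.ad.alaska.edu   # pinned
"""
```

Calling find_hardcoded_hosts(SAMPLE_HOSTS, "ad.alaska.edu") returns the two fully qualified entries and ignores localhost and the short alias. On a server, the text could be read with open("/etc/hosts").read().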

Follow Up / Review

Jul 15 - Ethernet cable and iDRAC connection - the cable that was used had a damaged retaining clip. It has been replaced with a prefab cable with a working retaining clip.

Jul 15-16 - Initial discussions with team members regarding current configuration and potential solutions.  Determined that more testing is needed for a bigger change. 

Jul 17 - Meeting with ESC group, brought up AD DNS configuration issues discovered (too many authoritative Name Servers in AD DNS to fit in a DNS response, not all of them are reachable, unable to use standard zone forwarding with this configuration).
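As a rough illustration of how a long name server list can outgrow a DNS response, the sketch below estimates the wire size of an NS response. It is an approximation under stated assumptions, not a measurement of the actual AD DNS zone: it assumes every name server is named directly under the zone, that names are fully compressed, that each server has one IPv4 glue record, and that the relevant limit is the classic 512-byte UDP message size from RFC 1035 (this report does not say which limit was hit, and EDNS0 allows larger responses). The host labels used are hypothetical.

```python
from typing import List

HEADER_BYTES = 12          # fixed DNS header
RR_FIXED_BYTES = 10        # type (2) + class (2) + TTL (4) + RDLENGTH (2)
POINTER_BYTES = 2          # compressed name pointer
CLASSIC_UDP_LIMIT = 512    # RFC 1035 message size limit without EDNS0


def encoded_name_length(name: str) -> int:
    """Uncompressed wire length of a name: one length byte per label plus the root byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def estimate_ns_response_size(zone: str, host_labels: List[str], glue: bool = True) -> int:
    """Estimate the size in bytes of an NS response for `zone`.

    Assumes each name server is `<label>.<zone>`, all names are compressed,
    and (optionally) one IPv4 glue record per server in the additional section.
    """
    size = HEADER_BYTES + encoded_name_length(zone) + 4  # question: name + type + class
    for label in host_labels:
        # NS record: owner is a pointer to the zone; RDATA is one label plus a pointer.
        size += POINTER_BYTES + RR_FIXED_BYTES + (1 + len(label)) + POINTER_BYTES
        if glue:
            # A record: owner is a pointer to the NS name; RDATA is a 4-byte address.
            size += POINTER_BYTES + RR_FIXED_BYTES + 4
    return size


def hypothetical_servers(count: int) -> List[str]:
    """Generate placeholder 12-character host labels such as dc-example00."""
    return [f"dc-example{i:02d}" for i in range(count)]
```

Under these assumptions, 11 servers give an estimated 504 bytes and 12 servers give 547 bytes, which exceeds CLASSIC_UDP_LIMIT. Longer host names or IPv6 glue records would lower the number of servers that fit.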

Jul 22 - Meeting with AD Admins to discuss AD DNS configurations and reduce the number of authoritative name servers to remove remote campuses and ensure they are all reachable.  Doug Knight (PAWS) and Frank Forque (UAA) will work on it throughout the week.

Posted Jul 31, 2024 - 10:56 AKDT

Resolved
This incident has been resolved and service has been fully restored. Thank you for your patience and enjoy the weekend!
Posted Jul 13, 2024 - 19:15 AKDT
Update
A workaround has been implemented and SSO services have been restored. Some devices and services may still be impacted until the service is fully repaired, and we're continuing to investigate the root cause.
Posted Jul 13, 2024 - 18:30 AKDT
Update
We are continuing to troubleshoot the issue and attempting to restore service. Thank you for your patience.
Posted Jul 13, 2024 - 17:18 AKDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 13, 2024 - 14:41 AKDT
Investigating
We are currently investigating reports of issues with sign-in and authentication. Thank you for your patience as we work to restore service.
Posted Jul 13, 2024 - 12:50 AKDT
This incident affected: UA Blackboard Learn (UA Blackboard Learn Web Application), Alaska.edu website (Alaska.edu Edit Server), UA Google Apps (Google Apps Single Sign On), Accounts & Accesses, Other IT Services, and Banner (UAOnline, Banner 9).