UAOnline, Banner, Zoom & ELMO Authentication Outage
Incident Report for OIT Services
Postmortem

Event Occurrence: October 19, 2020 12:57pm to 6:44pm

Background

Authentication services that use a UA username and password are configured to use UA’s Active Directory (AD) infrastructure. UAOnline, ELMO, Zoom, Banner, People, and several others authenticate against UA AD either directly or indirectly. UA AD has regionally based servers in Fairbanks as well as in other locations throughout the state. Services in Fairbanks are set to authenticate against the closest regional services by specifying either a load-balanced instance of AD or a specific server. Many applications also use UA AD as a data repository for account-related data.

Breakdown of the Problem

On October 19th at 1:17pm, one of the Fairbanks UA AD servers displayed high CPU load. Services that connected to the host directly or through the load balancer started to see slowdowns and LDAP errors. It was determined that the host should be restarted to attempt to clear the issue, and while it restarted, some services were reconfigured to point to a secondary instance. This was completed at approximately 2:51pm. Over the next 2 hours, the secondary instance started to exhibit the same high CPU load, and a large number of connections were identified. The second server was rebooted and its load was redistributed to the first server and a third one. Both of these servers then exhibited the same problem, and connection issues continued. At 6:20pm a technician noticed that restarting the service for people.alaska.edu resulted in the CPU load on one of the servers dropping close to 90%. Soon after, both of the front-end servers for people.alaska.edu were stopped and authentication services appeared to be working as expected. Post-incident analysis determined that the load from this application was 56 times higher than its average use over the previous 4 days.
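The comparison against a 4-day average suggests a simple check that could flag this kind of spike earlier. The following is a minimal sketch only, not the tooling used in this incident: the function names, the sample counts, and the alert threshold of 10 are all hypothetical.

```python
from statistics import mean


def load_ratio(current_count, baseline_counts):
    """Return the current load as a multiple of the average baseline load."""
    baseline = mean(baseline_counts)
    if baseline == 0:
        raise ValueError("baseline average is zero; ratio is undefined")
    return current_count / baseline


def is_anomalous(current_count, baseline_counts, threshold=10.0):
    """Flag the current load when it reaches `threshold` times the baseline.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    return load_ratio(current_count, baseline_counts) >= threshold


# Hypothetical query counts for the previous 4 days (not real incident data).
baseline_days = [1000, 1200, 900, 900]
today = 56000

print(load_ratio(today, baseline_days))
print(is_anomalous(today, baseline_days))
```

A check like this only helps if per-application request counts are already being collected, which is itself an assumption about the monitoring in place.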

Target State / Goal 

Authentication services for UA should be fault tolerant and available 24x7 to support the services that UA provides and subscribes to.

Root Cause Analysis 

At 12:57pm a large number of queries to the phonebook service people.alaska.edu began arriving from outside the university, apparently as part of a large-scale attempt to harvest data. The requests were generally a first name searched against the student fork of the directory. These are the least efficient queries to serve, and they began to stack up on the service and the UA AD server. This load from the phonebook service migrated between the UA AD Fairbanks servers as mitigation attempts were made. The excessive number of connections intermittently exhausted the UA AD servers, so they responded only intermittently during this period.

Develop Countermeasures 

  • Temporarily leave the people.alaska.edu service down
  • Evaluate how to mitigate the impact of large-scale harvesting on this application.
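
One mitigation that could be evaluated under the second item is per-client rate limiting in front of the directory search. The sketch below illustrates a token-bucket limiter keyed by a client identifier such as a source IP address. It is an assumption about one possible approach, not a decision recorded in this report; the class names, the rate, and the capacity are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self._now = now  # injectable clock, which makes the limiter testable
        self.updated = now()

    def allow(self):
        current = self._now()
        elapsed = current - self.updated
        self.updated = current
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keep one token bucket per client identifier (for example, source IP)."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._now = now
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self._now)
            self.buckets[client_id] = bucket
        return bucket.allow()


# Hypothetical limits: 1 search per second with a burst of 3 per client.
limiter = PerClientLimiter(rate=1.0, capacity=3)
print([limiter.allow("203.0.113.7") for _ in range(5)])
```

Requests rejected by the limiter never reach UA AD, so an expensive search pattern from a single source cannot stack up connections on the directory servers. This sketch keeps every bucket in memory indefinitely, and a harvester spread across many addresses would need additional controls.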

Implementation of Countermeasures

  • October 19, 2020 - Temporarily prevent use of this feature until the second countermeasure can be implemented.
  • October 20, 2020 - Discuss and determine future countermeasures for the people.alaska.edu application’s use of the UA AD server.

Follow Up / Review

  • Temporarily leave the people.alaska.edu service down - Completed October 19, 2020. 
  • Determine the future countermeasures. - Follow up October 30, 2020
Posted Oct 20, 2020 - 13:34 AKDT

Resolved
The specific issue was verified shortly after this incident entered the monitoring state.
Posted Oct 20, 2020 - 13:32 AKDT
Monitoring
We've implemented a fix to address the login errors and the only non-operational service is the public employee directory, people.alaska.edu. We'll continue monitoring it throughout the night, and more information will be published here once that service is restored.
Posted Oct 19, 2020 - 19:01 AKDT
Update
We are still investigating reports of errors with multiple critical UA services. Thank you for waiting as we continue fixing this issue.
Posted Oct 19, 2020 - 18:20 AKDT
Investigating
We are currently investigating reports of login issues with critical UA services. Thank you for your patience as we look into this.
Posted Oct 19, 2020 - 17:09 AKDT
This incident affected: Accounts & Accesses, Banner (UAOnline, Banner 9), and Zoom (Zoom Meetings).