UAOnline, Banner, & ELMO Outage
Incident Report for OIT Services
Postmortem

Event Occurrence: October 12/23, 2020, November 10, 2020, and prior related incidents

Background

The Banner 9 Admin pages are the primary interface to the Banner ERP and support several business functions within the university, including finance data, payroll, and the setup of courses for coming semesters. The pages are hosted across multiple load-balanced servers. The load balancer pools were set up when Banner 9 was first launched and still only minimally used.

Break Down of the Problem

Over the past several months, the Banner 9 Admin pages have produced various error messages, including “Service Invocation Error” and “T4CConnection error”, along with extreme loads on the frontend servers. These problems have been reported through both the official and unofficial channels for tracking major service impact.

Target State / Goal 

The Banner 9 Admin pages are primarily used during the business day, between 7am and 10pm, and should be available during those hours unless downtime is scheduled through stakeholder representatives. Generally, these administrative pages should be available 24/7 but are not required to be.

Root Cause Analysis 

The load balancer pool was found to be configured so that session persistence timed out after 3 minutes of connectivity, after which end users could land on a server where they had no session. The new session required another database socket and could throw an error indicating that the session no longer existed for that user. This continual switching between hosts in some cases exhausted database resources or fully loaded the servers. In addition, some mitigation attempts removed one of the server nodes from the load balancer, sending all traffic to a single node; when the removed node was not added back, the result was higher server load after a few days of use.
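The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual load balancer configuration: the function `simulate`, its parameters, the one-request-per-minute traffic pattern, and the round-robin choice of host after persistence expires are all hypothetical. It counts how many sessions a single user opens over 30 minutes, assuming each arrival on a host that does not hold the user's session opens a new one (and with it another database socket).

```python
import itertools


def simulate(persistence_timeout_s, duration_s=1800, step_s=60, n_backends=2):
    """Count the sessions one user opens while sending a request every step_s seconds.

    Assumptions (hypothetical, for illustration only):
    - persistence expires persistence_timeout_s after the user was pinned to a host;
    - after expiry, the next host is chosen round-robin;
    - arriving on a host that does not hold the user's session opens a new session.
    """
    hosts = itertools.cycle(range(n_backends))
    pinned_host = None    # host the load balancer currently sends the user to
    pinned_at = None      # time at which that pinning was made
    session_host = None   # host that currently holds the user's session
    sessions_opened = 0

    for now in range(0, duration_s, step_s):
        # Re-pin the user when there is no pinning yet or it has expired.
        if pinned_host is None or now - pinned_at >= persistence_timeout_s:
            pinned_host, pinned_at = next(hosts), now
        # Landing on a host without the user's session forces a new session.
        if session_host != pinned_host:
            session_host = pinned_host
            sessions_opened += 1
    return sessions_opened


three_minute = simulate(persistence_timeout_s=180)       # short persistence window
all_day = simulate(persistence_timeout_s=24 * 3600)      # persistence outlasts the visit
print(three_minute, all_day)
```

Under these assumptions the 3-minute window makes the user open a new session every time persistence expires, while a window that outlasts the visit keeps the user on one host with a single session, which is the behavior the countermeasure below aims for.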

Develop Countermeasures 

  • Update the production load balancer configurations so that sessions stay persistent on the same host

Implementation of Countermeasures

  • December 4th, 2020 - Update load balancer configurations

Follow Up / Review

  • December 8th, 2020 - Verify the impact with end users to validate the success of the change. Completed December 7th and 8th with positive feedback.
Posted Dec 09, 2020 - 11:22 AKST

Resolved
The outage impacting UAOnline, Banner, & ELMO has been resolved. A postmortem will be posted here.
Posted Oct 23, 2020 - 14:36 AKDT
Update
We are continuing to monitor for any further issues.
Posted Oct 21, 2020 - 08:37 AKDT
Monitoring
A fix has been made and we are currently monitoring the systems.
Posted Oct 20, 2020 - 13:43 AKDT
Investigating
We are currently investigating reports of outages with UAOnline, Banner, and ELMO. Thank you for your patience as we look into this.
Posted Oct 20, 2020 - 11:50 AKDT
This incident affected: Accounts & Accesses and Banner (UAOnline, Banner 9).