VPN and Remote Campus Internet Disruptions
Incident Report for OIT Services
Postmortem

Background

The University of Alaska receives network services from ACS.  The ISP provides the long haul WAN circuits that comprise the University of Alaska’s core network from/to Fairbanks, Anchorage, Juneau, Seattle, and Portland.  The circuits from Fairbanks <-> Seattle and Anchorage <-> Portland comprise the University’s connection to Internet2, AWS, and some commodity internet sites.

Breakdown of the Problem

On October 13th around 1000, one of ACS’ fibers that supply Wide Area Network (WAN) connectivity (to Anchorage and Juneau) and the Commodity Internet (CI) connection to the University of Alaska Fairbanks Main Campus began generating faults and errors. The resulting lost and delayed packets degraded internet performance for the vast majority of users at UAF and for consumers of UAF and UA Statewide Fairbanks Information Technology services (including Data Center resources).

Target State / Goal 

Restore services as soon as possible.

Root Cause Analysis 

One of the two ACS fibers supplying WAN and CI to the Fairbanks Main Campus became severely degraded, resulting in poor performance and lost packets. The exact problem was not immediately clear. The high number of errors overwhelmed the ACS router, which compromised error reporting by the device. ACS believed that the device needed a reboot to clear the reporting errors and what was assumed at the time to be a software bug in the ACS router. An emergency maintenance outage was published for Oct 14th at 0100. During this window the ACS router was rebooted; this cleared the reporting errors but did not fix the underlying issue.

Once proper error reporting was restored, ACS engineers could more clearly see the errors on one of the fibers feeding the campus. To stabilize services, this fiber was disabled, rerouting all traffic onto the single remaining fiber. This restored UA to proper functional status until fiber repairs could be accomplished. Full service was restored Oct 21st at 0100 when ACS and UA engineers re-enabled the repaired fiber and tested functionality.

Develop Countermeasures 

This issue was caused by a hardware fault in ACS’ gear, so there are few countermeasures we can take to prevent it from happening again, short of provider diversification, which has historically been cost-prohibitive.

Posted Nov 03, 2021 - 10:27 AKDT

Resolved
ACS has completed their maintenance and no more interruptions are expected.
Posted Oct 20, 2021 - 10:18 AKDT
Monitoring
ACS is continuing to troubleshoot their circuit to the University network. We've implemented changes to maintain stable connections while they resolve issues with the original circuit. Thank you for your patience.
Posted Oct 14, 2021 - 13:45 AKDT
Identified
University of Alaska engineers have engaged with ACS support and believe the root cause of today's internet issues stems from a malfunctioning ACS device. ACS technicians will be performing emergency maintenance to remediate the issue tonight between 1:00 AM and 3:00 AM.

All traffic to/from Statewide Fairbanks, UAF Campus, and UAF affiliated remote sites will experience a network outage lasting around 10 minutes during the active part of the maintenance window.
Posted Oct 13, 2021 - 16:07 AKDT
Investigating
We are currently investigating reports of slow or non-functional internet for VPN users and remote campus users. Thank you for your patience as we look into this.
Posted Oct 13, 2021 - 12:03 AKDT
This incident affected: UA Network Connectivity (Statewide Network Connectivity, UAF/SW VPN).