Wireless Networks Outage
Incident Report for OIT Services
Postmortem

Wireless Network Outage

Problem Impact Analysis

Event Occurrence: 2/28/24

Background

Wireless networks at the University of Alaska Fairbanks, including UAlaska, eduroam, UAGuest, and DeviceNet, are served by our Cisco 8540 series Wireless LAN Controller (WLC) High Availability (HA) pair located in the Butrovich Data Center. For the "FlexConnect" wireless networks (eduroam, UAlaska, and DeviceNet, as well as UAGuest in residential areas), the WLC handles only user authentication; user network traffic is switched locally in the building where the client is located. For the UAGuest wireless network in non-residential areas, the WLC handles both user authentication and central switching of user network traffic.

Break Down of the Problem

On February 28, 2024, shortly after 4:30 pm, OIT Telecommunications Services personnel noticed, and then began receiving reports, that users were unable to associate or authenticate to wireless networks on the UAF campus. Initial investigation found the WLC in the Butrovich Data Center inaccessible. Users already associated to the UAlaska, eduroam, or DeviceNet SSIDs stayed connected and continued to pass traffic, but users without an existing association could not associate. UAGuest users outside of residential buildings were unable to connect at all.
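The observed impact follows from which functions each network depends on the WLC for. The sketch below is a simplified illustrative model, not a description of the controller's actual behavior or configuration. The names `Switching`, `WirelessService`, `SERVICES`, and `impact_when_wlc_unreachable` are hypothetical. The report does not state what already-connected UAGuest clients in residential areas experienced; treating them like the other locally switched networks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Switching(Enum):
    LOCAL = "local"      # FlexConnect: traffic is switched in the client's building
    CENTRAL = "central"  # traffic is switched centrally by the WLC


@dataclass(frozen=True)
class WirelessService:
    ssid: str
    area: str
    switching: Switching


# Services as described in the Background section of this report.
SERVICES = [
    WirelessService("UAlaska", "all", Switching.LOCAL),
    WirelessService("eduroam", "all", Switching.LOCAL),
    WirelessService("DeviceNet", "all", Switching.LOCAL),
    # Assumption: residential UAGuest behaves like the other local networks.
    WirelessService("UAGuest", "residential", Switching.LOCAL),
    WirelessService("UAGuest", "non-residential", Switching.CENTRAL),
]


def impact_when_wlc_unreachable(service: WirelessService,
                                already_associated: bool) -> str:
    """Return the expected client experience while the WLC is unreachable."""
    if service.switching is Switching.CENTRAL:
        # Both authentication and the data path run through the WLC.
        return "no service"
    if already_associated:
        # The data path is local, so existing sessions keep passing traffic.
        return "stays connected"
    # New clients still need the WLC to authenticate.
    return "cannot associate"


if __name__ == "__main__":
    for svc in SERVICES:
        for associated in (True, False):
            result = impact_when_wlc_unreachable(svc, associated)
            print(f"{svc.ssid:10} {svc.area:16} associated={associated}: {result}")
```

Under this model, only clients that were already associated to a locally switched network keep working, which is consistent with the reports received during the outage.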

Target State / Goal 

UAlaska, eduroam, UAGuest and DeviceNet wireless SSIDs should remain available for association, authentication and normal network traffic at all times, aside from scheduled maintenance periods.

Root Cause Analysis 

Upon investigation, OIT Telecommunications Services personnel discovered that the port channels on the Butrovich Data Center Nexus switching infrastructure that connect to the Cisco 8540 WLC HA pair had shut down. BPDU Guard had detected BPDUs (bridge protocol data units) on the VLAN in which our centrally switched UAGuest wireless clients are placed. Further investigation revealed that a researcher had been attempting to connect a piece of equipment to the UAGuest wireless network through an Ethernet-to-wireless bridging unit, for testing prior to deployment at a remote location. During these attempts, the researcher connected the bridging unit to the UAGuest network while it was also plugged into the wired network in their building. This bridged the building network back to the centrally switched network for UAGuest wireless clients in the Butrovich Data Center. The loop prevention protocols on the Data Center network then took effect and shut down the port channels to the WLC, resulting in this outage.
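The shutdown mechanism can be illustrated with a minimal sketch of how a BPDU Guard protected port behaves in general: a port that should never see spanning-tree traffic disables itself when a BPDU arrives and stays down until it is reset. The `Port` class and its methods are hypothetical and exist only for illustration; real switches implement this in their own software, and the actual Nexus configuration is not shown in this report.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Toy model of a switch port with optional BPDU Guard."""
    name: str
    bpdu_guard: bool = True
    state: str = "up"  # either "up" or "err-disabled"

    def receive(self, frame_type: str) -> bool:
        """Handle an incoming frame; return True if the port accepts it."""
        if self.state != "up":
            # A disabled port accepts nothing, including ordinary data.
            return False
        if frame_type == "bpdu" and self.bpdu_guard:
            # A BPDU on a guarded port suggests another bridge is attached,
            # so the port shuts itself down to prevent a loop.
            self.state = "err-disabled"
            return False
        return True

    def reset(self) -> None:
        """Manually return the port to service."""
        self.state = "up"


if __name__ == "__main__":
    port = Port("port channel to WLC (illustrative)")
    print(port.receive("data"))   # ordinary client traffic is accepted
    print(port.receive("bpdu"))   # a bridged BPDU trips the guard
    print(port.state)
    print(port.receive("data"))   # all traffic is now dropped
    port.reset()
    print(port.receive("data"))   # service returns after a reset
```

The model shows why a single bridged device affected every client behind the port: once the guard trips, the port drops all traffic, not only the offending frames.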

Develop Countermeasures 

Network loop prevention protocols are vital to the continued function of the network, including the Butrovich Data Center. This was the first incident of its type seen in many years of operating a WLC-based wireless implementation, and it is not representative of normal usage of the UAGuest wireless network. User education regarding the cause, as well as the scope and severity of its consequences, was therefore deemed the most appropriate way to prevent recurrence.

Implementation of Countermeasures

OIT Telecommunications Services personnel reset the port channels connecting the WLC in the Butrovich Data Center, which returned the WLC to normal service. Personnel also followed up with the researcher who had inadvertently bridged the networks and caused this interruption. The researcher was helpful in determining the exact situation that resulted in this incident and understands, to an adequate degree, the mechanism by which their attempts caused the service interruption. The researcher also understands the severity of the impact, will avoid making the same error again, and will seek OIT assistance when attempting to accommodate similar needs in the future.

Follow Up / Review

Should further incidents of this nature occur, the decision to rely on user education rather than technical implementations may be reviewed and reconsidered.

Posted Mar 13, 2024 - 10:34 AKDT

Resolved
This incident has been resolved.
Posted Mar 01, 2024 - 09:50 AKST
Monitoring
We experienced a temporary outage of our wireless services. We were able to restore functionality and are monitoring these services this evening.
Posted Feb 28, 2024 - 17:26 AKST
This incident affected: UA Network Connectivity (UAF Wireless).