T-Mobile screwups caused nationwide outage, but FCC isn’t punishing carrier

T-Mobile advertisement in New York City’s Times Square on October 15, 2020.
Getty Images | SOPA Images

The Federal Communications Commission has finished investigating T-Mobile for a network outage that Chairman Ajit Pai called “unacceptable.” But instead of punishing the mobile carrier, the FCC is merely issuing a public notice to “remind” phone companies of “industry-accepted best practices” that could have prevented the T-Mobile outage.

After the 12-hour nationwide outage on June 15 disrupted texting and calling services, including 911 emergency calls, Pai wrote that “the T-Mobile network outage is unacceptable” and that “the FCC is launching an investigation. We’re demanding answers—and so are American consumers.”

Pai has a history of talking tough with carriers and not following up with punishments that might have a greater deterrent effect than sternly worded warnings. That appears to be what happened again yesterday when the FCC announced the findings from its investigation into T-Mobile. Pai said that “T-Mobile’s outage was a failure” because the carrier didn’t follow best practices that could have prevented or minimized it, but he announced no punishment. The matter appears to be closed based on yesterday’s announcement, but we contacted Chairman Pai’s office today to ask if any punishment of T-Mobile is forthcoming. We’ll update this article if we get a response.

FCC details T-Mobile mistakes

The staff-investigation report identified several mistakes made by T-Mobile during the outage, which began as T-Mobile was installing new routers in the Southeast US. When a fiber transport link in the region failed, T-Mobile’s network should have transferred traffic across a different link. But the carrier “had misconfigured the weight of the links to one of its routers,” which “prevented the traffic from flowing to the new active router as intended.” T-Mobile hadn’t implemented any fail-safe process to prevent the misconfiguration or to alert network engineers to the problem.
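To illustrate how a single link weight can steer traffic away from the intended path, here is a minimal sketch of lowest-cost route selection, the general idea behind weight-based routing protocols such as Open Shortest Path First. The topology, router names, and weight values are hypothetical and are not taken from the FCC report; the sketch only shows that one wrong weight is enough to make a router prefer a path the engineers did not intend.

```python
import heapq

def shortest_path(links, src, dst):
    """Return (cost, path) for the lowest-weight route, or (inf, []) if none."""
    graph = {}
    for (a, b), weight in links.items():
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical topology after the primary fiber link has already failed.
intended = {
    ("market", "new_router"): 10,      # path engineers expect traffic to take
    ("market", "legacy_router"): 50,   # less-preferred alternative
    ("new_router", "core"): 10,
    ("legacy_router", "core"): 10,
}

# Same topology, but one link weight was entered incorrectly (assumed value).
misconfigured = dict(intended)
misconfigured[("market", "new_router")] = 100

intended_route = shortest_path(intended, "market", "core")
actual_route = shortest_path(misconfigured, "market", "core")
print(intended_route)
print(actual_route)
```

With the intended weights, the route runs through `new_router`; with the single bad weight, the same algorithm picks `legacy_router` instead. A pre-deployment check that compares computed routes against expected ones is one way such a mismatch could be caught.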

The Atlanta market “became isolated” from the rest of the network, causing all LTE users in the area to lose connectivity. A software error made things worse by preventing mobile devices in the Atlanta area from re-registering with the IP Multimedia Subsystem over Wi-Fi. Instead of routing device-registration attempts to a different node, “the registration system repeatedly routed re-registration attempts for each mobile device to the last node retained in its records, which was unavailable due to the market isolation.”
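The difference between the behavior the FCC describes and a retry policy that moves on to another node can be sketched in a few lines. This is a simplified model, not T-Mobile's registration software: the node names, the `nodes_up` availability map, and both function names are hypothetical.

```python
def sticky_register(nodes_up, last_node, attempts):
    """Flawed policy: every retry goes to the last node on record."""
    for _ in range(attempts):
        if nodes_up.get(last_node, False):
            return last_node
    return None  # all retries hit the same unavailable node

def failover_register(nodes_up, last_node, attempts):
    """Alternative policy: try the last node first, then each other node."""
    candidates = [last_node] + [n for n in nodes_up if n != last_node]
    for node in candidates[:attempts]:
        if nodes_up.get(node, False):
            return node
    return None

# Hypothetical state: the node serving the isolated market is unreachable.
nodes_up = {"atlanta_node": False, "dallas_node": True, "chicago_node": True}

sticky_result = sticky_register(nodes_up, "atlanta_node", attempts=5)
failover_result = failover_register(nodes_up, "atlanta_node", attempts=5)
print(sticky_result, failover_result)
```

In this model the sticky policy never succeeds no matter how many retries it gets, while the failover policy registers on its second attempt. Repeated failed retries from many devices at once are what the report later calls a registration storm.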

The software error had existed in T-Mobile’s network for months. “This software error likely did not cause problems before this outage occurred because the outage was the first notable market isolation since T-Mobile integrated this software into its network,” the FCC said. Regular testing “could have discovered the software flaw and routing misconfiguration before they could impact live calls,” the FCC also said.

After the trouble on June 15 began, T-Mobile engineers “ended up exacerbating [the outage’s] impact because they misdiagnosed the problem.” The FCC report continued:

T-Mobile believed that the fiber transport link that failed earlier in the day was continuing to cause the ongoing outage. Acting on this belief, T-Mobile manually shut down the link in an attempt to transfer traffic away from it. Due to the still-misconfigured Open Shortest Path First weights, however, these steps recreated the outage’s initial conditions. LTE customers in the Atlanta market were again disconnected from the LTE network and forced to establish calls over Wi-Fi, and their registration attempts again failed and created a registration storm that added further congestion to T-Mobile’s IP Multimedia Subsystem.

T-Mobile engineers almost immediately recognized that they had misdiagnosed the problem. However, they were unable to resolve the issue by restoring the link because the network management tools required to do so remotely relied on the same paths they had just disabled. When T-Mobile engineers were able to access the equipment on site and correct their mistake by restoring the link an hour later, customers in the Atlanta market were again able to attempt to register to VoLTE [Voice over LTE]. However, this again created additional congestion because T-Mobile engineers had not yet addressed the software error that prevented registrations from completing.

Outage goes nationwide

The FCC report explained how the outage spread from the Atlanta market, going nationwide. External traffic destined for the Atlanta system was redirected to other regions, which “created enough congestion in those registration systems to cause the T-Mobile network to send the registration attempts to other nodes. The software error again routed re-registration attempts to the last node on record, which was likely already experiencing severe congestion.” Shortly after, “IP Multimedia Subsystem, VoLTE, and Voice over Wi-Fi registrations began to fail nationwide.”

The vast majority of T-Mobile customers were unable to connect to Voice over LTE or Voice over Wi-Fi networks and thus “fell back to T-Mobile’s 3G and 2G circuit-switched networks to make and receive calls while the device continued its registration attempts to the VoLTE network.” This resulted in 3G and 2G congestion, causing many phone calls to fail. Network nodes continued to hold resources for these call sessions after the calls terminated, overwhelming the nodes’ computing resources and causing even more call failures.

911 calls can typically be made even when mobile devices can’t complete registration with the IP Multimedia Subsystem, but in this case, 911 was affected by the 3G and 2G network congestion “because the same network nodes that choose gateways for calls destined for 3G and 2G networks also choose gateways for 911 calls. When those nodes’ computing resources became overwhelmed by abandoned call sessions’ resource reservations, it also caused many 911 calls to fail,” the FCC said.

T-Mobile told the FCC that 23,621 calls to 911 didn’t reach public safety answering points [PSAPs] due to congestion during the outage. Another 111,253 emergency calls were successfully completed. Including both 911 and non-emergency calls, at least 41 percent of calls on T-Mobile’s network failed during the outage, the FCC said.
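For a rough sense of scale, the two 911 figures can be combined into a failure rate. This quick calculation assumes the failed and completed counts together account for every 911 call attempted during the outage, which the report summary above does not state explicitly.

```python
failed_911 = 23_621       # calls that did not reach a PSAP
completed_911 = 111_253   # calls that were successfully completed

# Assumption: these two counts cover all 911 attempts during the outage.
total_911 = failed_911 + completed_911
failed_share = failed_911 / total_911

print(f"{total_911:,} attempts, {failed_share:.1%} failed")
```

Under that assumption, about 17.5 percent of 911 calls failed, a noticeably lower rate than the "at least 41 percent" figure for all calls.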

This could have been avoided or minimized if T-Mobile had implemented “reasonable 911 network monitoring,” which “would have revealed to T-Mobile in real time that the outage was causing call blocking on PSAP administrative lines,” the FCC said.

T-Mobile has since corrected technical problems identified due to the outage and made other changes to prevent or reduce the severity of future outages, the commission report said.

Hey T-Mobile—please don’t do that again

In a press release yesterday, Pai again criticized T-Mobile. “T-Mobile’s outage was a failure,” Pai said. “Our staff investigation found that the company did not follow several established network reliability best practices that could have either prevented the outage or at least mitigated its impact. All telecommunications providers must ensure they are adhering to relevant industry best practices, and I encourage network reliability standards bodies to apply their expertise to the issues identified in this report for further study.”

Despite that, Pai announced no punishment.

“In keeping with past practice, the [FCC’s Public Safety and Homeland Security] Bureau plans to release a Public Notice, based on its analysis of this and other recent outages, reminding companies of industry-accepted best practices, including those recommended by the FCC’s Communications Security, Reliability, and Interoperability Council, and their importance,” the FCC said. “In addition, the Bureau will contact major transport providers to discuss their network practices and offer assistance to smaller providers to help ensure that our nation’s communications networks remain robust, reliable, and resilient.”

This is similar to what happened last year when an FCC investigation into mobile carriers’ response to Hurricane Michael in Florida found that carriers failed to follow their own voluntary roaming commitments, unnecessarily prolonging outages. Pai called the carriers’ responses to the hurricane “completely unacceptable” but imposed no punishment related to the bad hurricane response and continued to rely on voluntary measures to prevent recurrences.

Pai’s FCC also let Verizon, T-Mobile, and US Cellular off without any punishment after finding they exaggerated their 4G coverage in official filings. Pai has proposed fines for AT&T, Verizon, T-Mobile, and Sprint to punish the carriers’ illegal sales of phone-location data, but the penalties of $12 million to $91 million per carrier were criticized by Democrats as not big enough relative to the harm to consumers.

https://arstechnica.com/?p=1716835