
A deep dive into Akamai's outage last week, as seen through Cisco's ThousandEyes

source link: https://itwire.com/guest-articles/guest-opinion/a-deep-dive-into-akamai-s-outage-last-week,-as-seen-through-cisco-s-thousandeyes.html

Tuesday, 22 June 2021 11:15

By Mike Hicks, the Principal Solutions Analyst at Cisco ThousandEyes.

GUEST OPINION: Mike Hicks, the Principal Solutions Analyst at Cisco ThousandEyes, has cast his eyes over Akamai's Prolexic outage last week to analyse what happened, which, as Akamai noted, wasn't due to a cyber-attack but to an inadvertently exceeded routing table value. How did this happen, who was hardest hit, and how can downtime be mitigated? Please read on!

How it happened

The issue originated in Akamai’s DDoS mitigation service, Prolexic Routed. In Akamai’s statement announcing service restoration, it noted that the outage was caused not by a “system update” or “cyber-attack” but by inadvertently exceeding a “routing table value” used by the service.

This is consistent with what ThousandEyes observed. Prolexic is designed to stop DDoS attacks in the cloud before they reach customers’ networks; it does this by advertising the customers’ network prefixes, with their permission, and providing the connection back into the customers’ networks.
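To illustrate the routing mechanics involved, the sketch below implements BGP-style longest-prefix matching, which is how a scrubbing provider's route announcement can attract traffic destined for a customer's network. The prefixes and next-hop labels are hypothetical, chosen purely for illustration:

```python
import ipaddress

def best_route(destination, routing_table):
    """Return the next hop of the most specific (longest-prefix) route
    covering destination, mimicking how routers select among
    overlapping BGP announcements."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (net, next_hop) for net, next_hop in routing_table if dest in net
    ]
    if not candidates:
        return None  # no covering prefix: destination unreachable
    # The longest prefix (largest prefixlen) wins
    return max(candidates, key=lambda r: r[0].prefixlen)[1]

# Hypothetical announcements: the scrubbing provider's more-specific
# prefix attracts traffic that would otherwise follow the direct route.
table = [
    (ipaddress.ip_network("203.0.113.0/24"), "customer-direct"),
    (ipaddress.ip_network("203.0.113.0/25"), "prolexic-scrubbing"),
]
print(best_route("203.0.113.10", table))   # → prolexic-scrubbing
```

The same mechanism explains why a withdrawal matters so much: once the covering prefix disappears from the table, there is simply no route left to select.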

Around 2:20PM AEST, ThousandEyes observed the Prolexic service initially losing connections with its peers and downstream service providers, effectively preventing any traffic from passing through the Prolexic service to the customer destinations. This was followed by a period of route flapping, during which the downstream providers were unable to get updated path information for the destination customer networks and tried to reach the target destinations through a number of pre-calculated alternate paths, before eventually withdrawing the routes, leaving the destination customer networks unreachable.
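The flapping behaviour described above can be sketched as a simple detector: a prefix whose announce/withdraw state changes too many times within a short window is flagged as unstable. The window and threshold here are arbitrary assumptions, not values used by ThousandEyes or Akamai:

```python
from collections import deque

def is_flapping(events, window=5, max_transitions=3):
    """Flag a prefix as flapping when its announce/withdraw state changes
    more than max_transitions times within the last `window` events.
    events: iterable of "announce" / "withdraw" strings in arrival order."""
    recent = deque(maxlen=window)
    for state in events:
        recent.append(state)
        # Count state changes between consecutive events in the window
        snapshot = list(recent)
        transitions = sum(1 for a, b in zip(snapshot, snapshot[1:]) if a != b)
        if transitions > max_transitions:
            return True
    return False

print(is_flapping(["announce", "withdraw", "announce",
                   "withdraw", "announce"]))  # → True
```

Real BGP flap-damping implementations work per-prefix with decaying penalties, but the core idea is the same: rapid oscillation is treated differently from a clean announce or withdraw.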

Who was hardest hit?

Prolexic is deployed in two modes: an on-demand mode, where traffic is redirected only when an attack is identified, and an always-on mode, where all traffic is routed through the Prolexic infrastructure. It was the always-on deployments that were impacted by this outage. As banks typically rely on always-on DDoS mitigation services, they were among the hardest hit.

The duration of the outage and its impact on customers were influenced by a number of factors. Firstly, the time of day the outage occurred meant that it predominantly fell outside normal business hours for the U.S., but in the middle of the working day for Australia. The second factor was how quickly each customer was able to bypass the Prolexic environment to restore connectivity and services for its own users.

Organisations that had an automated failover system in place, designed to trigger redundancy processes such as re-advertising prefixes, were able to recover connectivity within minutes. Organisations that had to rely on manual intervention took longer to propagate the changes needed to re-establish connectivity to their networks via alternate paths.
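As a rough sketch of the automated approach, failover can be gated on consecutive failed reachability probes through the primary (mitigation) path. The threshold, and the actual re-advertisement step it would trigger, are assumptions; the real action is router- or provider-API-specific:

```python
def evaluate_failover(probe_results, failure_threshold=3):
    """Decide whether to trigger failover based on consecutive failed
    reachability probes through the primary path.
    probe_results: iterable of booleans, True = probe succeeded."""
    consecutive_failures = 0
    for ok in probe_results:
        if ok:
            consecutive_failures = 0  # any success resets the counter
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                # In practice this is where prefixes would be
                # re-advertised via the backup path.
                return "failover"
    return "hold"

print(evaluate_failover([True, False, False, False]))  # → failover
```

Requiring several consecutive failures, rather than reacting to a single lost probe, is what keeps an automated system from flapping between paths on transient packet loss.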

Ways to mitigate downtime

Organisations that were able to react the quickest to the Akamai outage were those with backup plans in place. Some of those plans were automated, others manual, and they involved measures such as re-advertising the impacted networks under a different network prefix that wasn’t itself affected by the outage, thereby restoring connectivity. Ultimately, you need a backup plan for when outages inevitably happen, and it should include visibility into early warning indicators so you know if and when to activate backup procedures.

Other mitigation efforts include diversifying delivery services for web content, i.e. adding redundancy, as well as making sure to understand all of the third-party dependencies that can impact customers’ web and app experience.

In the Prolexic outage, another set of organisations impacted was those that had no direct relationship with Akamai but relied on backend systems and third-party services that depend on Prolexic to route traffic to those backend environments. Payment processing is a good example of the kind of service that runs on multiple hidden providers.

For these organisations, everything seemed to be operating normally since the main website and application were available, but when a backend process was called on (for example, a payment request on an e-commerce site), the service simply timed out. A lack of visibility into all the associated dependencies meant that some organisations were not immediately aware they were impacted and, as a result, took longer to both identify and fix the issue.
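One way to surface such hidden failures is to run every third-party dependency health check with an explicit timeout, so a hung backend reports as a timeout instead of stalling silently. This is a minimal sketch with hypothetical dependency names:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def check_dependencies(checks, timeout=2.0):
    """Run each dependency health check concurrently, with a timeout.
    checks: dict mapping dependency name -> zero-arg callable that
    returns True (healthy) or False (failing)."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                results[name] = "ok" if fut.result(timeout=timeout) else "failing"
            except FutureTimeout:
                results[name] = "timeout"   # hung dependency surfaced
            except Exception:
                results[name] = "error"     # check itself crashed
    return results

# Hypothetical dependencies for illustration only.
print(check_dependencies({
    "payments": lambda: True,
    "fraud-screening": lambda: False,
}))
```

Running the checks in parallel matters here: one slow provider should delay the report by at most the timeout, not hold every other check hostage.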

Modern applications and services are built on a multitude of best-in-breed functionality that runs on different systems and processes to execute transactions and requests. To make sure customer experience is never compromised, it’s critical to see and understand all of the components and dependencies that have the power to take down your service.

Redundancy yes, but will one size fit all?

Backing up the backup is certainly a valid approach in some cases, but in others the extra expenditure might not be justified by the ultimate cost of downtime. Taking an Occam’s razor approach, the best option is likely automated redundancy, rather than paying for two services to run in parallel or paying for a second DDoS mitigation service just to sit on standby.

Fundamental to any diversification approach is the visibility to make informed decisions. By continuously evaluating the availability and performance of your service delivery, you can maintain proactive awareness of potential issues, respond and resolve quickly, and plan accordingly for the future.
