How We Found Azure’s Unannounced Breaking Change

Nikko Campbell

Although APIs are the lifeblood of so many software applications today, the impact on dependent applications when an API breaks is often overlooked. And while full outages have a more widespread and obvious impact across a user base, breaking changes to an API's contract can be just as damaging, and much more difficult to resolve.

This was the case when Azure published a breaking change that affected downstream systems like ours without warning its users. The cause was not clear at first, and we at Metrist had to investigate using our own data on third-party application reliability to determine the source of the issue.

In this article, we'll walk through the data surfaced by our own product and how we determined that Azure's breaking change was the source of our partial outage.

Background

First, a bit of background on how we use Azure CosmosDB, and how we obtained the debugging data that showed Azure had published a breaking change without telling their customers. Metrist monitors cloud dependencies, such as Azure CosmosDB, by continuously running scripted tests that exercise the product's functionality to simulate what a user would do. Part of that functional testing involves using the product's API.
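
To make that concrete, here is a minimal, illustrative sketch of what a scripted check step looks like in spirit. The class and method names below are hypothetical examples, not Metrist's actual implementation.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class SyntheticCheck
{
    // Runs one scripted step (e.g. "create a container"), timing it and recording
    // whether it succeeded so that downstream outage heuristics can act on the result.
    public static async Task<(bool Ok, TimeSpan Elapsed, string Error)> RunStepAsync(Func<Task> step)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await step();
            return (true, stopwatch.Elapsed, null);
        }
        catch (Exception ex)
        {
            // On failure, keep the message; this is where an error like
            // "One of the specified inputs is invalid" would surface.
            return (false, stopwatch.Elapsed, ex.Message);
        }
    }
}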

Azure published a breaking change that made a previously optional field, one we had considered irrelevant, mandatory, creating errors in our monitoring. This was the start of how we identified the unannounced breaking change and worked out how to resolve the issue.

Discovery of the Outage

We first noticed the problem on September 14th at 18:42 UTC, when our Azure CosmosDB service monitor reported a partial outage. Our tests began returning errors while attempting to create a database container in Azure's East US region; not every attempt failed, but enough did to trigger Metrist's heuristics that something was wrong, and we reported to our users that the product was not functioning correctly, as seen below.

[Screenshot: Metrist reporting a partial outage for Azure CosmosDB in the East US region]

Here is how we saw the issue unfold: 

  • The first error came in at 18:42 UTC on September 14th
  • At 19:12 UTC, we declared the service to be partially down because we could not continue to successfully create containers in the East US region
  • Following the first error on the 14th, 75% of test runs failed through the remainder of the day.
  • From September 15th through September 20th, Metrist continued to report the service as partially down, with 90% of the 140 daily tests in the East US region failing each day.
  • On September 19th at 19:13 UTC, the first error while attempting to create a container in the Central US region was logged, with occasional successful tests over the next few hours until 100% of tests were failing by the start of September 20th (UTC).
  • On September 21st, 100% of test runs were failing in the East US region.
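
For intuition, here is a simplified sketch of the kind of error-rate threshold such heuristics can be built on. The window and threshold below are made-up values for illustration, not Metrist's actual logic.

using System.Collections.Generic;
using System.Linq;

public static class PartialOutageHeuristic
{
    // Declares a service "partially down" in a region when the failure rate across
    // the most recent check results crosses a threshold.
    public static bool IsPartiallyDown(IReadOnlyList<bool> recentResults, double failureThreshold = 0.5)
    {
        if (recentResults.Count == 0) return false;

        double failureRate = recentResults.Count(ok => !ok) / (double)recentResults.Count;
        return failureRate >= failureThreshold;
    }
}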

Using a feature of our product that details errors from our functional tests, we were able to see a log of all the recent errors thrown by the monitor, with CreateContainer specifically showing several errors stating “One of the specified inputs is invalid”.

While this data was not the most helpful on its own, it confirmed that something was wrong with our request, despite there having been no changes to how we made that request in months. Until we saw the error details in Metrist, for all we knew the failures could have been caused by issues in that specific region (we've seen capacity issues in single regions from cloud providers before that prevented us from spinning up a new database or VM).

[Screenshot: Metrist's error log for the CosmosDB monitor, showing "One of the specified inputs is invalid" errors for CreateContainer]

After a few days, this error was happening on every attempt and began affecting the Central US region as well. At that point, the most likely explanation was that Microsoft was slowly rolling out a new change that broke compatibility with our existing code.

Determining the Change

The next step was to figure out what actually changed. Our service monitor code hadn't been touched in several months and is fairly straightforward, using the account and database created in its previous successful steps to create a new SQL container.

public void CreateContainer(Logger logger)
{
   // Add a new SQL container to the database created in the monitor's earlier steps,
   // using the default throughput of 400 RU/s.
   _dbAccount.Update()
       .UpdateSqlDatabase(GetDatabaseName())
       .DefineNewSqlContainer(GetContainerName())
       .WithThroughput(400)
       .Attach()
       .Parent()
       .Apply();
}

Nothing seemed obviously wrong. The next investigative steps were:

  • Updating our outdated SDK – no change. 
  • Determining if allowed throughput values changed – no change.
  • Determining if changing the default value of 400 resulted in a change – no change. 
  • Determining if containers could even be created – no change. 

Although other regions still worked and there was nothing reported on the status page, it was worth trying the same operation through the web console as a sanity check.

[Screenshot: the "New Container" form in the Azure web console]

The Solution 

Filling out the "New Container" form, everything seemed fine until the Partition Key. It was required and had a default of "/id"; surely the SDK would set a similar sane default that Just Worked™.

However, manually setting the partition key with a simple .WithPartitionKey("/id") was all our code needed to start working again. This was all the more confusing given the existence of a PartitionKey.None value, which implies that not setting the value should be an option.
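
For reference, here is roughly how the updated step looked, based on the snippet above (the exact fluent method name or overload may differ between SDK versions):

public void CreateContainer(Logger logger)
{
   _dbAccount.Update()
       .UpdateSqlDatabase(GetDatabaseName())
       .DefineNewSqlContainer(GetContainerName())
       .WithThroughput(400)
       // Explicitly set the now-required partition key that we previously omitted.
       .WithPartitionKey("/id")
       .Attach()
       .Parent()
       .Apply();
}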

Fortunately, this was all we needed to get our service monitor making requests the way CosmosDB suddenly expected in those two regions (plus similar changes to the InsertItem and DeleteItem checks). Still, it took a full morning to debug and fix, all because a previously optional field became required, and the only clue was found in the Azure web portal, in a world of Infrastructure as Code.

Conclusions

All in all, this was a fairly simple fix, but because of the nature of the API, it could have been worse. The more deeply ingrained a given API is into one’s app, the bigger an effect it can have, and the longer it can take to fix.

Could our problem have been avoided? A production deployment of CosmosDB would almost certainly have a partition key set whether or not it was required. But unfortunately, it's not always possible to avoid these kinds of breaking changes.

Following best practices and recommendations from providers and getting alerted to upcoming changes is your best bet to prevent a breaking API change from impacting your application. And at the end of the day, you still need to trust that your cloud dependency providers will take the precautions necessary to avoid service interruption.

At Metrist, a service monitor breaking like this is almost welcome. It lets us learn about changes an API provider is making that could have a serious impact on customers, and to do so in a controlled, purpose-built environment. The same can't be said for most other apps, where an outage can cause serious loss of time and business.

If you’d like to save time and reduce outages for your business, sign up for Metrist! You can monitor any of 65+ popular cloud apps or install our agent to monitor any cloud dependency you use!

Get started today – it’s free!

