Releasing software to the fleet far too quickly broke stuff

source link: http://rachelbythebay.com/w/2020/04/23/rel/

There were some people at a big company, and they had many computers -- like more computers than you'd normally think of anyone having. They had them in vast fleets spread all over the world.

Some of those people built software for those computers. This was not the OS software like Red Hat or Ubuntu at the bottom of the stack. No, this software sat just above that, and added a bunch of features that the OS could not or would not provide. These were things like configuration management and service discovery at the kind of scale this place needed.

It was this middle stuff which made it possible to run the actual services on top. Those services were what provided utility to the customers. That's where the recipes, or cat pictures, or videos, or food orders, or gig-worker matching, or whatever else would live.

This middle layer of infrastructure tended to live in packages a lot like the ones used for the base OS since it was simpler to distribute and update that way. There was a config management regime on these computers, and they had a scheme which was pretty simple when it came to these infra packages. It went like this:

Wake up once in a while. Look at what version of the package is on the box. Look at what version of the package is available in the "package repo". If that one is newer than the one I have, bring it in and install it.
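
I don't have the real agent code to show, but the loop amounted to something like this sketch (the package name and the in-memory stand-ins for the local package database and the repo are purely illustrative):

```python
import time

CHECK_INTERVAL = 15 * 60  # the fleet polled roughly every 15 minutes

# Illustrative stand-ins: the real agent would query the local package
# database and the central package repo instead of these dicts.
INSTALLED = {"infra-discovery": (1, 4, 0)}
REPO      = {"infra-discovery": (1, 5, 0)}

def install(pkg, version):
    # Hypothetical install step: download, unpack, update the local database.
    print(f"installing {pkg} {version}")
    INSTALLED[pkg] = version

def check_once():
    for pkg, have in list(INSTALLED.items()):
        want = REPO.get(pkg)
        if want is not None and want > have:
            # Whatever is newest in the repo wins, unconditionally.
            install(pkg, want)

def update_loop():
    while True:
        check_once()
        time.sleep(CHECK_INTERVAL)
```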

Every device in the fleet did this every 15 minutes or so. As a result, once you managed to commit a new package to the repo, a very large number of machines would pick up that change within 15 minutes. If you screwed something up, it would trash basically everything in the time it took you to notice and start trying to react. Better still, there was no "panic button" or anything else that would reliably stop the updates from driving every system off a cliff. Once it was committed, you were basically going to watch it happen in slow motion whether you liked it or not.

Unsurprisingly, this was eventually deemed a problem. Something needed to change. Work commenced on a new way to manage these infrastructure packages, such that they wouldn't all just wake up and hop to the latest version every time one appeared.

Here's how it worked: there would now be a pointer file for every package controlled by this system. That pointer file could contain two versions: the old one and the new one. The pointer file also contained a number which specified which "phase" was active. This was some added magic which let you select groups of machines for an incremental rollout instead of going from 0 to 100% in one step.
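
The exact file format isn't the interesting part; parsed out, a pointer file carried roughly this information (the field names are invented for illustration):

```python
# Hypothetical pointer file for one package, shown as the parsed structure.
pointer = {
    "package": "infra-discovery",
    "old_version": (1, 4, 0),  # what machines outside the rollout should run
    "new_version": (1, 5, 0),  # what machines in the active phase should pick up
    "phase": 0,                # which entry in the phase table is currently live
}
```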

Each numbered phase referred to a conditional test in an array. You could use a bunch of techniques to select a subset of the fleet. For example, phase 0 was frequently just a handful of systems, like the personal dev boxes used by the people working on the project. It might actually be a literal thing, like "hostname is one of (a, b, c, d, e)".

If your hostname was one of those five, the test would return true, and if the pointer file was set to phase 0, that match would trigger loading the new version instead of the old one.

Subsequent phases would add on larger and larger groups, like "all developer machines across the company", or "all of the west coast", or "this entire datacenter", or "50% of everything", leading up to the final not-a-numbered-phase "global" which meant "every single machine gets this".
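
Put together, version selection looked roughly like this sketch. The predicates and hostname patterns here are made up; the real tests were whatever each team configured, but the shape is the point:

```python
import hashlib
import socket

# Hypothetical phase table mirroring the examples above. Each entry is a
# predicate over this machine's hostname; later phases cover larger groups.
DEV_BOXES = {"dev-a", "dev-b", "dev-c", "dev-d", "dev-e"}

def half_of_everything(hostname):
    # Deterministic 50% bucket based on a hash of the hostname.
    return int(hashlib.sha1(hostname.encode()).hexdigest(), 16) % 2 == 0

PHASES = [
    lambda h: h in DEV_BOXES,        # phase 0: a literal handful of hosts
    lambda h: h.startswith("dev-"),  # phase 1: all developer machines
    lambda h: h.endswith(".west"),   # phase 2: the west coast
    lambda h: ".dc7." in h,          # phase 3: one entire datacenter
    half_of_everything,              # phase 4: 50% of everything
]

def desired_version(pointer, hostname=None):
    hostname = hostname or socket.gethostname()
    if pointer["phase"] == "global":
        return pointer["new_version"]
    # Phases are cumulative: anything matched by the active phase or an
    # earlier one takes the new version; everyone else stays on the old one.
    for predicate in PHASES[: pointer["phase"] + 1]:
        if predicate(hostname):
            return pointer["new_version"]
    return pointer["old_version"]

# e.g. desired_version(pointer, "dev-a") -> (1, 5, 0) while phase is 0,
# while some random production host keeps getting (1, 4, 0).
```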

Some teams were very happy with this. They set up a bunch of phases and rolled out their releases in steps. These teams tended to catch things before they made it to 100% distribution. They weren't the problem.

The problem was that some teams set this up and then promptly ignored it. It was totally possible to leave the pointer at "phase global", and then just flip the version string from one thing to the next. This would make every single machine wake up and take the new version the next time it ran a config check (every 15 mins or so).
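
In terms of the sketch above, that failure mode is just this:

```python
# Parked at "global", so every host takes whatever new_version says on its
# next 15-minute check; there is no staged rollout left in the loop.
pointer["phase"] = "global"
pointer["new_version"] = (2, 0, 0)  # whatever the team just built
```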

At least one team did exactly this... and went largely unnoticed until one day they somehow came up with a multi-gigabyte package and dropped it on the entire fleet at the same time. THUD.

Every single server had to download this thing (using network bandwidth), drag it onto the disk (consuming disk bandwidth), burn CPU time and memory to decompress it, burn more disk bandwidth to write it out uncompressed, then fling all of the contents to their final destinations (even more disk bandwidth) and update the various databases to say "yes, this package is now here".

Since this happened everywhere more or less at the same time, it meant the entire company's fleet of services all slowed down at once, and a bunch of things sagged under the load. People definitely noticed. It caused an outage. A SEV was opened. It came to review. They had to admit what had happened.

At that point, it was decided that something easier had to be done. The barrier to entry for safe and sane rollouts had to be lowered even more to bring more teams on board. The alternative of letting it happen again and again was unacceptable.

This was one of those times when we needed to build a system that made it stupidly easy to do the right thing, so that it would actually be MORE work to do the wrong thing ... like going 0 to 100 in a single step. Some people obviously couldn't be bothered to step their release along over the course of a couple of days, and we had to find some way to protect the production environment from them.

We wound up building something which did exactly that. I'll tell the story of how that came to be another time.

