
Sometimes the dam breaks even after plenty of warnings

source link: http://rachelbythebay.com/w/2024/03/05/outage/


Oh dear, it's popcorn for breakfast yet again. Another outage in a massive set of web sites.

It's been about 10 years, so let's talk about the outage that marks the point where I started feeling useful in that job: Friday, August 1, 2014. That's the one where FB went down and people started calling 911 to complain about it, and someone from the LA County sheriff's office got on Twitter to say "knock it off, we know and it's not an emergency".

Right, so, what happened that day has been well-documented, even in the outside world - SRECon talks, a bunch of references in papers, you name it. It was time for "push", and as it was being seeded, that process pretty much consumed all of the available memory (and swap) on the smallest machines.

Then there was this program which ran on every box as root, and its job was to run a bunch of awful subprocesses, capture their outputs, parse them somewhat, and ship the results to a time series database or a logging system. This program is the one that had the infamous bug in it where it would call fork() and save the return value, but didn't check it for failure: the -1 retval.

So, later on, it went to kill this "child process" that never started, and did the equivalent of 'kill -9 -1', and on Linux, that whacks everything but yourself and pid 1 (init). Unsurprisingly, this took down the web server and pretty much everything else. This was pre-systemd on CentOS 5 machines running Upstart, so the only things that "came back" were the "respawn" entries in inittab, like [a]getty on the text consoles.

This is how we were able to fire up a remote console on one of the affected machines and log in and see that there was basically init, the shell that had just been started, and this fbagent process which was responsible for assassinating the entire system that morning.
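To make the failure mode concrete, here's a minimal sketch of that class of bug - my own reconstruction, not fbagent's actual source: fork() fails under memory pressure and returns -1, the value gets saved as if it were a pid, and the later "cleanup" turns into kill(-1, SIGKILL).

    /* Sketch of an unchecked fork() return value turning into kill -9 -1.
     * This is an illustration of the bug class, not the real program. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Under heavy memory/swap pressure, fork() can fail and return -1. */
        pid_t child = fork();

        if (child == 0) {
            /* Child: run one of those awful subprocesses. */
            execlp("uptime", "uptime", (char *) NULL);
            _exit(127);
        }

        /* BUG: no check for child == -1 here.  The fix is one line:
         *     if (child == -1) { perror("fork"); return 1; }
         */

        /* ... later on, time to clean up the "child" ... */
        kill(child, SIGKILL);  /* if child is -1, this is kill(-1, SIGKILL):
                                  run as root, it signals every process
                                  except init and the caller itself */
        waitpid(child, NULL, 0);
        return 0;
    }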

The rest of the story has also been told: it took me a couple of weeks to figure out why we kept losing machines this way, and when I did, I found the source had already been patched. Another engineer unrelated to the fbagent project had been hitting the same problem, decided to go digging, found the "-1" pid situation leaking through, and fixed it.

Even though the fix was committed, it wasn't shipped, because this binary was big and scary and ran as root on (then) hundreds of thousands of machines, and the person who usually shipped it was on vacation getting married somewhere. As a result, the old version stayed in prod for much longer than it otherwise would have, complete with the hair-trigger bug that would nuke every process on the machine.

All it needed was something that would screw up fork, and on that morning, it finally happened.

What hasn't really been told is that the memory situation had been steadily getting worse on those machines that whole summer. We had been watching it creep up, and kept trying to make things happen, but by and large, few people really cared. Also, people had been adding more and more crap to what the web servers would run. Back in those days, you could just tell your endpoint to run arbitrary code, and it basically would, right there on the web server!

Case in point: people had started running ffmpeg on our web servers. They decided that was an AWESOME place to transcode videos. By doing that, they didn't have to build out their own "tier" of machines to do that work, which would have meant requesting resources, and all of that other stuff. Instead, they just slipped that into a release and slowly turned up the percentage knob until it was everywhere.

ffmpeg is no small thing. One instance could pull nine CPU cores and use 800 MB of memory - that's actual memory, not just virtual mappings. Also, this made requests run really long, and when that happened, the "treadmill" in the web server couldn't turn over quickly enough.

What's the treadmill? Well, when you have memory allocations for a bunch of requests that then finish, you have to garbage-collect them eventually. My understanding is that the treadmill essentially worked by waiting until every request that had been active at the same time was also gone, and then it would free up the resources.

This is a little confusing, so think about it this way. These machines were true multitasking, so they'd possibly have 100 or more web server threads running, each potentially servicing a request. Let's say requests A-M were running and then request N started up and allocated some memory. The memory allocated by N would only be freed once not only N was done, but A-M too, since they had overlapped it in time. If any of them stuck around for a while, then N's resources couldn't be freed until that straggler exited.
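Here's a tiny model of that rule - my own sketch based on the description above, not the web server's real allocator: a request's memory only becomes reclaimable once the "oldest still-active request" watermark has moved past it.

    /* Toy model of the "treadmill": request N's memory can be reclaimed
     * only when N is done AND every request that started before N is done,
     * i.e. the oldest-active watermark has advanced past N. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_REQS 128

    static bool started[MAX_REQS];
    static bool finished[MAX_REQS];
    static int next_id;

    int request_start(void)
    {
        int id = next_id++;
        started[id] = true;
        return id;
    }

    void request_finish(int id)
    {
        finished[id] = true;
    }

    /* Oldest request that has started but not yet finished. */
    int watermark(void)
    {
        for (int i = 0; i < next_id; i++)
            if (started[i] && !finished[i])
                return i;
        return next_id;  /* everything has drained */
    }

    /* N's allocations are freeable only once the watermark passes it. */
    bool reclaimable(int id)
    {
        return finished[id] && watermark() > id;
    }

    int main(void)
    {
        int a = request_start();  /* long-running request (think: ffmpeg) */
        int n = request_start();  /* quick request */
        request_finish(n);
        printf("n reclaimable? %d\n", reclaimable(n));  /* 0: a pins it */
        request_finish(a);
        printf("n reclaimable? %d\n", reclaimable(n));  /* 1: watermark moved */
        return 0;
    }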

Given this, it's not too hard to see that really long-running requests effectively limit how often the "treadmill" can run, and thus how often the server will release memory for use in other things.

Also, there were other things going on - really expensive endpoints which could chew a gig of memory all by themselves. This was NOT scalable. You simply couldn't sustain that on these systems.

Basically, if you were to make a time-traveling phone call to me a few weeks before "Call the Cops" happened, and ask me what I was worried about, "web tier chewing memory and going into swap" probably would have been pretty high on the list.

To give some idea of how long this had been going on: that year, July 4th (a national holiday) fell on a Friday, so we had a "three-day weekend". Over that weekend, the site didn't get pushed. This mattered because push would usually get the machines to free up a bunch of memory at once and generally become less burdened.

A regular two-day weekend would leave things looking pretty thin by the time Monday's push rolled around, but a three-day weekend made things a lot worse... and this was a full month before everything finally broke.

So, yeah, the site broke that morning, but it's not like it was too surprising. The signs had been visible for quite a while in advance. Imagine standing on top of a massive dam and you start seeing one leak, then two, then four, and so on. You try to get help but it's just not happening.

Of course, once the dam actually fails, then somehow you find the resources to get people caring about dam maintenance. It's funny how that works.

