A parasitic RPC service that broke core dumps

I already told a story about how you can hose your machine by pointing the Linux kernel at a custom core dump handler and then having that handler fail to function. This'll clog up the pipe (really), and then everything else will just stack up behind it. Pretty soon, every crashing process will be stuck just the same.

What I left out of that story is how it got to that point one time.

Once upon a time, in a big company environment, someone decided that they wanted to provide a service to other things running on the same Linux box. But, instead of writing Yet Another Service (tm) that would go through release processes and all of that good stuff, they went for a different design entirely.

Yes, this service lived in *client* programs. Every time a program built using the usual corp RPC transport stuff started up, it would try to bind to a certain TCP port. If it succeeded, it said "well, I guess I'm the winner" and proceeded to start up this service on that port. Otherwise, the idea was that it would just then connect to that port and talk to whoever got there first.
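To make that startup dance concrete, here is a minimal sketch in Python. The original was certainly not this code: the function name bind_or_connect, the loopback address, and the 5 second connect timeout are all my own assumptions for illustration.

```python
import socket


def bind_or_connect(port, host="127.0.0.1"):
    """Try to become the server on `port`; if it's taken, connect as a client.

    Hypothetical illustration of the pattern, not the original library.
    Returns ("server", listening_socket) or ("client", connected_socket).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind((host, port))
    except OSError:
        # Somebody else got there first, so talk to them instead.
        listener.close()
        client = socket.create_connection((host, port), timeout=5.0)
        return "client", client
    # We "won", so this process now hosts the service for the whole box.
    listener.listen()
    return "server", listener
```

Even this toy version shows the weirdness: whichever process happens to run it first ends up owning the listening socket, and there's a window between bind and listen where a would-be client can get its connection refused.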

There are so many problems with this. You have the issue of a "server" that appears out of thin air depending on who or what starts first on the machine. It takes up CPU time and memory that gets "charged" back to the process that "won", instead of being its own thing that tracks back to whoever wrote this thing in the first place.

Imagine if the first car that got on the freeway every morning after a certain time had to follow everyone to their offices and make coffee for them. That's the kind of random, completely nonsensical and difficult to troubleshoot thing I'm talking about.

Better still, apparently this thing they did was buggy. As the story goes, it started threads to do this stuff, and it managed to have some circular references such that it would never get a refcount of 0, and so it would never actually go away. On top of that, the thing which attempted to grab the port had an infinite timeout, so it would just try forever. It never actually failed, so it never gave up and shut down.
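For contrast, here's a rough sketch of what a retry loop looks like when it can actually give up. It's Python, and everything in it (acquire_with_deadline, try_once, the default timings) is a made-up name or number for illustration, not anything from the real library.

```python
import threading
import time


def acquire_with_deadline(try_once, stop, timeout_s=2.0, interval_s=0.05):
    """Call try_once() until it succeeds, the deadline passes, or stop is set.

    try_once: hypothetical callable returning True on success.
    stop: a threading.Event the owner sets when it wants to shut down.
    Returns True on success, False if we gave up.
    """
    deadline = time.monotonic() + timeout_s
    while not stop.is_set() and time.monotonic() < deadline:
        if try_once():
            return True
        # Sleep between attempts, but wake up right away if asked to stop.
        stop.wait(interval_s)
    return False
```

The point is just that there are two ways out of the loop besides success: a deadline, and an explicit stop signal from whoever owns the thread. The thing in this story apparently had neither.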

This meant every program built after a certain commit in the tree had this misfeature in it, and could not shut down cleanly by itself. You'd have to shoot it in the head with a strong enough signal to make it go away.

Now, given that basically everything eventually used this library, imagine what happened when the core dump helper thing got a hold of it. It too got stuck, and then failed to shut down. This meant the kernel never managed to finish its core dump sequence when something on the box crashed. That then backed up everything behind it.

Basically, this bug had been breaking other programs up to that point, but when it reached the core dump helper program, that's when it turned really bad. It took that kind of visibility to bring enough people to bear on the problem and finally shut it down for good.

My own heuristic for picking up on this kind of thing is that every thread but one is in exit(), and the last one is in some pipe_* function pretty deep inside the kernel... and probably a few stack frames past do_coredump or similar.
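If you wanted to mechanize that second half of the heuristic, it might look something like this Python sketch. It assumes you've already captured the text of a thread's kernel stack (on Linux that's /proc/PID/task/TID/stack, which generally needs root), and it assumes frames look like "[<0>] symbol+0xoff/0xsize". The function names are mine, and the exact symbols you'll see vary by kernel version.

```python
import re

# Grab the symbol name that follows the bracketed address on each frame line.
_FRAME = re.compile(r"\]\s+([^\s+]+)")


def frame_symbols(stack_text):
    """Return the kernel symbol names found in a /proc-style stack dump."""
    symbols = []
    for line in stack_text.splitlines():
        match = _FRAME.search(line)
        if match:
            symbols.append(match.group(1))
    return symbols


def looks_like_stuck_coredump(stack_text):
    """True if the stack has both a pipe_* frame and a coredump frame."""
    symbols = frame_symbols(stack_text)
    in_pipe = any(s.startswith("pipe_") for s in symbols)
    in_coredump = any("coredump" in s for s in symbols)
    return in_pipe and in_coredump
```

A match doesn't prove anything by itself, since a healthy core dump passes through the same code. It's the thread sitting there for minutes on end that gives it away.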

If it happens to you, hopefully you'll remember this.
