source link: http://rachelbythebay.com/w/2022/02/09/nice/
A nice story about Unix processes "infecting" each other

I once worked at a place that had an interesting little problem stemming from the way they managed machines and some of the fun side-effects processes can have on each other. This is back in the pre-systemd days of running services, when things tended to be started or restarted by someone or something that itself ran as root. That is, if you started something, it might inherit one or more things from your environment. Likewise, if you ran something with an unusual environment, it might bestow those "gifts" onto anything it happened to touch.

This story involves a lot of weird little things going back a ways. Anyone who ran OpenSSH before it was a stock part of their Linux distribution probably ran into the problem where you'd apply a glibc upgrade and then OpenSSH would stop letting people log in. This ultimately turned out to be a mismatch between the dynamically linked bits of glibc inside the running sshd and the NSS modules for authentication on disk that it would load on the fly.

After locking themselves out of machines too many times, people generally learned that after you mess with a glibc upgrade, you restart the sshd listener on port 22. Of course, they also had to learn to not whack EVERY sshd at the same time, lest they kneecap the very thing that was letting them on the box, but that's another problem entirely.

Okay, so after whatever summer it was when we had to keep patching these things every few weeks (2002? 2003?), people probably had this drilled into them. Keep that in mind as we go forward.

So we had this system management stuff that worked by having a whole bunch of little scripts that would run to take care of things. It ran everywhere on the fleet several times an hour, and it would fetch updates to those scripts, then run them and upgrade (or downgrade) packages, restart things, and generally keep the boxes up to date.

For various reasons, it ran "niced" to 19, meaning the Linux scheduler was told to take care of other processes that wanted the CPU before it came along. It also ran "ioniced", which told Linux to try to take care of other processes that wanted to do disk operations before getting around to this thing's I/O.
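To make that concrete, here's a minimal sketch of how a wrapper like that might launch its scripts. The function name and the command are made up for illustration; the actual tool isn't shown in the story.

```
import os
import subprocess

# Hypothetical runner (names made up): roughly the same effect as
#   nice -n 19 ionice -c 2 -n 7 <cmd>
def run_deprioritized(cmd):
    # os.nice(19) runs in the child just before exec, so the script starts
    # at the lowest CPU priority -- and so does anything it later starts.
    # ionice(1) wraps it to also drop the I/O priority to best-effort 7.
    return subprocess.run(
        ["ionice", "-c", "2", "-n", "7"] + list(cmd),
        preexec_fn=lambda: os.nice(19),
        check=True,
    )

run_deprioritized(["/bin/true"])
```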

Normally, that was fine, but sometimes this had an interesting effect.

One time, a glibc update went out. This management stuff noticed and upgraded the packages. Then, because of the NSS shenanigans from over a decade before with OpenSSH, their "postinstall" script for the glibc package included a "service sshd restart". This forced it to stop and restart the listener on port 22 so it would still be able to let people log in.

This sshd got started inside the "doubly niced" environment, and so it too was running with very low CPU and disk priorities. You'd think that might not be a huge deal if it was just the occasional human who would ssh in to these otherwise identical boxes for troubleshooting or whatever. It probably would have been fine, but there wouldn't have been a story if not for the next part.
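That inheritance is the whole mechanism here, and it's easy to demonstrate: a child process silently picks up its parent's nice value instead of starting back at zero. A tiny sketch (not the actual sshd restart, obviously):

```
import os
import subprocess

os.nice(19)  # pretend this is the doubly niced management run

# Anything started from here -- a restarted sshd, say -- inherits nice 19.
child = subprocess.run(
    ["python3", "-c", "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # prints 19, not 0
```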

This company also had a home-rolled "deploy" program that worked... by sshing into the box and running stuff. So, if this machine was supposed to serve cat pictures, it might ssh in and run commands to install the cat picture storage stuff, and then it would run the command to start it up.

It should come as no surprise that the cat picture server *also* started with really loose priorities even though it was the very reason for the machine to exist. If anything, it should have been pushed to the front, since that program was really the ONLY thing that machine should have been worrying about. Everything else was just auxiliary cruft for housekeeping and whatnot.

Imagine trying to troubleshoot this. You find out that machines are getting slow, but only after you do a fresh deploy to them, and then only sometimes. But, once it happens, it starts happening *everywhere*. Then, if you reboot the machine (SIGH... but it happens), the problem just vanishes and stops happening until maybe months go by, and then it happens all over again.

Someone eventually realized that the cat picture program was being niced/reniced down to nothing, and that it inherited it from sshd, which inherited it from the management stuff, which only restarted it because it was trying to avoid a bigger disaster of having all logins fail from that point forward.

As for why "the processes didn't notice and then undo the nice/ionice values", think about it. Everyone assumes they're going to get started at the usual baseline/default values. Nobody ever expects that they might get started down in the gutter and have to ratchet themselves back out of it. Why would they even think about that?
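For what it's worth, the defensive check nobody thought to write is small. A sketch might look like the following, with the caveat that lowering a nice value back down needs root or CAP_SYS_NICE, which these root-started daemons would have had:

```
import os

def reset_priority_if_lowered():
    # If we were started down in the gutter, put ourselves back at the
    # default. Requires root or CAP_SYS_NICE to lower the nice value.
    if os.getpriority(os.PRIO_PROCESS, 0) != 0:
        os.setpriority(os.PRIO_PROCESS, 0, 0)

reset_priority_if_lowered()
```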

Truly, this was a way for processes to infect other processes, and this inspired the kind of naughty name one of my coworkers gave it: "unixherp". Yep.

Incidentally, since we don't really start persistent server-type processes directly in the systemd world, I don't think it would happen any more. You run systemctl to say "start X", and then it goes off and does it in its own context. The intent is conveyed without any of the extra funk you might have picked up along the way. How about that.

