Race-free process creation in the GNU C Library
source link: https://lwn.net/Articles/943022/
Unix systems refer to processes via an integer ID (the "process ID" or PID) that is assigned at creation time. The problem with PIDs is that they are reused over time; once a process with a given PID has exited and been reaped, that PID can be assigned to a new and unrelated process with the result that any given PID might not, in fact, refer to the process that the user thinks it does. To address this problem, the pidfd concept was introduced; a pidfd is a file descriptor that acts as a handle for a process. The process associated with a pidfd can never change, so many of the race conditions associated with PIDs do not exist with pidfds.
Current glibc releases include wrappers for a number of the low-level pidfd-related system calls, including pidfd_open(), pidfd_getfd(), and others. There is one piece missing, though: the ability to obtain a pidfd for a new process as that process is created. It is possible to use pidfd_open() to get a pidfd from a PID immediately after creation, but that still leaves a narrow window during which the process identified by a PID could exit and be replaced by another. Closing that window requires obtaining a pidfd from the kernel as a result of creating a new process, and glibc provides no way to do that.
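The after-the-fact approach described above can be sketched as follows. This is a minimal illustration rather than glibc code: `sys_pidfd_open()`, `spawn_then_open_pidfd()` and `wait_on_pidfd()` are hypothetical helper names, the raw `pidfd_open` system call is used so the sketch does not depend on the glibc wrapper, and a Linux 5.4 or later kernel is assumed for `waitid()` with `P_PIDFD`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <spawn.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* waitid() id type for pidfds; fallback for older headers */
#endif

extern char **environ;

/* Raw system call, so the sketch does not need glibc's pidfd_open() wrapper. */
static int sys_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Hypothetical helper: spawn `path` and return a pidfd for it, or -1.
 * The pidfd is opened only after the child exists, so this is safe only
 * while nothing else in the process reaps the child in between.
 */
static int spawn_then_open_pidfd(const char *path, char *const argv[])
{
    pid_t pid;

    if (posix_spawn(&pid, path, NULL, NULL, argv, environ) != 0)
        return -1;
    return sys_pidfd_open(pid);
}

/* Reap the child through its pidfd; returns its exit status, or -1. */
static int wait_on_pidfd(int pidfd)
{
    siginfo_t info = {0};

    if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) != 0)
        return -1;
    return info.si_code == CLD_EXITED ? info.si_status : -1;
}
```

A caller would pass something like `"/bin/sh"` with an argument vector, keep the returned descriptor for as long as it needs a handle on the child, and close it afterwards.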
That functionality could be provided by adding a wrapper for the clone3() system call, but there is some resistance to doing that. Instead, Adhemerval Zanella has chosen to enhance the posix_spawn() API, which many see as a better way to create a process (when creation is immediately followed by an exec() call) than the Unix fork() model. The result is two new functions:
    int pidfd_spawn(int *restrict pidfd,
                    const char *restrict file,
                    const posix_spawn_file_actions_t *restrict facts,
                    const posix_spawnattr_t *restrict attrp,
                    char *const argv[restrict],
                    char *const envp[restrict]);

    int pidfd_spawnp(int *restrict pidfd,
                     const char *restrict path,
                     const posix_spawn_file_actions_t *restrict facts,
                     const posix_spawnattr_t *restrict attrp,
                     char *const argv[restrict_arr],
                     char *const envp[restrict_arr]);
Just like posix_spawn() and posix_spawnp(), these functions execute a combination of clone() and exec() to create a new process running the program indicated by file or path. The result, though, will be a pidfd identifying the created process rather than a PID; as the prototypes show, it is stored through the pidfd pointer, while the int return value is an error code, as with posix_spawn().
If the creator needs to know the new process's PID, that can be obtained by a new function added by the patch set:
pid_t pidfd_getpid(int pidfd);
This function obtains the PID by looking at the /proc entry for the given pidfd.
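What such a /proc lookup might look like is sketched below. It assumes the PID is taken from the "Pid:" line that the kernel exposes in /proc/self/fdinfo/<fd> for a pidfd; `pid_from_pidfd()` and `sys_pidfd_open()` are hypothetical names, not the patch set's code.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Raw system call, so the sketch does not need glibc's pidfd_open() wrapper. */
static int sys_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Hypothetical helper: return the PID behind `pidfd`, or -1 on failure.
 * It parses the "Pid:" line of /proc/self/fdinfo/<fd>, so it cannot work
 * when /proc is not mounted.
 */
static pid_t pid_from_pidfd(int pidfd)
{
    char path[64], line[256];
    long pid = -1;
    FILE *f;

    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", pidfd);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* "NSpid:" lines do not match, because the format starts with 'P'. */
        if (sscanf(line, "Pid: %ld", &pid) == 1)
            break;
    }
    fclose(f);
    return (pid_t)pid;
}
```

The dependence on a mounted /proc is visible in the `fopen()` call; without it the helper can only report failure.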
The new functions are implemented with clone3() to obtain the pidfd during process creation, without a race window. Using clone3() makes some other things possible as well, specifically creating the new process in a different control group than the creator's. Zanella has made this capability available as well, via an extension to the posix_spawn() attribute mechanism. Creating into a different control group is available for posix_spawn() as well as pidfd_spawn().
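For illustration, obtaining a pidfd atomically at creation time with a raw clone3() call might look like the sketch below. `fork_with_pidfd()` is a hypothetical fork()-like helper and not part of the patch set; it assumes kernel headers that define `struct clone_args` and `SYS_clone3` (Linux 5.3 or later).

```c
#define _GNU_SOURCE
#include <errno.h>        /* callers inspect errno when the helper fails */
#include <linux/sched.h>  /* struct clone_args, CLONE_PIDFD */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* waitid() id type for pidfds; fallback for older headers */
#endif

/*
 * Hypothetical fork()-like helper built directly on clone3().
 * Returns 0 in the child and the child's PID in the parent, where a pidfd
 * for the child is also stored in *pidfd. Returns -1 with errno set on
 * error (ENOSYS on kernels or sandboxes that do not offer clone3()).
 */
static pid_t fork_with_pidfd(int *pidfd)
{
    struct clone_args args;

    memset(&args, 0, sizeof args);
    args.flags = CLONE_PIDFD;                /* ask the kernel for a pidfd */
    args.pidfd = (uint64_t)(uintptr_t)pidfd; /* where the kernel stores it */
    args.exit_signal = SIGCHLD;              /* behave like fork() at exit */
    return (pid_t)syscall(SYS_clone3, &args, sizeof args);
}
```

Because the kernel hands back the descriptor as part of process creation, there is no point at which the caller holds only a bare PID.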
While posix_spawn() is seen by many as a better model for the combination of fork() and exec(), it does not provide all of the functionality that is available. For cases where this API is not sufficient, earlier versions of the patch set included a function called fork_np() as a separate wrapper around clone3() that would return a pidfd identifying the new child process. Florian Weimer complained that this interface differs too much from what the kernel provides, though, and is "not future-proof at all". He asked Zanella to leave this function out of the series for now, and it has been duly removed from later versions of the series.
Rich Felker, instead, objected to the concept in general, claiming that any PID-related races are "purely programmer error" and that "making a new, complex, highly nonstandard interface to work around a problem that's programmer error, and getting this nonstandard and nonportable pattern into mainstream software, has negative value". It would be better, he said, to fix the software affected by this problem. Luca Boccassi disagreed, though, saying that "these are real race conditions, that cannot be solved otherwise". Weimer also said that there was value in introducing the pidfd functionality.
While there has been no definitive resolution to this particular disagreement, the fact remains that PID races can be a problem, and there are users (such as systemd) that would like to have this type of API to avoid those races. It thus seems reasonably likely that pidfd_spawn() (though perhaps not fork_np()) will eventually find its way into glibc.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 14:54 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 15:56 UTC (Fri) by mb (subscriber, #50428) [Link]
Oh no, please not yet another fundamental thing that depends on /proc being mounted.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 18:15 UTC (Fri) by dwest (subscriber, #110523) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 19:11 UTC (Fri) by mb (subscriber, #50428) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 19:21 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 19:27 UTC (Fri) by mb (subscriber, #50428) [Link]
One additional nail into the coffin of unprivileged containers?
>The way polkit/dbus
I'm talking about the fundamental pidfd API. Any process could use pidfds.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 19:35 UTC (Fri) by bluca (subscriber, #118303) [Link]
I'm pretty sure those can have /proc too?
$ id -u
1000
$ unshare -U -m --mount-proc -p -f
$ mount | grep img
proc on /tmp/img type proc (rw,nosuid,nodev,noexec,relatime)
> I'm talking about the fundamental pidfd API. Any process could use pidfds.
Sure, to do process tracking - what kind of process would you need to track in a chroot? Besides, it's all moot, this is not glibc's fault, the kernel provides this interface, so that's what glibc can use to provide an abstraction
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 19:36 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link]
Any process that wants to spawn a process and use pidfd, but also write the pid in a log file or debug trace? Ignoring portability for a second, it could even be something like make or cargo.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 21:19 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 21:30 UTC (Fri) by pbonzini (subscriber, #60935) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 23:23 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 23:46 UTC (Fri) by josh (subscriber, #17465) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 0:43 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:08 UTC (Sat) by josh (subscriber, #17465) [Link]
(That operation would still be useful when passed a pidfd from elsewhere, but not *necessary* for the common case where you got the pidfd by creating a process.)
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:37 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 3, 2023 4:14 UTC (Sun) by IanKelling (subscriber, #89418) [Link]
I don't think it is hypothetical. From my sysadmin perspective, I often build software in a chroot without a /proc mount. Very rarely, the build has needed it and I wanted to know why. Bind mounting /proc, I see find shows 546,160 user-listable files and 304,803 user-readable files. Making that a requirement to create processes just because I opt in to an API that avoids a race condition would be roughly a regression in my book.
Race-free process creation in the GNU C Library
Posted Sep 3, 2023 10:26 UTC (Sun) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 4, 2023 9:16 UTC (Mon) by taladar (subscriber, #68407) [Link]
Race-free process creation in the GNU C Library
Posted Sep 4, 2023 9:53 UTC (Mon) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 22:07 UTC (Fri) by geofft (subscriber, #59789) [Link]
Meanwhile, the kernel has a feature where, if your current /proc is in any way overmounted, you're not allowed to mount a new /proc - because that would give you access to the files that are supposed to be hidden to you. This is also, in isolation, an understandable / defensible feature.
The intersection of these features is that you can't correctly mount /proc inside a nested container or container-like thing inside a non-privileged Kubernetes container. If you make a new pidns (either because you're root or via a new userns, as in your example), all the paths in /proc are wrong because they refer to outer PIDs.
(The intersection of these features also ceases to be really defensible in the case where you don't allow your Kubernetes workloads to run as uid 0, which is a really good idea on its own.)
There have been some patches for a second procfs (whose exact name I'm forgetting) that provides /proc/$pid/ and the /proc/self/ symlink but not anything else in /proc, but I don't think they've been merged. If those could get merged and guaranteed mountable by anyone with CAP_SYS_ADMIN in their current namespace, regardless of what the existing /proc outside it looks like or even whether it exists, that would satisfactorily address the issue.
I suppose another option would be for /proc to always enumerate the calling process's PID namespace, but maybe that gets weird with open file descriptors passed between PID namespaces.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 22:28 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:56 UTC (Sat) by cyphar (subscriber, #110703) [Link]
In fact this also means you can bypass the check entirely -- if you have a "safe" subset=pids mount in your namespace, the kernel will allow you to mount an unmasked (fully-fledged) procfs.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:24 UTC (Fri) by wahern (subscriber, #37304) [Link]
Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit? Now rather than getting EMFILE, you get unexpected errors from syscall wrappers. And to avoid descriptor leaks, libc has to go through herculean efforts to make the syscall wrapper async- and thread-safe, and those efforts are definitely not always bug-free; or alternatively, now there's another threading/fork foot gun lying around.
None of these issues may be of concern to *you*, but they're of concern to other people, and have been for decades. Moreover, PID fds are an interface which people concerned about reliability, correctness, and security have been desiring for a long time; PID fd usability being tied to procfs substantially reduces the net value. Not all process management can be shoe-horned into systemd and other global services; far from it. Process management is often something one needs to perform *after* dropping various privileges. That not all privilege separating or privilege reducing tasks can be performed immediately before or after exec, or cannot be reduced to one-line configuration directives, is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:48 UTC (Fri) by bluca (subscriber, #118303) [Link]
Nah, procfs supports various sandboxing features nowadays, and especially when unprivileged it necessarily implies a pid namespace so you do not have visibility in the rest of the system, only on processes in your pid namespace, and if it's a chroot that's going to be just the shell. If you are privileged, you can use the ProtectProc= systemd option (or if you are running on the 0.000x% of Linux install, mount /proc with the various hidepid= options that provide equivalent functionality)
> Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit?
The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.
> PID fd usability being tied to procfs substantially reduces the net value.
Considering they've been available as-is for 4 years and nobody bothered to do anything about that, and have been providing great net value in the meanwhile, I'll have to take that with a grain of salt.
> Process management is often something one needs to perform *after* dropping various privileges.
Not sure what that has to do with using procfs?
> is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.
I mean, if you dislike modern Linux so much and prefer OpenBSD, then just use OpenBSD? That's an absolutely fine thing to do.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 21:34 UTC (Fri) by bbockelm (subscriber, #71069) [Link]
Oh, the youthful banter of someone who hasn't spent a few hours this week debugging issues caused by file descriptor exhaustion!
(In this case, it was due to a hypervisor that booted a VM with trivial amounts of memory, the VM kernel adjusted system-wide file descriptor limits down accordingly, then the hypervisor would hotplug another 32GB of RAM later...)
For what it's worth, I agree this _should_ have been a problem relegated to history. I want to live in the future!
Race-free process creation in the GNU C Library
Posted Sep 6, 2023 8:39 UTC (Wed) by lathiat (subscriber, #18567) [Link]
I found the following very low Default:
# systemctl show --property=DefaultTasksMax
DefaultTasksMax=981
Which you also see in cgroupfs:
# find /sys/fs/cgroup -name pids.max -exec grep -H . {} \;
The systemd docs state this is set based on threads-max "Configure the default value for the per-unit TasksMax= setting. See systemd.resource-control(5) for details. This setting applies to all unit types that support resource control settings, with the exception of slice units. Defaults to 15% of the minimum of kernel.pid_max=, kernel.threads-max= and root cgroup pids.max. Kernel has a default value for kernel.pid_max= and an algorithm of counting in case of more than 32 cores. For example with the default kernel.pid_max=, DefaultTasksMax= defaults to 4915, but might be greater in other systems or smaller in OS containers."
We then find a very low /proc/sys/kernel/threads-max of 6541. According to the kernel docs "During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages."
Despite being a pretty experienced Linux performance engineer, it took me a bit to find that one, as it only showed up in the cgroup limits and not in /proc/PID/limits.
Good times :)
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 21:56 UTC (Fri) by dezgeg (subscriber, #92243) [Link]
Is it really common to have no ulimit for them? A limit of 1024 fds has been very typical in what I've seen (since the default FD_SETSIZE is that, so most programs that use select() will break on high fds).
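Whether a given process is exposed to this can be checked at run time. The sketch below is an illustration only; `fds_may_exceed_select()` is a hypothetical helper that compares the soft RLIMIT_NOFILE limit with FD_SETSIZE.

```c
#include <sys/resource.h>
#include <sys/select.h>

/*
 * Hypothetical helper: returns 1 if this process may be handed descriptors
 * numbered at or above FD_SETSIZE (which select() cannot track), 0 if the
 * soft RLIMIT_NOFILE limit rules that out, or -1 on error.
 */
static int fds_may_exceed_select(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    /* With a soft limit of N, the highest possible descriptor is N - 1. */
    return rl.rlim_cur > (rlim_t)FD_SETSIZE ? 1 : 0;
}
```

A result of 0 means the traditional 1024-descriptor ceiling is still in force for the calling process, whatever the system-wide limits are.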
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 22:29 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 4, 2023 20:18 UTC (Mon) by comex (subscriber, #71521) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 23:50 UTC (Fri) by josh (subscriber, #17465) [Link]
People have been bothering to do something about that, and it has taken this long to get something on a potential path to acceptance.
It's the fault of libc that we cannot simply call clone3 directly. It's the responsibility of libc to *stop hiding the underlying useful functionality* just because it thinks it knows better.
Race-free process creation in the GNU C Library
Posted Nov 14, 2023 23:57 UTC (Tue) by Rudd-O (guest, #61155) [Link]
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 6:37 UTC (Tue) by fw (subscriber, #26023) [Link]
So it's unfortunately not the case that proc is universally available or can be made so.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 18:28 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:32 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:36 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:10 UTC (Sat) by josh (subscriber, #17465) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:38 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 12:05 UTC (Tue) by hmh (subscriber, #3838) [Link]
While it looks at first glance that it would be "easy" to write one, that's for someone already used to working in that area of the kernel -- there are likely permission checks one needs to get perfectly right to not create a security mishap, namespace concerns, etc. Experience in the specific area of the kernel you're working with almost always helps a lot with the quality of the first public version of a patch, and with faster acceptance in mainline for non-controversial changes.
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 12:29 UTC (Tue) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 20:48 UTC (Fri) by Karellen (subscriber, #67644) [Link]
I wonder if there would be any value in a system call that does the equivalent of open("/proc", O_PATH|O_DIRECTORY|O_CLOEXEC) and returns an fd to the proc filesystem - even if /proc is mounted elsewhere or not at all? And similarly for /dev and /sys?
Then again, if admins wanted to limit access to those filesystems for a container, they'd need to implement some kind of seccomp-bpf/pledge style block, instead of just... not mounting those filesystems in the container.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 21:26 UTC (Fri) by NYKevin (subscriber, #129325) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 2:03 UTC (Sat) by cyphar (subscriber, #110703) [Link]
I have wondered whether it would be possible to allow fsopen("proc") to unprivileged processes but only for subset=pids -- this would solve many hacks needed in container runtimes to defend against certain attacks. Unfortunately, I suspect that even the new mount infrastructure is probably not going to be considered safe for unprivileged users to touch.
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 9:06 UTC (Thu) by Jonno (subscriber, #49613) [Link]
That would still let the unprivileged process learn of other processes on the system that it otherwise would be oblivious about.
But perhaps allowing something like `openat(pidfd, ".", O_DIRECTORY)` to get a fd equivalent to the /proc/<pid> directory except you can't ".." out of it would work.
Race-free process creation in the GNU C Library
Posted Sep 9, 2023 5:03 UTC (Sat) by cyphar (subscriber, #110703) [Link]
It's a bit of a shame, because that could've been the nicest behaviour -- though the contents of quite a few procfs files depend on the pid namespace associated with the procfs in ways that will cause confusion when sending them between processes and I'm not sure there would be a nice solution for that.
Race-free process creation in the GNU C Library
Posted Sep 16, 2023 14:35 UTC (Sat) by Jonno (subscriber, #49613) [Link]
Not quite. The first way to get an fd reference to a process was open("/proc/«pid»", O_DIRECTORY) [or open("/proc/self", O_DIRECTORY)], giving you a directory fd that was guaranteed never to refer to a newer process, even if the pid was reused (it would instead refer to an unlinked directory). The problem was that this (1) required a mounted procfs to work, and (2) could not be used for polling or waitid. The upside was that, being a directory fd, you could use it to open files in the procfs directory of the process in question.
To re-gain that ability without the old problems you need some race-free way of going from a pidfd to the corresponding dirfd without a mounted procfs. Simply getting a procfs reference for use in *at syscalls without actually mounting procfs (as proposed by Karellen) would make it possible for live processes, but not for exited processes still referred to by a pidfd, and it wouldn't be race-free. My proposal using openat, or some new flag to dup3 or fcntl, would solve it fully.
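The older directory-fd approach can be sketched as follows, assuming a mounted /proc. `comm_via_proc_dirfd()` is a hypothetical helper that reads a process's command name through a /proc/<pid> directory descriptor.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the command name of `pid` into buf, using a
 * /proc/<pid> directory fd as the handle on the process. Returns 0 on
 * success and -1 on failure.
 */
static int comm_via_proc_dirfd(pid_t pid, char *buf, size_t len)
{
    char path[32];
    ssize_t n;
    int dirfd, fd;

    if (len == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%d", (int)pid);
    dirfd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0)
        return -1; /* no procfs mounted, or no such process */

    /* Lookups relative to dirfd cannot land on a later owner of the PID. */
    fd = openat(dirfd, "comm", O_RDONLY | O_CLOEXEC);
    close(dirfd);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    buf[strcspn(buf, "\n")] = '\0'; /* drop the trailing newline */
    return 0;
}
```

The first `open()` is still a lookup by PID, so the handle is only as trustworthy as the moment it was obtained, and it cannot be polled or passed to waitid().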
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 4:39 UTC (Sat) by iabervon (subscriber, #722) [Link]
It really seems like it would be sensible for the kernel to provide the information that's in /proc/self available to the process itself without access to procfs more generally or use of absolute paths. On the other hand, that's a separate issue from the pidfd stuff.
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 22:29 UTC (Fri) by alkbyby (subscriber, #61687) [Link]
Perhaps it would help if someone could elaborate more what are exact valid or semi-valid uses that are raceful currently (or article could be updated).
I.e. classic posix_spawn and wait should just work. Don't wait{,pid} for your child until you've grabbed its pidfd and you have no race.
I can only see one special case, which is that if the parent ignores SIGCHLD, then the child's exit status is automatically collected, so wait{,pid} won't see it. There is no zombie stage and there is no pid to find. And then, indeed, we could use pidfd bits including this new API to handle this case, which would otherwise be raceful. I am not sure how much demand for this case there is, since it "breaks" wait{,pid} anyway.
Or am I missing anything ?
Race-free process creation in the GNU C Library
Posted Sep 1, 2023 23:28 UTC (Fri) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 0:26 UTC (Sat) by alkbyby (subscriber, #61687) [Link]
But my point is that as long as we're able to guarantee that the child's pid is not reused, there is no race if/when the parent calls whatever set_xyz on the child's pid (it may find the child dead, but it'll never confuse this child with another process). And the classic mechanism of zombies gives us exactly that. The child's pid won't get reused until the parent collects the child's status.
P.S. Also I was under the impression that lot/most of those "many things" (setsid, unshare etc) are typically what child does for itself (after clone_vfork but before exec, for which posix_spawn has numerous attributes).
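The zombie argument can be checked with a short experiment. This is a sketch under stated assumptions: `zombie_pins_pid_demo()` and `sys_pidfd_open()` are hypothetical names, SIGCHLD is assumed not to be ignored, and a Linux 5.4 or later kernel is assumed for `waitid()` with `P_PIDFD`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* waitid() id type for pidfds; fallback for older headers */
#endif

/* Raw system call, so the sketch does not need glibc's pidfd_open() wrapper. */
static int sys_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Hypothetical demo: returns 0 if an exited but unreaped child still owns
 * its PID, so that pidfd_open() on that PID reaches the right process.
 */
static int zombie_pins_pid_demo(void)
{
    siginfo_t info = {0};
    int pidfd;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0); /* child exits at once and becomes a zombie */

    /* Wait for the exit, but leave the zombie in place (WNOWAIT). */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    /* Signal 0 only probes for existence: the zombie still holds the PID. */
    if (kill(pid, 0) != 0)
        return -1;

    pidfd = sys_pidfd_open(pid); /* still our child, since nobody reaped it */
    if (pidfd < 0)
        return -1;

    /* Only now is the child reaped, releasing the PID for reuse. */
    if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) != 0 || info.si_pid != pid) {
        close(pidfd);
        return -1;
    }
    close(pidfd);
    return 0;
}
```

The guarantee holds only for the parent, and only while no other code in the same process calls wait() first.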
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 10:17 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 0:39 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]
First, it's not possible to handle SIGCHLD meaningfully in many environments (e.g. in a lot of scripting languages). Second, even with SIGCHLD handlers, you have to walk a tightrope to have truly race-free code. You can only wait() on processes in exactly one thread (likely in the main event loop), which has to execute exclusively with any other code that might operate on processes. So the only thing your handler can do safely is to kick the event loop to perform a waitid()/waitpid() check.
And forget about multithreading and composability. It's simply impossible to write fully correct multithreaded process management code.
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 1:41 UTC (Sat) by alkbyby (subscriber, #61687) [Link]
We might be misunderstanding each other, somehow. But what you said is untrue. A thread can easily posix_spawn a sub-process and waitpid for it. Even from inside a library. Yes, if the process does a blanket wait() in some other thread it won't work, but this seems a borked design to me. (Is that one of the use-cases quoted by the article? Are there non-trivial programs or libraries doing such a thing?)
There are definitely libraries doing sub-process spawning. E.g. I recently learned tensorflow does to compile some hw accelerator codes.
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 2:15 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]
However, _any_ other wait/waitid() in the process can reap it, waits are not thread-scoped. So you can't have anybody in the process calling them. And if you ONLY do waitpid() calls, it might even be composable.
Except... you do have to call wait() periodically to avoid zombies, because your spawned process can die and reparent its children into your process.
Race-free process creation in the GNU C Library
Posted Sep 3, 2023 9:50 UTC (Sun) by roc (subscriber, #30627) [Link]
Race-free process creation in the GNU C Library
Posted Sep 4, 2023 3:32 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 2:55 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]
Then they are quite likely unsafe, though in practice they would work fine in the vast majority of cases because typical race windows are pretty narrow. You really need malicious input and/or users to exploit that.
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 4:07 UTC (Sat) by alkbyby (subscriber, #61687) [Link]
Well, your comment above about reparenting is only right for pid 1. And I am not sure how much software there is that "steals" other modules' or libraries' dead kids. My impression is there shouldn't be much.
I quickly inspected libuv for sub-process spawning, they don't steal. And glib. They also do the right thing (even with pidfd when available, since pidfd can be nicely polled).
With all that, I am still curious what the use-cases might be that people are trying to fix with the proposed pidfd_spawn API. So far we've established it could be:
a) when process breaks wait{,pid} by ignoring SIGCHLD
b) when process has things that steal dead kids
But perhaps there are more. And I am curious how common those "bad" cases might be.
Why not fix process ids?
Posted Sep 2, 2023 6:32 UTC (Sat) by epa (subscriber, #39769) [Link]
Make process ids 64 bit and they can be unique for the lifetime of the system.
Why not fix process ids?
Posted Sep 2, 2023 7:04 UTC (Sat) by Subsentient (subscriber, #142918) [Link]
Why not fix process ids?
Posted Sep 2, 2023 14:23 UTC (Sat) by corbet (editor, #1) [Link]
That can, indeed, be done now by messing with /proc/sys/kernel/pid_max. Making bigger process IDs the default will always risk breaking applications, though.
Why not fix process ids?
Posted Sep 2, 2023 16:04 UTC (Sat) by pebolle (subscriber, #35204) [Link]
I could be misreading include/linux/threads.h, but since systemd on my (Fedora) system sets pid_max to that value out of the box I don't think I actually am.
Why not fix process ids?
Posted Sep 4, 2023 9:50 UTC (Mon) by mezcalero (subscriber, #45103) [Link]
That said the kernel max is 22bit or so iirc, i.e. far from 32 or even 64bit...
Why not fix process ids?
Posted Sep 4, 2023 10:12 UTC (Mon) by pebolle (subscriber, #35204) [Link]
That's correct (and thanks for confirming my reading of include/linux/threads.h).
Why not fix process ids?
Posted Sep 4, 2023 19:14 UTC (Mon) by adobriyan (subscriber, #30858) [Link]
It is simple to implement correct process killing. All the programmer needs to do is hold a /proc/$pid descriptor while sending the signal.
32 and 64-bitness doesn't change anything.
I've checked what htop does and it seems to do it wrong: it opens /proc/$pid, then openat()s a few files from there, but then closes the directory.
41073 openat(3, "41057", O_RDONLY|O_NOFOLLOW|O_DIRECTORY) = 4
41073 openat(4, "task", O_RDONLY|O_NOFOLLOW|O_DIRECTORY) = 5
...
41073 close(5) = 0
...
41073 close(4)
...
41073 kill(41057, SIGTERM) = 0
Why not fix process ids?
Posted Sep 4, 2023 19:21 UTC (Mon) by adobriyan (subscriber, #30858) [Link]
If kill -TERM is done from /proc/$pid !
$ ./pause &
[1] 41956
$ cd /proc/41956
# double check it is the same process, VERY IMPORTANT
$ cat comm #cmdline
pause
# send signal WITHOUT LEAVING /proc/$pid (VERY IMPORTANT)
$ kill -TERM 41956
# ... and it's gone!
$ cat comm
cat: comm: No such process
Why not fix process ids?
Posted Sep 4, 2023 23:15 UTC (Mon) by mchapman (subscriber, #66589) [Link]
Between your "cat" and "kill" commands, the process could have exited, been reaped by its parent, and another process could have been forked with PID 41956. By the time you run kill, that PID may not be the same process you thought it was.
Simply holding a reference to the (old) /proc/$PID directory does not prevent the PID from being reused.
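By contrast, a signal sent through a pidfd cannot reach a later owner of the PID. The sketch below illustrates this with the raw `pidfd_open` and `pidfd_send_signal` system calls; `stale_pidfd_demo()` and the `sys_*` wrappers are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Raw system calls, so the sketch does not need glibc's wrappers. */
static int sys_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

static int sys_pidfd_send_signal(int pidfd, int sig)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}

/*
 * Hypothetical demo: returns 0 if a pidfd delivers a signal to a live
 * child and then refuses with ESRCH once that child has been reaped.
 */
static int stale_pidfd_demo(void)
{
    int pidfd, status, ret = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause(); /* child: sleep until killed */
    }

    pidfd = sys_pidfd_open(pid); /* safe: we are the parent and have not reaped */
    if (pidfd < 0) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    if (sys_pidfd_send_signal(pidfd, SIGKILL) != 0)
        kill(pid, SIGKILL); /* fall back so the child never lingers */
    if (waitpid(pid, &status, 0) == pid && WIFSIGNALED(status)) {
        /* The PID may be reused from here on; the pidfd still names the dead child. */
        errno = 0;
        if (sys_pidfd_send_signal(pidfd, SIGKILL) == -1 && errno == ESRCH)
            ret = 0;
    }
    close(pidfd);
    return ret;
}
```

The second `pidfd_send_signal()` call fails rather than hitting whatever process holds the number by then, which is the property a held /proc/$PID directory does not give to kill().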
Why not fix process ids?
Posted Sep 14, 2023 14:50 UTC (Thu) by ksandstr (guest, #60862) [Link]
Another substitute solution to pidfds would make process IDs a capability of sorts, such that they're created by fork/spawn, transferred to other processes by unspecified means[0], and invalidated at wait() so they subsequently raise an error upon use. This would ensure that stale PIDs, being those that refer to a since-deceased process, don't end up referring to a different process. However the cost of doing this is a slight API break because kill() etc. would raise "unknown PID" while that PID might actually have come to exist again. Also the question of validating such a capability from e.g. command line parameters will need an answer.
Considering that any use of a PID is an instant TOCTOU hazard to any but the parent process (because it's the only one that can call wait() on that PID), the idea of "just fix the call sites" can be recognized as unworkable in a great many cases. Analogously to the capability idea above, pidfds provide a process-local identifier in the file descriptor[1] and a means to communicate process termination at time of use. And their cost isn't even an ABI break -- just that the old API will be creaky, the new API will be both nonportable and so extensive as to cover every POSIX call that takes a pid_t, and any pidfd_getpid() band-aid call will be another instant TOCTOU hazard (unless). Out of these approaches, pidfds certainly seem like an attractive solution since they mainly require lots of footwork and the creation of a "pre-horizon" category of vulnerable programs that process PIDs in any way.
[-1] this one would be soluble by invalidating wrapped PIDs in processes whose lifecycle intersects the wraparound point, another mild API break and perhaps the bane of init(8) in an interstellar probe or something.
[0] perhaps a general two-stage mechanism to validate a PID and then confirm its correct identity (using e.g. the program's fsid/inode# pair), or an unix domain socket faff not unlike fd transfer.
[1] though transferring these to another process would seem to require a unix domain socket between the two.
[2] there is no 4th footnote; I'm just using this space to point out that I'm currently unemployed but capable of spitting out this kind of off-hand analysis, and a suitably impressed reader's employer could almost certainly use a mad lad like me. *wink* *wink*
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 7:02 UTC (Sat) by ibukanov (subscriber, #3942) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 10:14 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 10:57 UTC (Sat) by darmengod (subscriber, #130659) [Link]
But when it sends the pidfd to some other process, the receiver has no way to get the PID (number) without /proc being accessible, and that is undesirable.
Couldn't this be resolved by convention at the application layer? So the original parent process doesn't just send an empty message with SCM_PIDFD attached, but includes the PID as a number in the regular message payload?
Race-free process creation in the GNU C Library
Posted Sep 2, 2023 11:09 UTC (Sat) by bluca (subscriber, #118303) [Link]
Race-free process creation in the GNU C Library
Posted Sep 4, 2023 0:46 UTC (Mon) by njs (subscriber, #40338) [Link]
As a few other comments noted, it's already possible for parents to avoid the race condition when spawning a child and getting a pidfd to it, because pids aren't recycled when the child dies – they're recycled after the parent affirmatively calls one of the wait variants to reap the exit status. So "just" call pidfd_open before you call wait, problem solved.
But this is still useful for a few reasons:
- "make sure nothing in your program calls wait, or else a very obscure issue could happen one time in a million" is certainly an invariant you *can* enforce, but it sure is easier and less error prone if you don't have to.
- if you're writing a reusable library, you don't know what other code will be running in the same process as you. You might prefer to be robust against being used by poorly implemented callers, that do things like call wait on everything.
- if you're writing a highly backwards compatible library you *can't* add undocumented, observable side effects to your operations, even if the only code that would notice is arguably broken. Corollary: right now these libraries cannot move existing functionality out into helper processes, even if this would be eg better for security, and even if the user-visible API stays exactly the same. If mylib_do_foo() starts secretly spawning a child, that fact can't be encapsulated, because it will leak out into the process-global child monitoring APIs like SIGCHLD and wait.
But, there's an even more obscure Linux feature that can solve this: if you pass exit_signal=0 to clone, then the child process is hidden from not just SIGCHLD but also wait (!). Technically I think this is orthogonal to the pidfd stuff, but they're very convenient to use together, so it makes sense that a new interface exposing exit_signal=0 would also return a pidfd instead of a pid.
... Unfortunately the proposed patch only adds support for this to fork(), not to posix_spawn(), and the fork() support is controversial in general. But hopefully it'll get revised so we end up with exit_signal=0 *and* pidfd support in posix_spawn.
(It would also be nice if you could arrange that a child with exit_signal=0 and CLONE_PIDFD would be automatically orphaned when the pidfd was closed, since regular SIGCHLD reaping won't work on it. But that would be a whole other kernel patch.)
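The "call pidfd_open before you call wait" approach from the start of this comment can be sketched roughly as follows. This is only an illustration, not code from the patch set: the helper names spawn_with_pidfd() and reap_via_pidfd() are made up, the raw syscall() is used on the assumption that the installed glibc may lack a pidfd_open() wrapper, and it needs a kernel with pidfd_open (5.3) and waitid(P_PIDFD) (5.4). It is only safe as long as nothing else in the process reaps the child first, which is exactly the invariant the comment calls hard to enforce.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* value from the kernel UAPI, for older libc headers */
#endif

/* Hypothetical helper: fork a child that exits with `code` and return a
 * pidfd for it, or -1 on failure. */
static int spawn_with_pidfd(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* The child may already be a zombie here, but its PID cannot be
     * recycled until this process reaps it, so the number is still ours. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0); /* do not leave a zombie behind */
        return -1;
    }
    return pidfd;
}

/* Hypothetical helper: reap the child through the pidfd, close the pidfd,
 * and return the exit status (or -1 on error or abnormal termination). */
static int reap_via_pidfd(int pidfd)
{
    siginfo_t info = {0};
    int rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);

    close(pidfd);
    if (rc < 0 || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;
}
```

A caller would keep the pidfd around instead of the PID, and only ever wait on that descriptor rather than on "any child".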
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 3:02 UTC (Tue) by wtarreau (subscriber, #51152) [Link]
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 3:15 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
It also does not help at all if the process in question is not your child.
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 9:11 UTC (Tue) by wtarreau (subscriber, #51152) [Link]
sigchld_handler()
{
    lock(pidlock);
    pid = wait(NULL);
    reap_child(pid);
    unlock(pidlock);
}

signal_child(int child, int sig)
{
    lock(pidlock);
    pid = get_pid_from_child(child);
    if (pid)
        kill(pid, sig);
    unlock(pidlock);
}
I'm sorry but I continue to think the problem is mostly made up.
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 16:01 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
Can you find a single library that starts subprocesses, that has hooks for these kinds of locks? That's what I mean by "not composable".
Also, having to do such lock dances is an indication of a bad API in itself.
Race-free process creation in the GNU C Library
Posted Sep 6, 2023 14:02 UTC (Wed) by wtarreau (subscriber, #51152) [Link]
I don't know, since I don't know what such libs currently do. But it would seem like the correct thing to do if they claim to be thread-compatible.
> Also, having to do such lock dances is an indication of a bad API in itself.
If necessary it could be wrapped into a simpler API. But the locks are precisely due to a race which is inherent to process reaping/signaling that can be happening in parallel and that one needs to serialize. I don't see why one must suddenly start to make an exception for this specific case and say "let's pretend there is no race here so that we can save one lock" nor "let's assume programmers creating threads don't understand the limits of threads". I would, however, clearly welcome an in-libc pair of wrappers that just adds these locks around wait() and kill() such as locked_wait() and locked_kill() to be more friendly to the user and to lib developers. But my feeling is that if it's just for this, it's becoming overkill, and the fact that it started a discussion seems to indicate others have the same feeling.
Race-free process creation in the GNU C Library
Posted Sep 6, 2023 15:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
No. You need to wrap _user_ code in locks. It's not just functions themselves. I.e.:
You have to write ALL process-related code in this manner:
1. get_lock
2. pid = create_process()
3. verify_pid_is_correct(pid)
4. kill(pid, 9)
5. release_lock()
This can't be wrapped into simple locked functions, unless you want to have a closure-based API. And even then you'll have all the locking-related issues, like deadlocks.
In short, the classic process API is inherently broken in the presence of threads. It can't be sanely fixed.
Race-free process creation in the GNU C Library
Posted Sep 6, 2023 20:53 UTC (Wed) by wtarreau (subscriber, #51152) [Link]
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 0:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
I consider having to do locks in sometimes inconvenient places to block "spooky action at a distance" the very definition of brokenness.
Also, there's still a case where you might need to do operations (e.g. send signals) to processes that are not your children.
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 4:00 UTC (Thu) by wtarreau (subscriber, #51152) [Link]
If so, absolutely everything involving threads or communication with other processes is broken. I'm sorry but I disagree with this definition.
> Also, there's still a case where you might need to do operations (e.g. send signals) to processes that are not your children.
Yes, and this has always shown a moderate reliability only. That's the classical "ps auxw" then "kill $pid". I don't see what in the proposed API could improve this situation at all.
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 4:05 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Imagine that memory allocations worked the same way. Instead of getting a pointer, you get a "zone ID" with the same semantics as PIDs.
Sorry, but plenty of interfaces are well-designed and work just fine with threads. In libc: memory allocations, file operations, IPC primitives, etc.
To be fair, libc also has plenty of other broken interfaces: the notion of the current directory, non-reentrant functions, the whole mess with locales and timezones.
> I don't see what in the proposed API could improve this situation at all.
Uhm... You get a pidfd and you can use it to make sure that the PID won't be reused while at least one pidfd descriptor is open. This makes it possible to do race-free process manipulation.
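As a rough sketch of what race-free manipulation through a pidfd looks like in practice (an illustration only; pidfd_kill() and demo_pidfd_signal() are made-up names, and raw syscall() is used on the assumption that libc wrappers may be missing): the descriptor stays bound to the one process it was opened for, so a signal sent through it either reaches that process or fails with ESRCH once the process has been reaped. It never lands on whatever later inherits the number.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3 /* value from the kernel UAPI, for older libc headers */
#endif

/* Hypothetical wrapper around the pidfd_send_signal() system call. */
static int pidfd_kill(int pidfd, int sig)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}

/* Returns 0 if the pidfd behaved as described above, -1 otherwise. */
static int demo_pidfd_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause(); /* child just waits to be killed */
    }

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    int ok = 0;
    siginfo_t info = {0};

    if (pidfd_kill(pidfd, SIGKILL) != 0) {
        kill(pid, SIGKILL); /* fallback so the wait below cannot hang */
        ok = -1;
    }
    if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) != 0)
        ok = -1;
    else if (info.si_code != CLD_KILLED || info.si_status != SIGKILL)
        ok = -1;

    /* The child has been reaped; the still-open pidfd now refers to a
     * dead process and signalling through it fails with ESRCH. */
    if (pidfd_kill(pidfd, SIGKILL) != -1 || errno != ESRCH)
        ok = -1;

    close(pidfd);
    return ok;
}
```

Compare with kill(pid, sig) on a stored pid_t, where the same late call could succeed against an unrelated process.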
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 18:26 UTC (Thu) by wtarreau (subscriber, #51152) [Link]
Actually you gave a pretty good example, because memory allocations *do* work the same way. If one thread tries to access a memory location while another one is freeing it and without coordinating together, you'll pretty quickly see either a use-after-free bug or a basic segfault.
> You get a pidfd and you can use it to make sure that the PID won't be reused while at least one pidfd descriptor is open. This makes it possible to do race-free process manipulation.
I'm just seeing it as convenience at the expense of extra FDs, which may in some cases result in new classes of bugs such as leaks if some FDs are passed by accident or just lost without being closed or stuck into a UNIX socket but closed so that nobody sees it, and even possibly vulnerabilities later if accessing such an FD is possible and is sufficient to send a signal over it despite the processes not being supposed to be able to interact.
Don't get me wrong, I'm not saying it's bad, we all love when some APIs are made easier to use or open new possibilities. It's just that I don't feel like this was that difficult to use correctly and that the small extra efforts probably did not warrant the possible classes of issues that will inevitably come with it. Time will tell.
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 18:39 UTC (Thu) by bluca (subscriber, #118303) [Link]
Then you still haven't quite grasped what the actual problems being solved here are, and it might be time to go look at the sources linked in the article before further commenting, especially the cover letters and the linked bugzillas
Race-free process creation in the GNU C Library
Posted Sep 7, 2023 19:31 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Now imagine that allocation can be freed at any moment.
> I'm just seeing it as convenience at the expense of extra FDs,
FDs are not a scarce resource.
> which may in some cases result in new classes of bugs such as leaks if some FDs are passed by accident
How is that different from any other FDs?
> or just lost without being closed
Don't lose resources.
> or stuck into a UNIX socket but closed so that nobody sees it
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 9:46 UTC (Tue) by bluca (subscriber, #118303) [Link]
No, it is not, this is about tracking any process from any other process in a race-free manner end-to-end, as dbus/polkit/systemd clients/etc need to do.
Race-free process creation in the GNU C Library
Posted Sep 5, 2023 9:10 UTC (Tue) by tlamp (subscriber, #108540) [Link]
The argument names for file/path seem to be switched for pidfd_spawn and pidfd_spawnp?
As for posix_spawn, the parameter name path is used (i.e., relative or absolute), and for posix_spawnp the parameter name file is used (i.e., a filename that is looked up through the PATH environment variable).
See https://manpages.debian.org/bookworm/manpages-dev/posix_spawn.3.en.html.
So shouldn't the signatures look like:
int pidfd_spawn(int *restrict pidfd, const char *restrict path,
                const posix_spawn_file_actions_t *restrict facts,
                const posix_spawnattr_t *restrict attrp,
                char *const argv[restrict], char *const envp[restrict]);
int pidfd_spawnp(int *restrict pidfd, const char *restrict file,
                 const posix_spawn_file_actions_t *restrict facts,
                 const posix_spawnattr_t *restrict attrp,
                 char *const argv[restrict_arr], char *const envp[restrict_arr]);
Race-free process creation in the GNU C Library
Posted Sep 14, 2023 19:03 UTC (Thu) by the8472 (guest, #144969) [Link]
Open a unix socket pair, fork, pidfd_open in the child, send the fd to the parent, do other process setup stuff, exec. We already had a communication channel to the parent anyway for error handling, previously it was a pipe, so it wasn't a big change.
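A minimal sketch of that sequence, assuming Linux with pidfd_open() available (5.3 or later). The helper names send_fd(), recv_fd() and fork_with_pidfd() are hypothetical, and the exec step is replaced by a plain _exit() so the example stays self-contained; the fd is passed with the usual SCM_RIGHTS control message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Control-message buffer, aligned as recommended in cmsg(3). */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one file descriptor over a unix socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor from a unix socket, or return -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Fork a child that opens a pidfd for itself and sends it to the parent.
 * Returns the pidfd in the parent, or -1 on failure. */
static int fork_with_pidfd(int code)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        int self = (int)syscall(SYS_pidfd_open, getpid(), 0);
        if (self < 0 || send_fd(sv[1], self) < 0)
            _exit(127);
        /* Real code would do further setup and exec here. */
        _exit(code);
    }

    close(sv[1]);
    int pidfd = recv_fd(sv[0]);
    close(sv[0]);
    if (pidfd < 0)
        waitpid(pid, NULL, 0); /* child failed early; reap it */
    return pidfd;
}
```

Because the child opens the pidfd on itself, there is no window in which the number could refer to anything else, regardless of what the rest of the parent process does with wait().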