Race-free process creation in the GNU C Library

The pidfd API has been added to the kernel over the last several years to provide a race-free way for processes to refer to each other. While the GNU C Library (glibc) gained basic pidfd support with the 2.36 release in 2022, it still lacks a complete solution for race-free process creation. This patch set from Adhemerval Zanella seems likely to fill that gap in the near future, though, with an extension to the posix_spawn() API.

Unix systems refer to processes via an integer ID (the "process ID" or PID) that is assigned at creation time. The problem with PIDs is that they are reused over time; once a process with a given PID has exited and been reaped, that PID can be assigned to a new and unrelated process with the result that any given PID might not, in fact, refer to the process that the user thinks it does. To address this problem, the pidfd concept was introduced; a pidfd is a file descriptor that acts as a handle for a process. The process associated with a pidfd can never change, so many of the race conditions associated with PIDs do not exist with pidfds.

Current glibc releases include wrappers for a number of the low-level pidfd-related system calls, including pidfd_open(), pidfd_getfd(), and others. There is one piece missing, though: the ability to obtain a pidfd for a new process as that process is created. It is possible to use pidfd_open() to get a pidfd from a PID immediately after creation, but that still leaves a narrow window during which the process identified by a PID could exit and be replaced by another. Closing that window requires obtaining a pidfd from the kernel as a result of creating a new process, and glibc provides no way to do that.
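The two-step approach described above can be sketched as follows. This is only an illustration: spawn_then_open_pidfd() and raw_pidfd_open() are hypothetical helper names, and the raw system call is used so that the sketch does not depend on which glibc release provides the pidfd_open() wrapper.

```c
#define _GNU_SOURCE
#include <spawn.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Raw system call; glibc 2.36 and later also provide a pidfd_open() wrapper. */
static int raw_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/*
 * Hypothetical helper: spawn `path` and return a pidfd for the child,
 * or -1 on failure. The child's PID is stored in *pid_out.
 */
static int spawn_then_open_pidfd(const char *path, char *const argv[],
                                 pid_t *pid_out)
{
    /* Step 1: create the child; only a PID comes back. */
    if (posix_spawn(pid_out, path, NULL, NULL, argv, environ) != 0)
        return -1;

    /*
     * Step 2: a separate call turns the PID into a pidfd. Between the two
     * steps the PID is just a number; if the child has already been reaped
     * (by another thread calling wait(), say, or because SIGCHLD is
     * ignored), that number may name a different process by now.
     */
    return raw_pidfd_open(*pid_out, 0);
}
```

Nothing in this sequence ties the second step to the first; that missing link is what a creation-time pidfd is meant to supply.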

That functionality could be provided by adding a wrapper for the clone3() system call, but there is some resistance to doing that. Instead, Zanella has taken the approach of enhancing the posix_spawn() API, which is seen by many as being a better approach to process creation (when immediately followed by an exec() call) than the Unix fork() model. The result is two new functions:

    int pidfd_spawn(int *restrict pidfd,
                    const char *restrict file,
                    const posix_spawn_file_actions_t *restrict facts,
                    const posix_spawnattr_t *restrict attrp,
                    char *const argv[restrict],
                    char *const envp[restrict]);

    int pidfd_spawnp(int *restrict pidfd,
                     const char *restrict path,
                     const posix_spawn_file_actions_t *restrict facts,
                     const posix_spawnattr_t *restrict attrp,
                     char *const argv[restrict],
                     char *const envp[restrict]);

Just like posix_spawn() and posix_spawnp(), these functions execute a combination of clone() and exec() to create a new process running the program indicated by file or path. The value stored through the first argument, though, will be a pidfd identifying the created process rather than a PID.

If the creator needs to know the new process's PID, that can be obtained by a new function added by the patch set:

    pid_t pidfd_getpid(int pidfd);

This function obtains the PID by looking at the /proc entry for the given pidfd.
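A rough sketch of what such a lookup might involve is shown here. It assumes that the relevant /proc entry is /proc/self/fdinfo/<fd> and that the kernel reports the PID there on a line beginning with "Pid:"; pid_from_pidfd() is a hypothetical name, not the function from the patch set.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed pidfd_getpid(). Returns the PID
 * found in the "Pid:" line of the descriptor's fdinfo file, or -1 if the
 * file cannot be opened or has no such line.
 */
static pid_t pid_from_pidfd(int pidfd)
{
    char path[64];
    char line[256];
    long pid = -1;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;          /* for example, /proc is not mounted */

    while (fgets(line, sizeof(line), f) != NULL) {
        /* Only a line that starts with "Pid:" matches this format. */
        if (sscanf(line, "Pid: %ld", &pid) == 1)
            break;
    }
    fclose(f);
    return (pid_t)pid;
}
```

Note that the file being read belongs to the calling process, not to the process the pidfd refers to.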

The new functions are implemented with clone3() to obtain the pidfd during process creation, without a race window. Using clone3() makes some other things possible as well, specifically creating the new process in a different control group than the creator's. Zanella has made this capability available as well, via an extension to the posix_spawn() attribute mechanism. Creating into a different control group is available for posix_spawn() as well as pidfd_spawn().
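For illustration, here is a minimal sketch of how a raw clone3() call can hand back a pidfd at creation time. It is not the glibc implementation: fork_with_pidfd() is a hypothetical helper that behaves like fork(), it assumes kernel headers recent enough to define struct clone_args, and it is only suitable for a single-threaded caller whose child quickly calls exec or _exit().

```c
#define _GNU_SOURCE
#include <linux/sched.h>    /* struct clone_args, CLONE_PIDFD */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3           /* value used by the kernel's waitid() */
#endif

/*
 * Hypothetical helper: fork-like process creation that also yields a pidfd.
 * Returns 0 in the child, the child's PID in the parent, and -1 on error.
 * *pidfd_out is set in the parent only.
 */
static pid_t fork_with_pidfd(int *pidfd_out)
{
    struct clone_args args;
    int pidfd = -1;

    memset(&args, 0, sizeof(args));
    args.flags = CLONE_PIDFD;                   /* ask the kernel for a pidfd */
    args.pidfd = (uint64_t)(uintptr_t)&pidfd;   /* where the kernel stores it */
    args.exit_signal = SIGCHLD;                 /* notify the parent as fork() does */

    long ret = syscall(SYS_clone3, &args, sizeof(args));
    if (ret > 0)
        *pidfd_out = pidfd;
    return (pid_t)ret;
}
```

The parent receives the PID and the pidfd from the same system call, so there is no interval in which it holds only a bare PID. It can then wait on the descriptor with waitid(P_PIDFD, pidfd, ...).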

While posix_spawn() is seen by many as a better model for the combination of fork() and exec(), it does not provide all of the functionality that is available. For cases where this API is not sufficient, earlier versions of the patch set included a function called fork_np() as a separate wrapper around clone3() that would return a pidfd identifying the new child process. Florian Weimer complained that this interface differs too much from what the kernel provides, though, and is "not future-proof at all". He asked Zanella to leave this function out of the series for now, and it has been duly removed from later versions of the series.

Rich Felker, instead, objected to the concept in general, claiming that any PID-related races are "purely programmer error" and that "making a new, complex, highly nonstandard interface to work around a problem that's programmer error, and getting this nonstandard and nonportable pattern into mainstream software, has negative value". It would be better, he said, to fix the software affected by this problem. Luca Boccassi disagreed, though, saying that "these are real race conditions, that cannot be solved otherwise". Weimer also said that there was value in introducing the pidfd functionality.

While there has been no definitive resolution to this particular disagreement, the fact remains that PID races can be a problem, and there are users (such as systemd) that would like to have this type of API to avoid those races. It thus seems reasonably likely that pidfd_spawn() (though perhaps not fork_np()) will eventually find its way into glibc.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 14:54 UTC (Fri) by bluca (subscriber, #118303) [Link]

Very much looking forward to having this available; the combination of race-free pidfd plus race-free spawn in the target cgroup is sorely needed. With this, and the recent SCM_PIDFD/SO_PEERPIDFD addition to kernel v6.5, we are moving towards using pid fds end-to-end across systemd, dbus (-broker/-daemon) and polkit, and any of their clients.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 15:56 UTC (Fri) by mb (subscriber, #50428) [Link]

> This function obtains the PID by looking at the /proc entry for the given pidfd.

Oh no, please not yet another fundamental thing that depends on /proc being mounted.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 18:15 UTC (Fri) by dwest (subscriber, #110523) [Link]

Could you explain the objection to proc? I haven't heard any other complaints about it so I'm curious about whether this is some larger complaint that I've managed to miss entirely...

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:11 UTC (Fri) by mb (subscriber, #50428) [Link]

Well, the problem is that proc is not always available. e.g. in chroots or containers.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:21 UTC (Fri) by bluca (subscriber, #118303) [Link]

containers really should have it, and chroots - I can't imagine services tracking processes such as dbus or polkit or systemd would be running in a chroot? The way polkit/dbus do it now relies on parsing /proc anyway, so it wouldn't make much of a difference in that regard

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:27 UTC (Fri) by mb (subscriber, #50428) [Link]

>containers really should have it

One additional nail into the coffin of unprivileged containers?

>The way polkit/dbus

I'm talking about the fundamental pidfd API. Any process could use pidfds.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:35 UTC (Fri) by bluca (subscriber, #118303) [Link]

> One additional nail into the coffin of unprivileged containers?

I'm pretty sure those can have /proc too?

$ id -u
1000
$ unshare -U -m --mount-proc -p -f
$ mount | grep img
proc on /tmp/img type proc (rw,nosuid,nodev,noexec,relatime)

> I'm talking about the fundamental pidfd API. Any process could use pidfds.

Sure, to do process tracking - what kind of process would you need to track in a chroot? Besides, it's all moot, this is not glibc's fault, the kernel provides this interface, so that's what glibc can use to provide an abstraction

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:36 UTC (Fri) by bluca (subscriber, #118303) [Link]

(copy-pasta, that should have been --mount-proc=/tmp/img - give us an edit button already!)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

> what kind of process would you need to track in a chroot

Any process that wants to spawn a process and use pidfd, but also write the pid in a log file or debug trace? Ignoring portability for a second, it could even be something like make or cargo.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:19 UTC (Fri) by bluca (subscriber, #118303) [Link]

That requires procfs to do today, no? So there shouldn't be a regression in that regard?

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:30 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

It doesn't require procfs if it uses the (inferior) pid-based API and SIGCHLD. So it's a regression if this hypothetical program wants to switch to pidfd. An ioctl does seem to be a good idea; it can return ESRCH in case of a race.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:23 UTC (Fri) by bluca (subscriber, #118303) [Link]

Ok - sounds like those use cases need to make a choice: continue to use pid-based APIs and no procfs, or switch to pidfds and mount procfs with hidepid= to sandbox it

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:46 UTC (Fri) by josh (subscriber, #17465) [Link]

Or bypass glibc and call the nice race-free function the kernel provides, and continue advocating that glibc provide clone3.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 0:43 UTC (Sat) by bluca (subscriber, #118303) [Link]

The kernel doesn't provide functions to resolve pidfds

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:08 UTC (Sat) by josh (subscriber, #17465) [Link]

Given access to clone3, you can directly obtain a pidfd and a pid simultaneously when you first create the process, rather than retrieving the pid later.

(That operation would still be useful when passed a pidfd from elsewhere, but not *necessary* for the common case where you got the pidfd by creating a process.)

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:37 UTC (Sat) by bluca (subscriber, #118303) [Link]

The case when you want to resolve a pidfd received via SO_PEERPIDFD/SCM_PIDFD is exactly where you need that, and what is enabled by all these new APIs that have recently been added, and where this resolving glibc function comes in. I know because I had to reimplement it across 4 projects...

Race-free process creation in the GNU C Library

Posted Sep 3, 2023 4:14 UTC (Sun) by IanKelling (subscriber, #89418) [Link]

> So it's a regression if this hypothetical program wants to switch to pidfd.

I don't think it is hypothetical. From my sysadmin perspective, I often build software in a chroot without a /proc mount. Very rarely, the build has needed it and I wanted to know why. Bind mounting /proc, I see find shows 546,160 user-listable files and 304,803 user-readable files. Making that a requirement to create processes just because of opting in to an API that avoids a race condition would be, roughly, a regression in my book.

Race-free process creation in the GNU C Library

Posted Sep 3, 2023 10:26 UTC (Sun) by bluca (subscriber, #118303) [Link]

Why would compiling some stuff require resolving pidfds?

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 9:16 UTC (Mon) by taladar (subscriber, #68407) [Link]

Why wouldn't it? Compiling spawns lots of processes and that kind of thing usually involves printing the PID when logging what you are doing to be able to distinguish between different instances of the same program (e.g. the compiler when spawned by some sort of build tool).

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 9:53 UTC (Mon) by bluca (subscriber, #118303) [Link]

Then the tools that spawn such processes, if they want to implement tracking by pidfd, will need to implement appropriate fallbacks (which are easy to add as the error codes are different). They'll need that anyway for compatibility with older kernels. So still not sure where the regression would be?

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:07 UTC (Fri) by geofft (subscriber, #59789) [Link]

There's a practical problem that a Kubernetes container that is not marked "privileged" (which is a Kubernetes concept, rather different from the ordinary meaning of "privileged" as in "runs as root") gets certain things in /proc overmounted, e.g., /proc/sysrq-trigger and /proc/kcore, as a form of sandboxing. The goal is to reduce the impact of a malicious uid 0 inside a container. (User namespacing would also work, but most Kubernetes deployments don't use it yet - it's an alpha feature on k8s' end and only supports one container runtime.) This is, in isolation, an understandable / defensible feature, and I can see systems other than Kubernetes doing it (e.g., I can totally see it being a systemd Restrict option down the line).

Meanwhile, the kernel has a feature where, if your current /proc is in any way overmounted, you're not allowed to mount a new /proc - because that would give you access to the files that are supposed to be hidden to you. This is also, in isolation, an understandable / defensible feature.

The intersection of these features is that you can't correctly mount /proc inside a nested container or container-like thing inside a non-privileged Kubernetes container. If you make a new pidns (either because you're root or via a new userns, as in your example), all the paths in /proc are wrong because they refer to outer PIDs.

(The intersection of these features also ceases to be really defensible in the case where you don't allow your Kubernetes workloads to run as uid 0, which is a really good idea on its own.)

There have been some patches for a second procfs (whose exact name I'm forgetting) that provides /proc/$pid/ and the /proc/self/ symlink but not anything else in /proc, but I don't think they've been merged. If those could get merged and guaranteed mountable by anyone with CAP_SYS_ADMIN in their current namespace, regardless of what the existing /proc outside it looks like or even whether it exists, that would satisfactorily address the issue.

I suppose another option would be for /proc to always enumerate the calling process's PID namespace, but maybe that gets weird with open file descriptors passed between PID namespaces.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:28 UTC (Fri) by bluca (subscriber, #118303) [Link]

Isn't that what the hidepid= mount options (and systemd's ProtectProc=) do? To resolve pidfds you just need /proc/self/fd/ and /proc/self/fdinfo, which are both available under those sandboxing options.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:56 UTC (Sat) by cyphar (subscriber, #110703) [Link]

subset=pids has no effect on the mount_too_revealing() check because all of the "are the flags the same" checks are based on the generic VFS flags not FS-specific ones. So if you only have an overmounted procfs you cannot mount subset=pids even if the overmounts are paths that don't exist with subset=pids.

In fact this also means you can bypass the check entirely -- if you have a "safe" subset=pids mount in your namespace, the kernel will allow you to mount an unmasked (fully-fledged) procfs.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:24 UTC (Fri) by wahern (subscriber, #37304) [Link]

procfs has historically been a reliable vector for exploits, both for the kernel and applications. Principle of least privilege says that if you don't need procfs, don't mount procfs. But if libc relies on procfs for basic features, you necessarily have to expose many other parts (if not necessarily all) of procfs even beyond those strictly needed for those features.

Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit? Now rather than getting EMFILE, you get unexpected errors from syscall wrappers. And to avoid descriptor leaks, libc has to go through herculean efforts to make the syscall wrapper async- and thread-safe, and those efforts are definitely not always bug-free; or alternatively, now there's another threading/fork foot gun lying around.

None of these issues may be of concern to *you*, but they're of concern to other people, and have been for decades. Moreover, PID fds is an interface which people concerned about reliability, correctness, and security have been desiring for a long time; PID fd usability being tied to procfs substantially reduces the net value. Not all process management can be shoe-horned into systemd and other global services; far from it. Process management is often something one needs to perform *after* dropping various privileges. That not all privilege separating or privilege reducing tasks can be performed immediately before or after exec, or cannot be reduced to one-line configuration directives, is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:48 UTC (Fri) by bluca (subscriber, #118303) [Link]

> procfs has historically been a reliable vector for exploits, both for the kernel and applications. Principle of least privilege says that if you don't need procfs, don't mount procfs. But if libc relies on procfs for basic features, you necessarily have to expose many other parts (if not necessarily all) of procfs even beyond those strictly needed for those features.

Nah, procfs supports various sandboxing features nowadays, and especially when unprivileged it necessarily implies a pid namespace so you do not have visibility into the rest of the system, only on processes in your pid namespace, and if it's a chroot that's going to be just the shell. If you are privileged, you can use the ProtectProc= systemd option (or if you are running on the 0.000x% of Linux installs, mount /proc with the various hidepid= options that provide equivalent functionality)

> Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit?

The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

> PID fd usability being tied to procfs substantially reduces the net value.

Considering they've been available as-is for 4 years and nobody bothered to do anything about that, and have been providing great net value in the meanwhile, I'll have to take that with a grain of salt.

> Process management is often something ones needs to perform *after* dropping various privileges.

Not sure what that has to do with using procfs?

> is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.

I mean, if you dislike modern Linux so much and prefer OpenBSD, then just use OpenBSD? That's an absolutely fine thing to do.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:34 UTC (Fri) by bbockelm (subscriber, #71069) [Link]

> The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

Oh, the youthful banter of someone who hasn't spent a few hours this week debugging issues caused by file descriptor exhaustion!

(In this case, it was due to a hypervisor that booted a VM with trivial amounts of memory, the VM kernel adjusted system-wide file descriptor limits down accordingly, then the hypervisor would hotplug another 32GB of RAM later...)

For what it's worth, I agree this _should_ have been a problem relegated to history. I want to live in the future!

Race-free process creation in the GNU C Library

Posted Sep 6, 2023 8:39 UTC (Wed) by lathiat (subscriber, #18567) [Link]

I debugged the exact same issue, with Proxmox having "Memory Ballooning" enabled. Despite having a "minimum memory" of 16GB, it would boot with 1GB and plug the rest in later. This gave you a very low maximum number of processes on the system, giving "fork: retry: Resource temporarily unavailable" errors inside a Docker container.

I found the following very low default:

# systemctl show --property=DefaultTasksMax
DefaultTasksMax=981

Which you also see in cgroupfs:
find /sys/fs/cgroup -name pids.max -exec grep -H . {} \;

The systemd docs state this is set based on threads-max "Configure the default value for the per-unit TasksMax= setting. See systemd.resource-control(5) for details. This setting applies to all unit types that support resource control settings, with the exception of slice units. Defaults to 15% of the minimum of kernel.pid_max=, kernel.threads-max= and root cgroup pids.max. Kernel has a default value for kernel.pid_max= and an algorithm of counting in case of more than 32 cores. For example with the default kernel.pid_max=, DefaultTasksMax= defaults to 4915, but might be greater in other systems or smaller in OS containers."

We then find a very low /proc/sys/kernel/threads-max of 6541. According to the kernel docs "During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages."
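The arithmetic behind those numbers can be checked with a small sketch. The function name is hypothetical, and rounding down to a whole number is an assumption; the systemd documentation quoted above only says "15% of the minimum".

```c
/*
 * Hypothetical helper: 15% of the smallest of the three limits named in the
 * systemd documentation. Truncating to an integer is an assumption.
 * A root cgroup pids.max of "max" can be represented by any large value.
 */
static unsigned long default_tasks_max(unsigned long pid_max,
                                       unsigned long threads_max,
                                       unsigned long root_pids_max)
{
    unsigned long smallest = pid_max;

    if (threads_max < smallest)
        smallest = threads_max;
    if (root_pids_max < smallest)
        smallest = root_pids_max;

    return smallest * 15 / 100;
}
```

With threads-max at 6541 as the smallest input, this gives 981, matching the DefaultTasksMax value shown earlier; with the default kernel.pid_max of 32768 as the smallest input it gives 4915, the figure from the documentation.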

Despite being a pretty experienced Linux performance engineer, it took me a bit to find that one, as it only showed up in the cgroup limits and not in /proc/PID/limits.

Good times :)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:56 UTC (Fri) by dezgeg (subscriber, #92243) [Link]

> The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

Is it really common to have no ulimit for them? A 1024-fd limit has been very typical in what I've seen (since the default FD_SETSIZE is that, so most programs that use select() will break on high fds)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:29 UTC (Fri) by bluca (subscriber, #118303) [Link]

That's the default soft limit, yes, but for many years the default hard limit has been the highest the kernel can give out. So a process defaults to 1024 to avoid breaking the legacy select() interfaces, but can raise the soft limit at will in case it doesn't use those interfaces, which I'd hope is most things these days.
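Raising the soft limit as described can be sketched with the standard getrlimit()/setrlimit() calls. The helper name is hypothetical, and a program should only do this if nothing in it relies on select().

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: raise the soft RLIMIT_NOFILE limit to the current
 * hard limit. Returns 0 on success, -1 on failure.
 */
static int raise_nofile_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```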

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 20:18 UTC (Mon) by comex (subscriber, #71521) [Link]

However, a library probably shouldn’t assume that the program it’s linked into isn’t using select(), or try to raise the soft limit itself.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:50 UTC (Fri) by josh (subscriber, #17465) [Link]

> Considering they've been available as-is for 4 years and nobody bothered to do anything about that,

People have been bothering to do something about that, and it has taken this long to get something on a potential path to acceptance.

It's the fault of libc that we cannot simply call clone3 directly. It's the responsibility of libc to *stop hiding the underlying useful functionality* just because it thinks it knows better.

Race-free process creation in the GNU C Library

Posted Nov 14, 2023 23:57 UTC (Tue) by Rudd-O (guest, #61155) [Link]

Excellent arguments. Thank you. Yuck to requiring more procfs in basic libc stuff.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 22:31 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

Sandstorm deliberately does not mount /proc for sandboxing reasons.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 6:37 UTC (Tue) by fw (subscriber, #26023) [Link]

Somewhat recently, systemd-nspawn removed proc support from nested chroots: https://bugzilla.redhat.com/show_bug.cgi?id=2210335

So it's unfortunately not the case that proc is universally available or can be made so.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 18:28 UTC (Fri) by bluca (subscriber, #118303) [Link]

I'm not convinced using proc is a problem, but in any case, that's the only interface there is to resolve pidfds, there is no alternative

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:32 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Perhaps it should be added? It can be a very simple syscall. Or even an ioctl.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:36 UTC (Fri) by bluca (subscriber, #118303) [Link]

Thanks for volunteering! :-P

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:10 UTC (Sat) by josh (subscriber, #17465) [Link]

The alternative is to obtain the pidfd and pid simultaneously when calling clone3.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:38 UTC (Sat) by bluca (subscriber, #118303) [Link]

Doesn't work when you receive a pidfd from somewhere else

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 12:05 UTC (Tue) by hmh (subscriber, #3838) [Link]

Yes, a pidfd_getpid() *syscall* is clearly missing. It might not have been really needed before (modulo system libraries lacking appropriate (if non-portable) functionality to actually return you the pidfd *and* pid on clone and/or fork), but now that you can send pidfds around the system over sockets, a syscall is clearly very desirable, it seems to be the best way to solve the underlying problem.

While it looks at first glance that it would be "easy" to write one, that's for someone already used to working in that area of the kernel -- there are likely permission checks one needs to get perfectly right to not create a security mishap, namespace concerns, etc. Experience in the specific area of the kernel you're working with almost always helps a lot with the quality of the first public version of a patch, and with faster acceptance in mainline for non-controversial changes.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 12:29 UTC (Tue) by bluca (subscriber, #118303) [Link]

Sure, but that's really nothing to do with glibc and its developers/maintainers, it's something an experienced kernel developer would be in the best position to implement, as you noted. I mean the proposed glibc API could even transparently switch to such a syscall, if/when it becomes available in the future.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:48 UTC (Fri) by Karellen (subscriber, #67644) [Link]

I wonder if there would be any value in a system call that does the equivalent of open("/proc", O_PATH|O_DIRECTORY|O_CLOEXEC) and return an fd to the proc filesystem - even if /proc is mounted elsewhere or not at all? And similarly for /dev and /sys?

Then again, if admins wanted to limit access to those filesystems for a container, they'd need to implement some kind of seccomp-bpf/pledge style block, instead of just... not mounting those filesystems in the container.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:26 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Putting my SRE/sysadmin hat on for a moment: If an application wants to read from procfs, it should just try to open /proc. If that doesn't work, it's the sysadmin's problem, not the application's problem. It is far too late to "upgrade" all extant applications to call open_proc() (or whatever name you prefer) instead of open("/proc/...") directly, so introducing open_proc and then saying "use seccomp if you want to lock it down" is just giving sysadmins another knob we have to twiddle, for no discernable benefit. We'll still have to deal with open("/proc") compat. issues for old applications anyway (unless you propose that libc should somehow intercept those calls and redirect them to open_proc, which is IMHO insane).

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 2:03 UTC (Sat) by cyphar (subscriber, #110703) [Link]

This does kind of exist with fsopen(), but it requires privileges and because it is a mount you are at the whims of the mount_too_revealing() checks, which means it won't work in containers or in a namespace where there is no proc mount at all.

I have wondered whether it would be possible to allow fsopen("proc") to unprivileged processes but only for subset=pids -- this would solve many hacks needed in container runtimes to defend against certain attacks. Unfortunately, I suspect that even the new mount infrastructure is probably not going to be considered safe for unprivileged users to touch.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 9:06 UTC (Thu) by Jonno (subscriber, #49613) [Link]

> I have wondered whether it would be possible to allow fsopen("proc") to unprivileged processes but only for subset=pids

That would still let the unprivileged process learn of other processes on the system that it otherwise would be oblivious about.

But perhaps allowing something like `openat(pidfd, ".", O_DIRECTORY)` to get a fd equivalent to the /proc/<pid> directory except you can't ".." out of it would work.

Race-free process creation in the GNU C Library

Posted Sep 9, 2023 5:03 UTC (Sat) by cyphar (subscriber, #110703) [Link]

v1 pidfds kind of worked this way, my understanding is that there were a bunch of issues with creating handles to procfs mounts and thus only a few pidfd operations work with that style -- the new ones are all anonymous inodes (like most other fd interfaces).

It's a bit of a shame, because that could've been the nicest behaviour -- though the contents of quite a few procfs files depend on the pid namespace associated with the procfs in ways that will cause confusion when sending them between processes and I'm not sure there would be a nice solution for that.

Race-free process creation in the GNU C Library

Posted Sep 16, 2023 14:35 UTC (Sat) by Jonno (subscriber, #49613) [Link]

> v1 pidfds kind of worked this way
Not quite. The first version of fd references to a pid was by open("/proc/«pid»", O_DIRECTORY) [or open("/proc/self", O_DIRECTORY)], giving you a directory fd that was guaranteed to never refer to a newer process, even if the pid was reused (it would instead refer to an unlinked directory). The problem was that this (1) required a mounted procfs to work, and (2) could not be used for polling or waitid. The upshot was that, being a directory fd, you could use it to open files in the procfs directory of the process in question.
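A minimal sketch of that older directory-fd style, with hypothetical helper names, might look like this. It needs a mounted procfs, which is the first of the problems noted above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open /proc/<pid> as a directory handle; returns -1 if that fails. */
static int open_proc_pid_dir(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Read a file such as "status" relative to the directory handle.
 * Returns the number of bytes read (buf is NUL-terminated), or -1.
 */
static ssize_t read_proc_file(int dirfd, const char *name,
                              char *buf, size_t len)
{
    int fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);

    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len - 1);
    if (n >= 0)
        buf[n] = '\0';
    close(fd);
    return n;
}
```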

To re-gain that ability without the old problems you need some race-free way of going from a pidfd to the corresponding dirfd without a mounted procfs. Simply getting a procfs reference for use in *at syscalls without actually mounting procfs (as proposed by Karellen) would make it possible for live processes, but not for exited processes still referred to by a pidfd, and it wouldn't be race-free. My proposal using openat, or some new flag to dup3 or fcntl, would solve it fully.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 4:39 UTC (Sat) by iabervon (subscriber, #722) [Link]

One non-obvious thing is that the use of proc in pidfd_getpid is not about the other process. It's actually that the process that has a pidfd for some other process opens something under /proc/self in order to get miscellaneous further information about one of its own file descriptors, in fdinfo/(n).

It really seems like it would be sensible for the kernel to provide the information that's in /proc/self available to the process itself without access to procfs more generally or use of absolute paths. On the other hand, that's a separate issue from the pidfd stuff.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:29 UTC (Fri) by alkbyby (subscriber, #61687) [Link]

Slightly click-baity title IMHO. Definitely nerd-sniped me into a "someone on the Internet might be wrong" reaction. (No crime in that of course)

Perhaps it would help if someone could elaborate more what are exact valid or semi-valid uses that are raceful currently (or article could be updated).

I.e. classic posix_spawn and wait should just work. Don't wait{,pid} for your child until you've grabbed its pidfd and you have no race.
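A minimal C sketch of that ordering, under the assumptions that SIGCHLD is not ignored and that nothing else in the process reaps the child in between. The names spawn_with_pidfd() and reap_exit_status() are hypothetical, and the fallback syscall number is an assumption for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <spawn.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434      /* assumption: generic syscall number */
#endif

extern char **environ;

/* Hypothetical helper: spawn a program, then obtain a pidfd for it
 * before anything waits for it. Returns the pidfd, or -1 with errno set. */
static int spawn_with_pidfd(const char *path, char *const argv[],
                            pid_t *pid_out)
{
    pid_t pid;
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);

    if (err != 0) {
        errno = err;            /* posix_spawn returns the error number */
        return -1;
    }
    *pid_out = pid;
    /* The child has not been waited for yet, so even if it has already
     * exited it is still a zombie and its PID cannot have been reused. */
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Reap the child only after the pidfd has been obtained.
 * Returns the exit status, or -1 if it did not exit normally. */
static int reap_exit_status(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The guarantee rests entirely on the ordering: a wait call from elsewhere in the process between the two steps would reopen the window.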

I can only see one special case which is, if parent ignores SIGCHLD then child exiting status is automatically collected, so wait{,pid} won't see it. There is no zombie stage and there is no pid to find. And then, indeed, we could use pidfd bits including this new API to handle this case which would otherwise be raceful. I am not sure how much demand for this case there is, since it "breaks" wait{,pid} anyways.

Or am I missing anything ?

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:28 UTC (Fri) by bluca (subscriber, #118303) [Link]

There are many, many things you might want to do with a process before it exits and you can wait on it. For example, establish identity and authenticate it via polkit.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 0:26 UTC (Sat) by alkbyby (subscriber, #61687) [Link]

Maybe I didn't get what you're referring to. Perhaps you can elaborate on a specific case.

But my point is that as long as we're able to guarantee that the child's pid is not reused, there is no race if/when the parent calls whatever set_xyz on the child's pid (it may find the child dead, but it'll never confuse this child with another process). And the classic mechanism of zombies gives us exactly that. The child's pid won't get reused until the parent collects the child's status.

P.S. Also I was under the impression that lot/most of those "many things" (setsid, unshare etc) are typically what child does for itself (after clone_vfork but before exec, for which posix_spawn has numerous attributes).

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 10:17 UTC (Sat) by bluca (subscriber, #118303) [Link]

You are only thinking about basic process management. There is much more to it, like for example identifying for the purpose of authenticating a service or a session (polkit, dbus, logind, gnome). pidfds can be transferred to other arbitrary processes via SCM_PIDFD. You need to do that before you wait for their exit, for obvious reasons.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 0:39 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> I can only see one special case which is, if parent ignores SIGCHLD then child exiting status is automatically collected

First, it's not possible to handle SIGCHLD meaningfully in many environments (e.g. in a lot of scripting languages). Second, even with SIGCHLD handlers, you have to walk on a tightrope to have a truly race-free code. You can only wait() on processes in exactly one thread (likely in the main event loop), that has to execute exclusively with any other code that might operate on processes. So the only thing your handler can do safely is to kick the event loop to perform a waitid()/waitpid() check.

And forget about multithreading and composability. It's simply impossible to write fully correct multithreaded process management code.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:41 UTC (Sat) by alkbyby (subscriber, #61687) [Link]

> And forget about multithreading and composability. It's simply impossible to write fully correct multithreaded process management code.

We might be misunderstanding each other, somehow. But what you said is untrue. A thread can easily posix_spawn a sub-process and waitpid for it. Even from inside a library. Yes, if the process does a blanket wait() in some other thread it won't work, but this seems like a borked design to me. (Is that one of the use-cases quoted by the article? Are there non-trivial programs or libraries doing such a thing?)

There are definitely libraries doing sub-process spawning. E.g. I recently learned tensorflow does to compile some hw accelerator codes.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 2:15 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> A thread can easily posix_spawn sub-process and waitpid for it.

However, _any_ other wait/waitid() in the process can reap it, waits are not thread-scoped. So you can't have anybody in the process calling them. And if you ONLY do waitpid() calls, it might even be composable.

Except... you do have to call wait() periodically to avoid zombies, because your spawned process can die and reparent its children into your process.

Race-free process creation in the GNU C Library

Posted Sep 3, 2023 9:50 UTC (Sun) by roc (subscriber, #30627) [Link]

Children of a dead process normally reparent to pid 1.

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 3:32 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Ah, correct. I forgot that the code in question also used PR_SET_CHILD_SUBREAPER for some functionality.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 2:55 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> There are definitely libraries doing sub-process spawning. E.g. I recently learned tensorflow does to compile some hw accelerator codes.

Then they are quite likely unsafe, though in practice they would work fine in the vast majority of cases because typical race windows are pretty narrow. You really need malicious input and/or users to exploit that.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 4:07 UTC (Sat) by alkbyby (subscriber, #61687) [Link]

> Then they are quite likely unsafe, though in practice they would work fine in the vast majority of cases because typical race windows are pretty narrow. You really need malicious input and/or users to exploit that.

Well, your comment above about reparenting is only right for pid 1. And I am not sure how much software there is that "steals" other modules/libraries dead kids. My impression is there shouldn't be much.

I quickly inspected libuv for sub-process spawning, they don't steal. And glib. They also do the right thing (even with pidfd when available, since pidfd can be nicely polled).
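To illustrate the "pidfd can be nicely polled" point, here is a minimal C sketch, not taken from libuv or glib. The function names are invented for this example and the fallback syscall number is an assumption for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434      /* assumption: generic syscall number */
#endif

/* Thin wrapper over the raw system call. */
static int sketch_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Hypothetical helper: block until the process behind pidfd terminates
 * or the timeout expires. Returns 1 if it terminated, 0 on timeout,
 * -1 on error. A pidfd becomes readable when its process terminates. */
static int wait_exit_via_poll(int pidfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int r;

    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);

    if (r < 0)
        return -1;
    return (r > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

In an event loop the pidfd would simply be added to the existing poll set, and the exit status collected for that one child once the fd becomes readable.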

With all that, I am still curious what use-cases people are trying to fix with the proposed pidfd_spawn API. So far we've established it could be:

a) when process breaks wait{,pid} by ignoring SIGCHLD

b) when process has things that steal dead kids

But perhaps there are more. And I am curious how common those "bad" cases might be.

Why not fix process ids?

Posted Sep 2, 2023 6:32 UTC (Sat) by epa (subscriber, #39769) [Link]

Yes, a process id can be reused so you can’t reliably find the id and then use it later. This even applies to command line system administration where, in principle, you might kill a different process to the one you just saw in ‘top’. But why does it have to be that way?

Make process ids 64 bit and they can be unique for the lifetime of the system.

Why not fix process ids?

Posted Sep 2, 2023 7:04 UTC (Sat) by Subsentient (subscriber, #142918) [Link]

The obvious solution, the right solution, and I'm sure there's some irritating illegitimate reason that it won't happen.

Why not fix process ids?

Posted Sep 2, 2023 14:23 UTC (Sat) by corbet (editor, #1) [Link]

That can, indeed, be done now by messing with /proc/sys/kernel/pid_max. Making bigger process IDs the default will always risk breaking applications, though.

Why not fix process ids?

Posted Sep 2, 2023 16:04 UTC (Sat) by pebolle (subscriber, #35204) [Link]

Which has a hardcoded maximum of ~4,000,000 on 64-bit systems.

I could be misreading include/linux/threads.h, but since systemd on my (Fedora) system sets pid_max to that value out of the box I don't think I actually am.

Why not fix process ids?

Posted Sep 4, 2023 9:50 UTC (Mon) by mezcalero (subscriber, #45103) [Link]

systemd has been bumping this value to the max the kernel allows btw for a longer time. Not a single complaint reached us about that. The incompatibilities turned out to be mostly theoretic.

That said the kernel max is 22bit or so iirc, i.e. far from 32 or even 64bit...
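For reference, a small C sketch that reads the current setting and compares it with the 64-bit ceiling discussed above (4 * 1024 * 1024, i.e. 2^22 = 4194304). The macro and function names are made up for this example.

```c
#include <stdio.h>

/* The 64-bit upper bound mentioned in the thread: 2^22 = 4194304.
 * The macro name is invented for this sketch. */
#define SKETCH_PID_MAX_LIMIT_64BIT (4L * 1024 * 1024)

/* Hypothetical helper: read /proc/sys/kernel/pid_max.
 * Returns -1 if the file cannot be read or parsed. */
static long read_pid_max(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

On a system where systemd has raised the value to the maximum, read_pid_max() would be expected to equal SKETCH_PID_MAX_LIMIT_64BIT.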

Why not fix process ids?

Posted Sep 4, 2023 10:12 UTC (Mon) by pebolle (subscriber, #35204) [Link]

> the kernel max is 22bit or so iirc,

That's correct (and thanks for confirming my reading of include/linux/threads.h).

Why not fix process ids?

Posted Sep 4, 2023 19:14 UTC (Mon) by adobriyan (subscriber, #30858) [Link]

> This even applies to command line system administration where, in principle, you might kill a different process to the one you just saw in ‘top’.

It is simple to implement correct process killing. All a programmer needs to do is to hold a /proc/$pid descriptor while sending the signal.
32- or 64-bitness doesn't change anything.

I've checked what htop does and it seems to do it wrong: it opens /proc/$pid then openat() few files from there but then closes directory.

41073 openat(3, "41057", O_RDONLY|O_NOFOLLOW|O_DIRECTORY) = 4
41073 openat(4, "task", O_RDONLY|O_NOFOLLOW|O_DIRECTORY) = 5
...
41073 close(5) = 0
...
41073 close(4)
...
41073 kill(41057, SIGTERM) = 0

Why not fix process ids?

Posted Sep 4, 2023 19:21 UTC (Mon) by adobriyan (subscriber, #30858) [Link]

Hey, it is even possible to do from the command line without "integrated" tools!

If kill -TERM is done from /proc/$pid !

$ ./pause &
[1] 41956

$ cd /proc/41956

# double check it is the same process, VERY IMPORTANT
$ cat comm #cmdline
pause

# send signal WITHOUT LEAVING /proc/$pid (VERY IMPORTANT)
$ kill -TERM 41956

# ... and it's gone!
$ cat comm
cat: comm: No such process

Why not fix process ids?

Posted Sep 4, 2023 23:15 UTC (Mon) by mchapman (subscriber, #66589) [Link]

No, this is not sufficient.

Between your "cat" and "kill" commands, the process could have exited, been reaped by its parent, and another process could have been forked with PID 41956. By the time you run kill, that PID may not be the same process you thought it was.

Simply holding a reference to the (old) /proc/$PID directory does not prevent the PID from being reused.
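For comparison, signalling through a pidfd instead of a PID number might look like the following C sketch. The function names are hypothetical, raw system calls are used so that it does not depend on the libc version, and the fallback syscall numbers are assumptions for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>      /* for callers checking ESRCH */
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>   /* for callers that reap the child */
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434          /* assumption: generic syscall number */
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424   /* assumption: generic syscall number */
#endif

/* Thin wrapper over the raw system call. */
static int sketch_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Hypothetical helper: send a signal to whatever process the pidfd was
 * opened for. If that process has been reaped, the call fails with
 * ESRCH instead of hitting a newer process with the same PID. */
static int signal_via_pidfd(int pidfd, int sig)
{
    return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}
```

The pidfd still has to be obtained while the PID is known to refer to the intended process. After that point the check and the kill no longer race with PID reuse.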

Why not fix process ids?

Posted Sep 14, 2023 14:50 UTC (Thu) by ksandstr (guest, #60862) [Link]

Well for one thing a 64-bit pid_t will break ABI. That's a no-no, but maybe not as big as the alternatives. It'd still have that strictly theoretical issue of identifier wraparound once 8 exbi-PIDs have been spent, and command line users would curse your distant memory for having to reference PID 2**55+177, but ignoring those it'd work. Furthermore, if the format was redefined to reserve pid_t's top 32 bits for an L4-style version field which wouldn't appear in e.g. ps(1) output, wraparound would only have to be processed once a PID had been recycled 2**31 times -- though going from a "UI" PID to a full PID would entail an extremely tenuous TOCTOU issue[-1].

Another substitute solution to pidfds would make process IDs a capability of sorts, such that they're created by fork/spawn, transferred to other processes by unspecified means[0], and invalidated at wait() so they subsequently raise an error upon use. This would ensure that stale PIDs, being those that refer to a since-deceased process, don't end up referring to a different process. However the cost of doing this is a slight API break because kill() etc. would raise "unknown PID" while that PID might actually have come to exist again. Also the question of validating such a capability from e.g. command line parameters will need an answer.

Considering that any use of a PID is an instant TOCTOU hazard to any but the parent process (because it's the only one that can call wait() on that PID), the idea of "just fix the call sites" can be recognized unworkable in a great many cases. Analogously to the capability idea above, pidfds provide a process-local identifier in the file descriptor[1] and a means to communicate process termination at time of use. And their cost isn't even an ABI break -- just that the old API will be creaky and the new API will be both nonportable, so extensive as to cover every POSIX call that takes a pid_t, and any pidfd_getpid() band-aid call will be another instant TOCTOU hazard (unless). Out of these approaches, pidfds certainly seem like an attractive solution since they mainly require lots of footwork and the creation of a "pre-horizon" category of vulnerable programs that process PIDs in any way.

[-1] this one would be soluble by invalidating wrapped PIDs in processes whose lifecycle intersects the wraparound point, another mild API break and perhaps the bane of init(8) in an interstellar probe or something.
[0] perhaps a general two-stage mechanism to validate a PID and then confirm its correct identity (using e.g. the program's fsid/inode# pair), or an unix domain socket faff not unlike fd transfer.
[1] though transferring these to another process would seem to require a unix domain socket between the two.
[2] there is no 4th footnote; I'm just using this space to point out that I'm currently unemployed but capable of spitting out this kind of off-hand analysis, and a suitably impressed reader's employer could almost certainly use a mad lad like me. *wink* *wink*

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 7:02 UTC (Sat) by ibukanov (subscriber, #3942) [Link]

Hm, I thought that pid could only be reused after the parent process called waitpid. So obtaining pidfd should always be OK as long as one does that before waitpid. If this is not the case, that seems like a bad bug in Linux.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 10:14 UTC (Sat) by bluca (subscriber, #118303) [Link]

Reliably identifying a process is not something that only its parent does. pidfds can be passed over to other arbitrary processes via SCM_PIDFD/SO_PEERPIDFD.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 10:57 UTC (Sat) by darmengod (subscriber, #130659) [Link]

So with existing interfaces the parent can know a child's PID and (race-free) get a pidfd to it, as long as the precautions in pidfd_open(2) - NOTES are observed.

But when it sends the pidfd to some other process, the receiver has no way to get the PID (number) without /proc being accessible, and that is undesirable.

Couldn't this be resolved by convention at the application layer? So the original parent process doesn't just send an empty message with SCM_PIDFD attached, but includes the PID as a number in the regular message payload?

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 11:09 UTC (Sat) by bluca (subscriber, #118303) [Link]

That doesn't allow the receiver to verify anything, it's not just about knowing the pid, it's about knowing that it is still owned by the original process and not a recycled one. This is a real-world problem that has caused several CVEs, for example in polkit, and that so far has only been partially worked around by using unreliable heuristics like the start time in the target's proc/pid/status and other metadata, that can make it harder to exploit but not impossible.

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 0:46 UTC (Mon) by njs (subscriber, #40338) [Link]

There are some subtle details here that I think the article and underlying thread didn't quite fully understand, and it took me a minute to read the thread and figure out what was going on, so here's my current understanding:

As a few other comments noted, it's already possible for parents to avoid the race condition when spawning a child and getting a pidfd to it, because pids aren't recycled when the child dies – they're recycled after the parent affirmatively calls one of the wait variants to reap the exit status. So "just" call pidfd_open before you call wait, problem solved.

But this is still useful for a few reasons:

- "make sure nothing in your program calls wait, or else a very obscure issue could happen one time in a million" is certainly an invariant you *can* enforce, but it sure is easier and less error prone if you don't have to.

- if you're writing a reusable library, you don't know what other code will be running in the same process as you. You might prefer to be robust against being used by poorly implemented callers, that do things like call wait on everything.

- if you're writing a highly backwards compatible library you *can't* add undocumented, observable side effects to your operations, even if the only code that would notice is arguably broken. Corollary: right now these libraries cannot move existing functionality out into helper processes, even if this would be eg better for security, and even if the user-visible API stays exactly the same. If mylib_do_foo() starts secretly spawning a child, that fact can't be encapsulated, because it will leak out into the process-global child monitoring APIs like SIGCHLD and wait.

But, there's an even more obscure Linux feature that can solve this: if you pass exit_signal=0 to clone, then the child process is hidden from not just SIGCHLD but also wait (!). Technically I think this is orthogonal to the pidfd stuff, but they're very convenient to use together, so it makes sense that a new interface exposing exit_signal=0 would also return a pidfd instead of a pid.

... Unfortunately the proposed patch only adds support for this to fork(), not to posix_spawn(), and the fork() support is controversial in general. But hopefully it'll get revised so we end up with exit_signal=0 *and* pidfd support in posix_spawn.

(It would also be nice if you could arrange that a child with exit_signal=0 and CLONE_PIDFD would be automatically orphaned when the pidfd was closed, since regular SIGCHLD reaping won't work on it. But that would be a whole other kernel patch.)

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 3:02 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

I'm still puzzled by this "race", as a process stays in zombie state for as long as its parent has not reaped it, and during this time the PID remains assigned to it and cannot be reused. My understanding here is that we're talking about the parent watching the child. So that feels strange. I understand the problem as the parent reaping the child then trying to reference it. That sounds odd. Maybe there are indeed races around this but I can't see them. If the parent uses multiple threads to manipulate the child's FD and to reap it, then it should at least use a lock between them to avoid TOCTOU, but that still feels odd and I'm interested in an explanation of the particular case which causes a race here, i.e. the process disappears and is replaced before it was reaped.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 3:15 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

The problem is that wait() called from _any_ thread can reap any child. So you need to absolutely prevent it from being called while you're in a "critical section" of getting a process and manipulating it.

It also does not help at all if the process in question is not your child.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 9:11 UTC (Tue) by wtarreau (subscriber, #51152) [Link]

Sure but it's then trivially handled by using a lock to make sure that only one thread at a time may either wait() or kill():

    sigchld_handler()
    {
        lock(pidlock);
        pid = wait(NULL);
        reap_child(pid);
        unlock(pidlock);
    }

    signal_child(int child, int sig)
    {
        lock(pidlock);
        pid = get_pid_from_child(child);
        if (pid)
            kill(pid, sig);
        unlock(pidlock);
    }

I'm sorry but I continue to think the problem is mostly made up.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 16:01 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> I'm sorry but I continue to think the problem is mostly made up.

Can you find a single library that starts subprocesses, that has hooks for these kinds of locks? That's what I mean by "not composable".

Also, having to do such lock dances is an indication of a bad API in itself.

Race-free process creation in the GNU C Library

Posted Sep 6, 2023 14:02 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

> Can you find a single library that starts subprocesses, that has hooks for these kinds of locks?

I don't know, since I don't know what such libs currently do. But it would seem like the correct thing to do if they claim to be thread-compatible.

> Also, having to do such lock dances is an indication of a bad API in itself.

If necessary it could be wrapped into a simpler API. But the locks are precisely due to a race which is inherent to process reaping/signaling that can be happening in parallel and that one needs to serialize. I don't see why one must suddenly start to make an exception for this specific case and say "let's pretend there is no race here so that we can save one lock" nor "let's assume programmers creating threads don't understand the limits of threads". I would, however, clearly welcome an in-libc pair of wrappers that just adds these locks around wait() and kill() such as locked_wait() and locked_kill() to be more friendly to the user and to lib developers. But my feeling is that if it's just for this, it's becoming overkill, and the fact that it started a discussion seems to indicate others have the same feeling.

Race-free process creation in the GNU C Library

Posted Sep 6, 2023 15:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> I would, however, clearly welcome an in-libc pair of wrappers that just adds these locks around wait() and kill() such as locked_wait() and locked_kill() to be more friendly to the user and to lib developers.

No. You need to wrap _user_ code in locks. It's not just functions themselves. I.e.:

You have to write ALL process-related code in this manner:

1. get_lock
2. pid = create_process()
3. verify_pid_is_correct(pid)
4. kill(pid, 9)
5. release_lock()

This can't be wrapped into simple locked functions, unless you want to have a closure-based API. And even then you'll have all the locking-related issues, like deadlocks.

In short, the classic process API is inherently broken in the presence of threads. It can't be sanely fixed.

Race-free process creation in the GNU C Library

Posted Sep 6, 2023 20:53 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

Well, you're right regarding the need to put the locks into the application. That doesn't mean the API is broken, it's just possibly not convenient enough for some users.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 0:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

If it can't be sanely used then it's broken. It's that simple.

I consider having to do locks in sometimes inconvenient places to block "spooky action at a distance" the very definition of brokenness.

Also, there's still a case where you might need to do operations (e.g. send signals) to processes that are not your children.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 4:00 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

> I consider having to do locks in sometimes inconvenient places to block "spooky action at a distance" the very definition of brokenness.

If so, absolutely everything involving threads or communication with other processes is broken. I'm sorry but I disagree with this definition.

> Also, there's still a case where you might need to do operations (e.g. send signals) to processes that are not your children.

Yes, and this has always shown a moderate reliability only. That's the classical "ps auxw" then "kill $pid". I don't see what in the proposed API could improve this situation at all.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 4:05 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> If so, absolutely everything involving threads or communication with other processes is broken. I'm sorry but I disagree with this definition.

Imagine that memory allocations worked the same way. Instead of getting a pointer, you get a "zone ID" with the same semantics as PIDs.

Sorry, but plenty of interfaces are well-designed and work just fine with threads. In libc: memory allocations, file operations, IPC primitives, etc.

To be fair, libc also has plenty of other broken interfaces: the notion of the current directory, non-reentrant functions, the whole mess with locales and timezones.

> I don't see what in the proposed API could improve this situation at all.

Uhm... You get a pidfd and you can use it to make sure that the PID won't be reused while at least one pidfd descriptor is open. This makes it possible to do race-free process manipulation.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 18:26 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

> Imagine that memory allocations worked the same way.

Actually you gave a pretty good example, because memory allocations *do* work the same way. If one thread tries to access a memory location while another one is freeing it and without coordinating together, you'll pretty quickly see either a use-after-free bug or a basic segfault.

> You get a pidfd and you can use it to make sure that the PID won't be reused while at least one pidfd descriptor is open. This makes it possible to do race-free process manipulation.

I'm just seeing it as convenience at the expense of extra FDs, which may in some cases result in new classes of bugs such as leaks if some FDs are passed by accident or just lost without being closed or stuck into a UNIX socket but closed so that nobody sees it, and even possibly vulnerabilities later if accessing such an FD is possible and is sufficient to send a signal over it despite the processes not being supposed to be able to interact.

Don't get me wrong, I'm not saying it's bad, we all love when some APIs are made easier to use or open new possibilities. It's just that I don't feel like this was that difficult to use correctly and that the small extra efforts probably did not warrant the possible classes of issues that will inevitably come with it. Time will tell.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 18:39 UTC (Thu) by bluca (subscriber, #118303) [Link]

> It's just that I don't feel like this was that difficult to use correctly

Then you still haven't quite grasped what the actual problems being solved here are, and it might be time to go look at the sources linked in the article before further commenting, especially the cover letters and the linked bugzillas.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 19:31 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> Actually you gave a pretty good example, because memory allocations *do* work the same way. If one thread tries to access a memory location while another one is freeing it and without coordinating together, you'll pretty quickly see either a use-after-free bug or a basic segfault.

Now imagine that allocation can be freed at any moment.

> I'm just seeing it as convenience at the expense of extra FDs,

FDs are not a scarce resource.

> which may in some cases result in new classes of bugs such as leaks if some FDs are passed by accident

How is that different from any other FDs?

> or just lost without being closed

Don't lose resources.

> or stuck into a UNIX socket but closed so that nobody sees it

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 9:46 UTC (Tue) by bluca (subscriber, #118303) [Link]

> My understanding here is that we're talking about the parent watching the child.

No, it is not, this is about tracking any process from any other process in a race-free manner end-to-end, as dbus/polkit/systemd clients/etc need to do.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 9:10 UTC (Tue) by tlamp (subscriber, #108540) [Link]

The command arguments name for file/path seem to be switched for pidfd_spawn and pidfd_spawnp?

As for posix_spawn the parameter name path is used (i.e., relative or absolute) and for posix_spawnp the parameter name file is used (i.e., a filename that is looked up through PATH environment variable).

See https://manpages.debian.org/bookworm/manpages-dev/posix_spawn.3.en.html.

So shouldn't the signatures look like:

    int pidfd_spawn(int *restrict pidfd,
                    const char *restrict path,
                    const posix_spawn_file_actions_t *restrict facts,
                    const posix_spawnattr_t *restrict attrp,
                    char *const argv[restrict],
                    char *const envp[restrict]);

    int pidfd_spawnp(int *restrict pidfd,
                     const char *restrict file,
                     const posix_spawn_file_actions_t *restrict facts,
                     const posix_spawnattr_t *restrict attrp,
                     char *const argv[restrict],
                     char *const envp[restrict]);

Race-free process creation in the GNU C Library

Posted Sep 14, 2023 19:03 UTC (Thu) by the8472 (guest, #144969) [Link]

For the Rust standard library we've found a pattern that gets us a race-free pidfd without using clone3.

Open a unix socket pair, fork, pidfd_open in the child, send the fd to the parent, do other process setup stuff, exec. We already had a communication channel to the parent anyway for error handling, previously it was a pipe, so it wasn't a big change.
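A rough C rendering of that pattern, for illustration only. This is not the Rust standard library's code: all function names are hypothetical, the fallback syscall number is an assumption for older headers, and the exec step is replaced by a plain _exit().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434      /* assumption: generic syscall number */
#endif

/* Control-message buffer sized and aligned for one file descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one fd over a Unix socket with SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *c;

    memset(&u, 0, sizeof(u));
    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent with SCM_RIGHTS; returns -1 if none arrived. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *c;
    int fd;

    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
        return -1;
    c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Fork; the child opens a pidfd for itself and sends it to the parent.
 * Returns the pidfd in the parent, or -1 on failure. */
static int fork_with_pidfd(pid_t *pid_out, int child_exit_code)
{
    int sv[2];
    pid_t pid;
    int pidfd;

    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, sv) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        int self = (int)syscall(SYS_pidfd_open, getpid(), 0);
        if (self < 0 || send_fd(sv[1], self) < 0)
            _exit(127);
        /* Real code would finish process setup and exec here. */
        _exit(child_exit_code);
    }
    close(sv[1]);
    pidfd = recv_fd(sv[0]);     /* -1 if the child died before sending */
    close(sv[0]);
    *pid_out = pid;
    return pidfd;
}
```

Because the child opens the pidfd on its own PID, the descriptor refers to the right process regardless of what the parent does with wait in the meantime.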