
Tracking Issue for `#![feature(available_parallelism)]` · Issue #74479 · rust-lang/rust

source link: https://github.com/rust-lang/rust/issues/74479


Tracking Issue for #![feature(available_parallelism)] #74479

6 of 7 tasks

yoshuawuyts opened this issue on Jul 18, 2020 · 36 comments

Comments


Member

yoshuawuyts commented on Jul 18, 2020

edited

The feature gate for the issue is #![feature(available_parallelism)].

This is a tracking issue for std::thread::available_parallelism; a portable API to determine how many threads to spawn in order to ensure a program can make use of all the parallelism available on a machine.

Public API

#![feature(available_parallelism)]
use std::thread;

fn main() -> std::io::Result<()> {
    let count = thread::available_parallelism()?.get();
    assert!(count >= 1);
    Ok(())
}

Tasks

  • Resolve discussion on naming.
  • Resolve discussion on function signature.
  • Resolve discussion on terminology.
  • Add support for externally-set limits on Linux (ref).
  • Add support for externally-set limits on Windows (ref).
  • Update documentation to remove mentions of "hardware" (ref).
  • Smooth out the docs example.

Notes on platform-specific behavior

available_parallelism is (for now) only considered a hint. The following platform limitations exist:

  • On Windows the available concurrency may be undercounted when exceeding 64 threads #74480 (comment)
  • On Windows the available concurrency may be overcounted when limited by a process wide affinity mask or job object limitations #74480 (comment)
  • On Linux the available concurrency may be overcounted in the face of cgroup limits and task-wise affinity masks #74480 (comment)

Implementation history


Member

Author

yoshuawuyts commented on Sep 16, 2020

edited

A fix for the Linux overcounting exists in num_cpus, which is license-compatible with the compiler. I believe the fix boils down to including this conditional in the Linux detection code:

use std::mem;

// `set` must be zero-initialized before the kernel fills it in.
let mut set: libc::cpu_set_t = unsafe { mem::zeroed() };
if unsafe { libc::sched_getaffinity(0, mem::size_of::<libc::cpu_set_t>(), &mut set) } == 0 {
    // Count only the CPUs the current thread's affinity mask allows.
    let mut count: u32 = 0;
    for i in 0..libc::CPU_SETSIZE as usize {
        if unsafe { libc::CPU_ISSET(i, &set) } {
            count += 1;
        }
    }
    count as usize
} else {
    // Fall back to the `libc::_SC_NPROCESSORS_ONLN` code we're currently using.
}

cc/ @luser


Contributor

the8472 commented on Sep 16, 2020

@yoshuawuyts that would be insufficient due to cgroup CPU quotas, which get your entire cgroup frozen until the next time slice once the quota is used up. Docker and other containerization services make heavy use of that for resource allocation.
One can calculate the equivalent of available full-time CPU cores per time slice from the quotas, and other runtimes (e.g. Java's) already do.

see https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
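For illustration, a minimal sketch of that quota-to-cores calculation, assuming the cgroup v1 file layout (cgroup v2 exposes the same pair of values through a single cpu.max file); the function name and error handling here are mine, not num_cpus' actual implementation:

use std::fs;

// Derive "full-time cores" from the CFS bandwidth configuration:
// quota microseconds per period, rounded up to whole cores.
fn cfs_quota_cores() -> Option<usize> {
    let read = |path: &str| -> Option<i64> {
        fs::read_to_string(path).ok()?.trim().parse().ok()
    };
    let quota = read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")?;
    let period = read("/sys/fs/cgroup/cpu/cpu.cfs_period_us")?;
    if quota <= 0 || period <= 0 {
        return None; // quota == -1 means no limit is configured
    }
    // e.g. quota = 150_000us per 100_000us period -> 2 full-time cores
    Some(((quota + period - 1) / period) as usize)
}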

strohel commented:

I would suggest that taking into account a) Linux cpusets (#74479 (comment)) and b) cgroup CPU bandwidth limits (#74479 (comment)) be considered blockers for the stabilization of available_concurrency.

Rationale: change b) in num_cpus was a source of significant CPU efficiency gains and potential memory savings in cloud-based environments. If available_concurrency is stabilized, it is expected that people will migrate to it from num_cpus. That would currently cause a regression.

This is complicated by the fact that associating "Kubernetes CPU limits" with "cgroups" and https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt is non-trivial, as there are abstraction layers between them (so people may not be aware of this even in the presence of documentation on available_concurrency).

The fix can also be to include some num_cpus code, but the cgroups part is more substantial than the cpusets one.


Member

Author

yoshuawuyts commented on Oct 27, 2020

edited

@strohel I definitely agree we should support cpusets before stabilizing: we may currently over-report concurrency in containers, which is something that should be fixed.

But I'm less sure about cgroup CPU bandwidth limits. If I'm understanding #74479 (comment) correctly, this would cause the reported number of CPUs to fluctuate during the runtime of the program. I believe this may cause issues when used in examples like the following:

use std::io;
use std::thread;

fn main() -> io::Result<()> {
    // perform init work here (connect to db, warm caches, etc.)

    let max_threads = thread::available_concurrency().map(|n| n.get()).unwrap_or(1);
    // `ThreadPool` stands in for any fixed-size thread pool implementation.
    let pool = ThreadPool::init(max_threads);

    // use pool for the duration of the program
    Ok(())
}

If the boot sequence of a program is CPU-intensive, by the time the thread pool is initialized the CPU quota may have been expended. Which means that the thread pool may set a limit lower than the max supported concurrency for the program, causing an overall under-utilization of resources.

It would seem that the right use of a bandwidth-aware API would be to periodically update the concurrency, not to statically initialize it once, since that risks under-utilization if the quota happens to be low at that moment. Did I interpret this accurately?

edit: On closer reading I think I may have indeed misinterpreted. The quota value is statically set, and not constantly updated during a program's runtime, which means it can indeed be used to calculate the max concurrency and does not need to be updated in a loop. If this is indeed the case, I agree we should account for this.

strohel replied:

@yoshuawuyts I see you've just edited the comment; the clarification is correct.

It is true that one can change the CPU quota of a running program, but AFAIK the same holds for cpusets. Linux even supports runtime physical CPU hotplugging. That's a non-issue; the API can do nothing about it (beyond mentioning it in the docs). Caching of the value is problematic for this reason.


Contributor

the8472 commented on Oct 27, 2020

If the boot sequence of a program is CPU-intensive, by the time the thread pool is initialized the CPU quota may have been expended. Which means that the thread pool may set a limit lower than the max supported concurrency for the program, causing an overall under-utilization of resources.

Perhaps my post was not clear. There is a quota configuration which you can read from sysfs (as quota per interval and interval length), and there is the runtime depletion of that quota. The depletion is dynamic, but the configuration does not change unless someone explicitly reconfigures it. available_concurrency() would be based on the configuration, not the remaining quota available in the current time slice.

raphaelcohn commented:

I'm deeply against including this in Rust's core API. Whilst conceptually it seems helpful and even useful, the technique used for Linux isn't just broken with regard to containers; it's broken in a number of other situations as well. It turns out doing this well is extremely hard, and I've gone through several iterations of my own to get it 'right'. First up, sysconf(_SC_NPROCESSORS_CONF) cannot be relied upon, e.g. in musl libc, as it uses the sched_getaffinity syscall. As my notes in my linux-support crate make clear, it becomes unreliable after the sched_setaffinity syscall has been made. As well as taking into account current cgroup2 settings (e.g. effective cgroup hyper threads), one needs to be aware that hyper-threading may have been disabled and some threads isolated from the OS (e.g. https://github.com/lemonrock/linux-support/blob/master/workspace/linux-support/src/cpu/HyperThread.rs#L171). Whether hyper-threading matters to the count is also important; how threads share CPU caches and resources (e.g. https://github.com/lemonrock/linux-support/blob/master/workspace/linux-support/src/cpu/HyperThread.rs#L302), which NUMA nodes they are associated with, whether they are online, etc. all have a significant impact.

Whilst wanting a count of CPUs is a common request, naively spinning up threads for all CPUs, or all CPUs minus 1, say, is never going to work unless one can ensure a dedicated OS instance to run on.

@strohel - the API can do something about CPU hotplugging. It's quite discoverable.


Contributor

ibraheemdev commented on Sep 1

Is there any reason not to call this num_cpus?


Member

Author

yoshuawuyts commented on Sep 1

@ibraheemdev you can find the conversation that led to the current naming here: #74480 (comment).

The short version of "why not name it num_cpus?" is that this API does not return the number of CPUs in a computer (neither logical nor physical). Instead it returns the amount of "concurrency" [1] available to a process at that time.


[1]: We borrowed the word "concurrency" from C++'s hardware_concurrency API.


Member

joshtriplett commented on Sep 27

I submitted #89310 to implement available_concurrency using sched_getaffinity on Linux, which handles the case where the process doesn't have access to all CPUs, such as when limited via taskset or similar.

raphaelcohn replied:

@joshtriplett I still think stabilizing this is a really bad idea; see my comment above on sched_setaffinity. If you really want to optimize concurrency, then I think folks should use specialized crates; for example, my linux-support crate. It's far richer and lets one handle a lot of confusing situations.


Member

joshtriplett commented on Oct 1

@raphaelcohn wrote:

If you really want to optimize concurrency, then I think folks should use specialized crates

Rust projects need some portable mechanism to determine a default number of threads to run, even if that's an approximation. (Projects currently use the num_cpus crate for this.) Crates that run on many platforms and need to use the full capabilities of the system's CPUs are not necessarily interested in adding various non-portable code paths to detect further nuances of the system. Some crates may wish to do a more in-depth adaptation to the exact capabilities of the system, but many crates just want a single number for "how many threads should I spawn", and it makes sense to have a portable API to return that number, without making an application have to directly deal with the system-specific nuances that might go into such a number.

The purpose of the available_concurrency() call is not to determine the number of CPUs on the system; it's to determine a reasonable default for the amount of concurrency to use. I did see your comment above about sched_setaffinity; however, if someone uses sched_setaffinity to limit the number of CPUs the process can run on, it's correct for available_concurrency() to return a correspondingly smaller number of CPUs, so that the process defaults to running a thread for each CPU that its affinity allows it to run on.

If you feel there's some additional nuance this function could be detecting to provide a better default, I'd be happy to review further pull requests that detect (for instance) additional aspects of cgroup configuration. But I do believe we should encourage people to use an API abstraction like this, to provide a reasonable default.
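A minimal sketch of that intended behavior, using the then-unstable available_concurrency name (the printed number depends on the machine and on the affinity mask the process is launched with):

#![feature(available_concurrency)]
use std::thread;

// Launched as `taskset -c 0,1 ./demo` on an 8-CPU Linux machine, the
// sched_getaffinity-based implementation should print 2 rather than 8.
fn main() -> std::io::Result<()> {
    println!("{}", thread::available_concurrency()?.get());
    Ok(())
}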

raphaelcohn replied:

@joshtriplett I couldn't disagree more. Portability by compromise is a poor goal; most Rust developers trust the standard library to produce performant code (something I myself used to believe). However, I'm not going to hold this up. If this change goes in, then at the very least, could we please caution those who use it that it is a simplistic solution, and highlight that the underlying implementation on Linux is brittle.

Lastly: "if I feel". This is clumsily expressed and reads as a demonstration of power, not equality; clearly I do feel this way. It's clear that you want to make this change, that you have the power to make this change, and that you have a majority view. I have no interest in engaging further.


Member

Author

yoshuawuyts commented on Oct 6

Follow up to #74480 (comment)

I had a chat today with @sivadeilra on how to approach the 64-thread limit on Windows, and whether we could detect more than 64 threads on a system by default. They mentioned that in order to do that we need to change the ambient state of a program, which a lookup function probably shouldn't do.

One option @sivadeilra mentioned we could explore is adding an API via std::os::windows::* to explicitly change the ambient state. We should probably prototype this on crates.io first though.
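For context, a sketch of the system-wide count that is possible without touching ambient state, assuming the winapi crate's bindings (this is not part of any proposed std API):

// Counts logical processors across *all* Windows processor groups. Note this
// is what the system has, not what the current process can use: a process
// stays within one group (at most 64 CPUs) unless it explicitly changes its
// group affinity, the "ambient state" mentioned above.
#[cfg(windows)]
fn system_logical_processors() -> usize {
    use winapi::um::winbase::GetActiveProcessorCount;
    use winapi::um::winnt::ALL_PROCESSOR_GROUPS;
    unsafe { GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) as usize }
}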


Follow up to #74480 (comment)

While looking through the discussion on the original PR I also realized that we weren't tracking #74480 (comment). It mentions that on Windows one might be able to limit the amount of concurrency available to a program via process affinity masks / job objects. @sivadeilra do you have a sense for how we could support that? Would getting those values also interact with the ambient state?

yoshuawuyts changed the title from Tracking Issue for #![feature(available_concurrency)] to Tracking Issue for #![feature(available_parallelism)] on Oct 6


Member

joshtriplett commented on Oct 13

@raphaelcohn FWIW, my intent was precisely the reverse. I was trying to explicitly invite further refinement to the implementation, to avoid the implication that it shouldn't account for target-specific complexity when it can.


Member

Author

yoshuawuyts commented on Nov 9

The API, documentation, and naming have all undergone thorough review in the past few months. And with #89310 merged, we are now returning accurate numbers for the majority of uses.

The only known limitation at the moment is that we're not detecting externally-set thread count limits on Windows. While not unimportant, there is nothing in the API or documented guarantees that prevents us from improving accuracy later on.

It's been well over a year since work on this started, and I think we've reached a point where we are ready to propose stabilization. @joshtriplett, would you be able to start a libs FCP?


Contributor

the8472 commented on Nov 9

The only known limitation at the moment is that we're not detecting externally-set thread count limits on Windows.

CFS bandwidth limits (commonly used to limit containers) have been mentioned several times. I consider that a significant limitation and #89310 does not address that since bandwidth limits and affinities are independent ways to allocate CPU resources.


Member

Author

yoshuawuyts commented on Nov 9

@the8472 Thanks for raising that, that makes sense! I didn't realize Linux had two distinct ways of restricting CPU resources. Just to make sure we're entirely on the same page: what you're saying we're still missing is an equivalent of num_cpus's cgroups_num_cpus logic?


Contributor

the8472 commented on Nov 9

Yes, although that particular implementation is debatable since it caches the calculation even though quotas can be changed at runtime.


Member

joshtriplett commented on Nov 9

edited

@the8472

CFS bandwidth limits (commonly used to limit containers) have been mentioned several times. I consider that a significant limitation and #89310 does not address that since bandwidth limits and affinities are independent ways to allocate CPU resources.

That's absolutely true. However, looking into it further, I'm not actually sure the degree to which either CPU shares or CFS bandwidth limits should affect available_parallelism, and in particular I'm not sure the semantic implemented by num_cpus is exactly what we want.

(By contrast, I do think we should make sure to account for cgroup CPU sets.)

Talking it through for each of the types of cgroup restriction:

  • cpusets are directly a limitation on affinity, and I think it's appropriate for available_parallelism to respect them.
  • CPU shares limit what percentage of the CPU you get, and effectively throttle your process down proportionally if other processes are also using the CPU, while allowing you to burst past that if the CPU isn't busy. Given that if the CPU isn't busy you're allowed to saturate it, I don't think this should affect available_parallelism.
  • CFS bandwidth limits don't have bursting, and limit your process to a maximum number of CPU microseconds per period of time even if the CPU would otherwise be idle. However, it's not obvious to me whether, if you're allowed to use a fourth of the available CPU bandwidth of the system, it's always better to use that by using a fourth of the total CPUs, rather than using all the CPUs for a fourth of their bandwidth. I can imagine scenarios that would favor either approach, and neither seems like an obvious win over the other.
    • As an example: if you expect to saturate the CPU for longer than the quota period, and you care about the latency between times you get to run, you may want to narrow to a number of CPUs that lets you run continuously (e.g. 2 of 8 CPUs if you have a 1/4th quota). On the other hand, if you expect to do a CPU-bound computation for a relatively short amount of time, potentially less than the quota, then running on all 8 CPUs may let you finish your computation four times faster. And if you're doing non-CPU-bound work, such as responding to HTTP requests, running on all CPUs may give a burst of requests lower overall latency because there are more CPUs to respond to requests concurrently, as long as you don't saturate your CPU bandwidth. (A worked sketch of this trade-off follows this list.)
    • Also, CFS bandwidth limits are more complex than just checking one level of quota; you're allowed to have a hierarchy and "oversubscribe" that hierarchy, such that an outer cgroup may be more restrictive than an inner one it contains. Do we really want to walk up the tree and compute the net effect of the quotas?
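A worked sketch of the trade-off in the first sub-bullet above (the numbers are invented for illustration):

// Quota: 200_000us of CPU time per 100_000us period on an 8-CPU machine,
// i.e. a quarter of the machine's total bandwidth.
fn main() {
    let (quota_us, period_us, cpus) = (200_000u64, 100_000u64, 8u64);

    // "Run continuously": the number of threads that never hit the throttle.
    let continuous = quota_us / period_us;
    assert_eq!(continuous, 2); // 2 of 8 CPUs, each at full bandwidth

    // "Finish fast": all CPUs, each running a fraction of every period.
    let duty_cycle = quota_us as f64 / (period_us * cpus) as f64;
    assert_eq!(duty_cycle, 0.25); // 8 CPUs, each a quarter of the time
}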

There's also a general principle that I would tend to favor here: in the absence of a clear answer that's a winner much more often than it isn't, having simpler and more predictable behavior seems preferable. When people set CFS bandwidth limits, they expect that to affect the scheduler; it's not entirely obvious that they would also expect random applications to read those limits and run fewer threads than they otherwise would have. In particular, if you want an application to run on fewer threads, you might want cpusets or affinity, rather than CFS bandwidth limits; if you've set CFS bandwidth limits, perhaps you want to run on all CPUs for a fraction of bandwidth, rather than a fraction of CPUs for all their bandwidth.

Given all of the above, I personally would favor not taking CFS bandwidth quotas into account by default, in the interests of optimizing for the "finish fast" case or the "handle bursts of requests in parallel" case, and in the interests of simplicity (especially in the face of hierarchical quotas).

Note: I'm happy to start the FCP on this once it settles, but I don't want to do so after just having presented a new argument for consideration.


Contributor

the8472 commented on Nov 9

That's absolutely true. However, looking into it further, I'm not actually sure the degree to which either CPU shares or CFS bandwidth limits should affect available_parallelism, and in particular I'm not sure the semantic implemented by num_cpus is exactly what we want.

Prior art:

  • obviously num_cpus, see positive impact in the original PR
  • the java runtime. They use this for scaling their multi-threaded GCs and the availableProcessors() standard library API.
  • .NET - reflects quotas by default, using the thread count instead requires opt-in

cpusets are directly a limitation on affinity, and I think it's appropriate for available_parallelism to respect them.

But they require static allocation of individual cores; that's why the default strategy of container runtimes is to allocate based on quotas or shares and let the kernel scheduler hash out the details.

CPU shares limit what percentage of the CPU you get, and effectively throttle your process down proportionally if other processes are also using the CPU, while allowing you to burst past that if the CPU isn't busy. Given that if the CPU isn't busy you're allowed to saturate it, I don't think this should affect available_parallelism.

Agreed, shares shouldn't factor into it.

On the other hand, if you expect to do a CPU-bound computation for a relatively short amount of time, potentially less than the quota, then running on all 8 CPUs may let you finish your computation four times faster. And if you're doing non-CPU-bound work, such as responding to HTTP requests, running on all CPUs may give a burst of requests lower overall latency because there are more CPUs to respond to requests concurrently, as long as you don't saturate your CPU bandwidth.

That seems like a fragile situation. If you want to finish fast, that only works as long as the combined workload, including the increased synchronization overhead from running more threads and the decreased throughput from decreased turbo clocks, stays below the bandwidth limit. As soon as you hit that limit you fall off a latency cliff instead of degrading gently. If you care enough about latency that you distribute the work of individual requests over multiple threads, you're probably not going to blindly apply CFS bandwidth limits, or you might configure your thread pools manually based on benchmarks.

And if you're doing non-CPU-bound work, such as responding to HTTP requests, running on all CPUs may give a burst of requests lower overall latency because there are more CPUs to respond to requests concurrently, as long as you don't saturate your CPU bandwidth.

If you have a thread pool for blocking IO tasks, then setting it to the number of available CPU cores is rarely a good metric anyway, since IOPS, external services, or whatever else will be the limiting factor, not available cores. Even on a single-core machine you'll want more threads to keep IO queues full. Imo that's not a good argument against factoring quotas into available_parallelism.

CFS bandwidth limits don't have bursting,

They optionally do now with cfs_burst_us. Granted, this hasn't trickled down to containers yet.


Member

joshtriplett commented on Nov 10

@the8472 The future availability of bursting makes me more inclined to think we shouldn't limit available_parallelism based on CFS bandwidth.

I'm less concerned about the case of many HTTP request/response threads, and more concerned about the case of CPU-bound computation that can finish within one quota period (potentially including bursting).


Contributor

the8472 commented on Nov 11

edited

I'm less concerned about the case of many HTTP request/response threads, and more concerned about the case of CPU-bound computation that can finish within one quota period (potentially including bursting).

That concern seems quite theoretical to me. It falls apart in several related ways.

(1) On a system where CPU quotas are enforced to a degree where a service gets a much smaller slice than the available hardware threads, that is usually because there are other services doing their own work, which means your service may not even benefit from spreading its work over many threads, because it may end up competing with the other services.

(2) When you optimize for latency you cannot just optimize for the minimum latency. Even the 99th-percentile latency of individual work items can end up determining the user-perceived overall latency. Now let's consider what happens when you distribute work over all hardware threads. The minimum latency can go down, assuming the work is parallelizable. But once it begins working on the Nth item and the quota runs out, all the threads will be stuck and that request will be stalled for the rest of the quota period, which drives up tail latencies. This happens even when your request queues are otherwise empty (e.g. when the system auto-scales based on queue size), because by the time the threads get frozen the item has already been popped off the work queue. To lower the variance of latencies one has to make sure that threads don't get frozen, and one does that by only starting as many threads as there is CPU quota available for.

(3) Even ignoring the increase in tail latencies from (2), median latencies will also go up under load because keeping more cores active increases IPC overhead and decreases available turbo clocks. Competing with other services (1) will also lead to more involuntary context switches and cache pressure. Lower throughput for the same amount of spent core-seconds means higher latencies.

(4) The scenario only makes sense if the whole system is almost always under-utilized AND work arrives at a very predictable rate (if it's a Poisson process, then random queue-depth spikes can screw you over via (2)). But in such a situation you wouldn't need quotas in the first place.


Member

Author

yoshuawuyts commented 22 days ago

It seems like the conversation has reached a standstill once more. There appear to be two opposing, well-argued positions, and it doesn't appear we'll automatically come to a resolution. The way I see it, we now have several options to choose from:

  1. We punt the discussion on the implementation detail, and move forward to FCP with the implementation as-is. What's currently being discussed does not affect the end-user API, and can freely be adjusted post-stabilization as well.
  2. @the8472 files a PR implementing their proposed changes, and once we've merged that we move to FCP. Similar to option 1, we're at liberty to change the underlying implementation as long as the API and documented guarantees don't change.
  3. In the case neither of these options seems acceptable, the libs team schedules this as a discussion topic on the agenda of one of the sync meetings, with the explicit goal of figuring out how to proceed.

@joshtriplett @the8472 I'd love to hear from you!


Contributor

the8472 commented 22 days ago

Well, I'm fine with 1 or 2 as long as adding the cgroup quotas is an acceptable change itself. But that acceptability is where the disagreement with josh lies, I believe.
If we need a more complex API that distinguishes between steady-state and burstable parallelism, or perhaps fractional CPU resources, then maybe the current API shouldn't be stabilized.


Member

joshtriplett commented 22 days ago

edited

@the8472 That's a reasonably convincing argument, thank you. I do think there are workloads (especially with bursting) where it makes sense to use all available CPUs, if you expect the load on services on the same machine to be mostly uncorrelated. However, I find your point about the 99th-percentile latency sufficiently compelling to suggest that quota-aware behavior should be the default. (We should document that behavior clearly, though.)

So, I'd be on board with the plan of:

  • Stabilize this as-is.
  • Merge a PR to add cgroup bandwidth limits (and corresponding documentation) as soon as that PR becomes available.
  • Consider adding a second API for available_parallelism_burst or similar at that time, which respects affinity but ignores bandwidth limits. We can consider possible designs at that time, such as a builder that can add more options in the future, but I think whatever design we might land on there will still need a simple no-parameter free function like the current available_parallelism, and I think that simple function should respect cgroup bandwidth limits by default.
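Purely as a sketch of that builder direction; every name below (ParallelismOptions, respect_quota) is invented and not a std API:

use std::io;
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical builder: respects cgroup bandwidth limits by default,
// with an opt-out for burst-oriented workloads.
struct ParallelismOptions {
    respect_quota: bool,
}

impl ParallelismOptions {
    fn new() -> Self {
        ParallelismOptions { respect_quota: true }
    }

    fn respect_quota(mut self, yes: bool) -> Self {
        self.respect_quota = yes;
        self
    }

    fn get(self) -> io::Result<NonZeroUsize> {
        // Sketch only: defer to the simple free function here; a real
        // implementation would branch on `self.respect_quota`.
        thread::available_parallelism()
    }
}

fn main() -> io::Result<()> {
    // A burst-oriented pool: ignore bandwidth limits, respect affinity.
    let burst = ParallelismOptions::new().respect_quota(false).get()?;
    println!("burst parallelism: {}", burst);
    Ok(())
}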


Member

joshtriplett commented 22 days ago

edited

@rust-lang/libs-api, shall we stabilize std::thread::available_parallelism(), and also plan to merge future enhancements to it to support cgroup bandwidth limits?

@rfcbot merge


rfcbot commented 22 days ago

edited by dtolnay

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.


Contributor

Amanieu commented 22 days ago

The main question this API should answer is: how big should I make my thread pool for CPU-bound tasks? In this case it's better to overestimate (costs memory) than to underestimate (missing out on potential parallelism). Essentially, this function should return the maximum potentially available parallelism.


Contributor

the8472 commented 22 days ago

What kind of underestimate are you worried about?


Member

cuviper commented 22 days ago

In this case it's better to overestimate (costs memory) than underestimate (missing out on potential parallelism).

Overestimating also costs performance, like cache contention and kernel scheduling overhead.


strohel commented 21 days ago

Overestimating also costs performance, like cache contention and kernel scheduling overhead.

To support this with a data point: earlier I benchmarked a simple actix-web based microservice. Overestimating the number of worker threads (from 1 to 4) resulted in a ~50% increase in CPU time per request.

I can re-run the benchmark with up-to-date versions and different parameters if requested.


rfcbot commented 19 days ago

🔔 This is now entering its final comment period, as per the review above. 🔔


Contributor

the8472 commented 15 days ago

#91057 updates the documentation in anticipation of cgroups support.


rfcbot commented 9 days ago

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.


