Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via the rseq operations on Linux; afaik no other OS is supported.
> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.
Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.
There's also the concept of "virtual CPU IDs" (the language may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, as running threads <= total threads. Not sure if that's even in mainline Linux now or not. https://lwn.net/Articles/885818/
Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.
Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via the rseq operations on Linux; afaik no other OS is supported.
> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.
Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.
There's also the concept of "virtual CPU IDs" (the language may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, as running threads <= total threads. Not sure if that's even in mainline Linux now or not. https://lwn.net/Articles/885818/