Per-thread nurseries worked just fine for Java for close to a decade. They made ...

kccqzy · on July 10, 2024

While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

convolvatron · on July 10, 2024

that would almost certainly be true if you could make statements about thread-cpu affinity

scottlamb · on July 10, 2024

> While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via the rseq operations on Linux; afaik no other OS is supported.

> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.

There's also the concept of "virtual CPU IDs" (the language may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, as running threads <= total threads. Not sure if that's even in mainline Linux now or not. https://lwn.net/Articles/885818/

kccqzy · on July 11, 2024

Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.

gpderetta · on July 11, 2024

If you allow thread migration across CPUs, then per-CPU nurseries become problematic when collecting, unless you stop the world.