But this only happens if the copying collector's space is growable and discontinuous. I think a better scheme is to just use copying for the young generation and mark-and-sweep for the older ones. Structuring the heap into blocks seems quite misguided, imo.
This is exactly how the JVM's CMS collector works: eventually the old gen becomes too fragmented to allocate into, you hit a concurrent mode failure, and the JVM falls back to a stop-the-world full GC that compacts the entire old heap.
When Hadoop depended on a single NameNode, a 64 GB+ heap could take 5+ minutes to collect when it hit these failure modes. Quite painful.
Yes, but there are two easy solutions: 1. Larger nurseries, to minimize churn in the older generations; fragmentation is a symptom of too many "false survivors". 2. Concurrent heap compaction.
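On the JVM, both map onto standard knobs; a hypothetical invocation (heap sizes invented, app.jar a placeholder):

    # 1. Larger nursery: pin half of a 16 GB heap as young gen
    java -Xms16g -Xmx16g -Xmn8g -jar app.jar

    # 2. Concurrent compaction: use a concurrently compacting collector
    java -Xmx16g -XX:+UseShenandoahGC -jar app.jar   # or -XX:+UseZGC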
> Structuring the heap into blocks seems quite misguided, imo.
So, how would you keep allocations fast on multi-core hardware?
FTA: “I am targeting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.”
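To make that concrete, here's a minimal C sketch (names and sizes invented, not the article's actual design) contrasting a shared bump pointer, where every allocation is an atomic RMW on the same cache line, with per-thread buffers that only touch the shared cursor once per refill:

    #include <stdatomic.h>
    #include <stddef.h>

    static _Atomic(char *) alloc_ptr;  /* shared heap cursor */
    static char *alloc_limit;          /* end of the space */

    /* Every mutator hammers one cache line: correct, but serializing. */
    void *alloc_shared(size_t n) {
        char *p = atomic_fetch_add(&alloc_ptr, n);
        return (p + n <= alloc_limit) ? p : NULL;  /* NULL -> trigger GC */
    }

    /* Per-thread buffer: the hot path is thread-local; the shared
       cursor is only touched once per TLAB_SIZE bytes of allocation. */
    enum { TLAB_SIZE = 64 * 1024 };
    static _Thread_local char *tlab_cur, *tlab_end;

    void *alloc_tlab(size_t n) {  /* assumes n <= TLAB_SIZE */
        if (tlab_cur == NULL || (size_t)(tlab_end - tlab_cur) < n) {
            char *b = atomic_fetch_add(&alloc_ptr, TLAB_SIZE);
            if (b + TLAB_SIZE > alloc_limit) return NULL;
            tlab_cur = b;
            tlab_end = b + TLAB_SIZE;
        }
        void *p = tlab_cur;
        tlab_cur += n;
        return p;
    }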
Per-core heaps. The count, sizes, and proportions of the generational heaps can also be scaled dynamically according to allocation rate, a target percentage of CPU time spent in GC, and collection efficiency, as .NET's GC does.
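A toy version of that feedback loop, in C (thresholds and growth factors invented for illustration; .NET's actual heuristics are far more involved):

    #include <stddef.h>

    /* Resize the nursery after each minor GC based on survival rate:
       many survivors means objects had no time to die (grow it);
       almost none means we can shrink it and give memory back. */
    size_t next_nursery_size(size_t cur_size, size_t bytes_survived,
                             size_t min_size, size_t max_size) {
        double survival = (double)bytes_survived / (double)cur_size;
        if (survival > 0.10 && cur_size < max_size)
            return cur_size * 2;   /* too many "false survivors" */
        if (survival < 0.01 && cur_size > min_size)
            return cur_size / 2;   /* nearly everything died young */
        return cur_size;
    }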
Per-thread nurseries worked just fine for Java for close to a decade. The JVM later made them more efficient with escape analysis (eliding allocations that don't escape the thread). Escape analysis indirectly led to borrow checking and Rust.
I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.
Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via restartable sequences (rseq) on Linux; AFAIK no other OS is supported.
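Roughly like this C sketch, using sched_getcpu() as a stand-in (all names invented); the real tcmalloc wraps the list operation in an rseq critical section that the kernel restarts if the thread is migrated or preempted mid-sequence, which a plain CPU-ID read can't guarantee:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stddef.h>

    #define MAX_CPUS   256
    #define CACHE_LINE 64

    struct cpu_cache {
        void *freelist;                     /* intrusive per-CPU free list */
    } __attribute__((aligned(CACHE_LINE))); /* avoid false sharing */

    static struct cpu_cache caches[MAX_CPUS];

    void *alloc_from_cpu_cache(void) {
        int cpu = sched_getcpu();           /* itself rseq-backed on modern glibc */
        if (cpu < 0 || cpu >= MAX_CPUS)
            return NULL;
        struct cpu_cache *c = &caches[cpu];
        void *obj = c->freelist;
        if (obj)                            /* pop head; racy without rseq! */
            c->freelist = *(void **)obj;
        return obj;                         /* NULL -> refill from central heap */
    }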
> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.
Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better, because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache.

- If you have far more threads than cores, per-CPU should win.
- If you're single-threaded, per-thread should win.
- If you have a thread-per-core reactor, and either use the whole machine or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever.

All those profiles exist; what a "typical application" is, I dunno.
There's also the concept of "virtual CPU IDs" (the terminology may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines: the IDs are bounded by how many threads are actually running, and running threads <= total threads, so it should never need more caches than per-thread does. Not sure if that's even in mainline Linux now or not. https://lwn.net/Articles/885818/
Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.