
I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).

As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will, so it's not a simple doubling of size. E.g., Facebook's work in the OCP with 1-socket sleds (or blades):

https://www.opencompute.org/documents/ocp-yosemite-v3-platfo...



> As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will

Oh it certainly exists. Computers are space-optimized to the point of nonsense. IIRC, most people don't even bother to use widely available 1U servers because you run out of power before you fill up 40U racks.

Hyperscalers such as Google, Netflix, or Amazon are a bit different of course (IIRC, you work at one, right?), since they can purpose-build their data centers with far denser power delivery and actually support 40 or even 80 computers per rack. But more typical offices simply don't have the power density to run 1U nodes or smaller (e.g., Supermicro's 2-node-in-1U or 4-node-in-2U systems).

In effect: modern computer systems usually run out of power before they run out of rack space. Especially when you consider that every watt delivered turns into a watt of heat, which then requires a more powerful air conditioner to keep the room within operating specs.

So you're right that modern datacenters probably don't care about size. Space is relatively cheap; power lines are expensive! Two sockets per node, two nodes per 1U == 160 CPUs per 40U rack. Ten such racks would draw over a megawatt once we factor in air conditioning, and a typical building just won't handle that.
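The megawatt claim checks out on the back of an envelope. A quick sketch, assuming roughly 300 W per socket and a PUE around 2.1 (both numbers are my assumptions, not from the thread):

```python
# Back-of-envelope rack power math.
# Assumed figures: ~300 W per CPU socket, PUE ~2.1 (cooling overhead).
sockets_per_u = 2 * 2                    # 2-socket systems, 2 per 1U
sockets_per_rack = sockets_per_u * 40    # 40U rack -> 160 sockets
watts_per_socket = 300
it_load_per_rack = sockets_per_rack * watts_per_socket   # 48,000 W per rack
pue = 2.1
total_10_racks = 10 * it_load_per_rack * pue             # ~1.0 MW for 10 racks
print(sockets_per_rack, it_load_per_rack, total_10_racks)
```

So ten dense racks land right at the megawatt mark once cooling is included, which is well beyond what ordinary office power feeds can supply.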

-------------

> I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).

But that "lock contention" you're measuring is something like 250ns to 500ns over a channel that's 800 Gbit/sec wide.

EDIT: To be more specific: I'm talking about the MESI messages going over the NUMA fabric.

In contrast, a packet over 10 Gbit Ethernet is basically two orders of magnitude less bandwidth and an order of magnitude more latency (maybe 2500 to 5000 nanoseconds of latency?).

If the application were truly limited by communication over the NUMA fabric, switching to Ethernet or InfiniBand would only slow it down further.

EDIT: Case in point: we don't do spinlocks over Ethernet. I mean, we could in theory (RDMA a region of memory over Ethernet and then hold it as a spinlock), but we all know it's a bad idea. We do spinlocks at the L3 and/or NUMA-fabric level (and maybe we'll do them at the PCIe 5.0 / CCIX level, as cache-coherent I/O becomes possible).
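Putting the numbers above side by side makes the point concrete. A contended spinlock acquire costs roughly one coherence round trip, so moving the same lock over RDMA/Ethernet multiplies every acquire (and every failed spin) by the latency ratio:

```python
# Compare the NUMA/cache-coherence fabric against 10 Gbit Ethernet,
# using the approximate figures quoted above.
fabric_bw_gbit = 800
eth_bw_gbit = 10
bw_ratio = fabric_bw_gbit / eth_bw_gbit      # 80x: ~2 orders of magnitude

fabric_latency_ns = 250      # low end of the 250-500 ns range
eth_latency_ns = 2500        # low end of the 2500-5000 ns range
latency_ratio = eth_latency_ns / fabric_latency_ns   # 10x

print(bw_ratio, latency_ratio)   # 80.0 10.0
```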


Ah, thanks for the details. I wasn't referring to NUMA-induced lock contention, but rather lock contention in general. I've seen a workload hit 64 CPUs with 80% of CPU time in lock contention (and others in the 10-30% range). While that means the developer has a big problem to fix, it also has me wondering about single sockets getting big enough -- 128 CPUs is already hard to use well.

Back when I first saw multi-socket systems with 2-4 total CPUs, getting the extra CPUs online was all goodness. But adding another 128 CPUs to my already-128-CPU system... is there a point where we can say, in general, that we already have enough cores? In the talk I referred to the 850,000-core GPU, and how I couldn't see that ever working as general-purpose CPUs with the software of today. With 3D stacking, I think we'll reach a practical core limit on a single socket, and just won't need the complexity (including NUMA) of multi-socket anymore.
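One way to see why the second 128 CPUs helps so little: treat the contended time as a serial fraction and apply Amdahl's law (the 5% serial fraction below is an illustrative assumption, far milder than the 80% case above):

```python
# Amdahl's law: speedup on n cores with serial fraction s.
def amdahl(n: int, s: float) -> float:
    return 1.0 / (s + (1.0 - s) / n)

s = 0.05                            # assumed: just 5% of work serialized
print(round(amdahl(128, s), 1))     # ~17.4x on 128 cores
print(round(amdahl(256, s), 1))     # ~18.6x on 256 cores
```

Doubling from 128 to 256 cores buys roughly a 7% improvement even at only 5% serialization; at 80% of time spent in contention, the second socket is nearly pure waste.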


But it's the same in GPU-land.

There are plenty of tasks that max out at one block / threadgroup of 1024 CUDA threads. cudaMemcpy is a silly example (it probably saturates memory bandwidth with just 32 or 64 cores in use), but there are plenty of tasks that simply don't scale to the full GPU.

Just because some tasks (many tasks?) fail to use more than 32 GPU cores doesn't mean GPU parallelism is useless. It just means that when you program those particular tasks, you only use 32 GPU cores! Then you use the remaining GPU cores on _other_ tasks (possibly in parallel).

IIRC, cudaMalloc and many other primitives in the CUDA framework have been shown to have very little parallelism at all. You need to work at keeping this "sequential code" outside of your inner loops. (It runs on CUDA stream #0, which, on older hardware at least, is sequentially scheduled.)

----------

1. Some tasks can effectively use infinite cores (SIMD lanes, really, for GPUs... but it's the same idea, since a SIMD lane can largely emulate a thread as long as you're careful about branch divergence)

2. Some tasks can be parallelized at the application / operator level. Run many applications in parallel ("Makefile parallelism")

3. Some tasks (memcpy) are so memory-bound that parallelism will never help.

4. Some tasks have a better solution that becomes feasible with more compute power.
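Category #3 is worth quantifying: a memcpy of B bytes through memory bandwidth W takes B/W seconds no matter how many cores issue the loads and stores. A sketch with assumed figures (1 GiB copied at a sustained 50 GB/s; both numbers are mine, not measured):

```python
# Memory-bound work doesn't benefit from more cores:
# the copy time is fixed by bytes / bandwidth.
bytes_to_copy = 1 << 30       # 1 GiB (assumption)
mem_bw = 50e9                 # 50 GB/s sustained bandwidth (assumption)
t = bytes_to_copy / mem_bw    # same wall-clock time with 1 core or 64
print(round(t * 1000, 1), "ms")
```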

---------

Let's take a CPU example: H.264 encoding. IIRC, this task barely scales to 8 cores and has diminishing returns beyond that.

But an example of #2 would be YouTube: one encoding machine handles many transcodes in parallel. You don't run just one instance of the problem (using 8 cores); you run 32 in parallel, and each of those 32 instances can effectively use 8 cores for H.264 encoding.

And #4 can still happen: H.264 is pretty easy for modern computers, with little opportunity for parallelism. Switching to H.265 or even AV1 increases the compute power needed and allows a single task to scale up to 16-64 cores. Now your hypothetical 2x128 machine can only run 4 transcoding sessions at a time.
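The session counts above fall out of simple division, with the per-session core counts taken straight from the thread (8 cores for H.264, 64 as the upper end of the AV1 range):

```python
# Two levels of parallelism: each encode scales to k cores,
# and independent encodes fill the remaining cores.
total_cores = 2 * 128     # hypothetical dual-socket, 2x128 cores
h264_cores = 8            # per-session scaling limit for H.264
av1_cores = 64            # upper end of the quoted 16-64 core range
print(total_cores // h264_cores)   # 32 parallel H.264 sessions
print(total_cores // av1_cores)    # 4 parallel AV1 sessions
```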

-----------

The dual-socket machine is still superior to a single-socket machine for a transcoding cluster. 10 Gbit Ethernet is more than sufficient to handle 4x AV1 sessions (especially because AV1 encodes slower than realtime), so right there we've cut the number of Ethernet cables in half, which means we've cut the number of 10 Gbit switches in half.

Having the two sockets share one Ethernet port is an efficiency gain even if there's no task-to-task communication at all, if only for the I/O-sharing capability of the NUMA fabric.



