An article from Ars Technica [1] on this topic makes the following point:
"Unlike most Azure compute resources, which are typically shared between customers, the Cray supercomputers will be dedicated resources. This suggests that Microsoft won't be offering a way of timesharing or temporarily borrowing a supercomputer. Rather, it's a way for existing supercomputer users to colocate their systems with Azure to get the lowest latency, highest bandwidth connection to Azure's computing capabilities rather than having to have them on premises."
Somewhat interestingly, this sounds a bit like hybrid cloud, except that it's hosted entirely in Azure datacenters rather than partially on-premises.
With distributed computing and cloud hosting allowing for thousands of instances with huge amounts of resources - obscene amounts of RAM, measured in terabytes - is there still a need for supercomputers? They are, effectively, the same thing, right?
I am guessing that I'm not understanding something fully. I don't really see the benefit anymore, now that you can lease thousands of cores, petabytes of disk, and multiple terabytes of RAM.
What's the benefit? What am I missing? Google is none too helpful.
TL;DR - It does come down mainly to the network, but in far more interesting ways than is apparent from some of the answers here - and also the nature of the HPC software ecosystem that co-evolved with supercomputing for the last 30+ years. This community has pioneered several key ideas in large scale computing that seem to be at risk in the world of cheap, lease-able compute.
In scientific computing (usually where you see them), the primary workload is simulation/modeling of natural phenomena. The nature of this workload is that the more parallelism that is available, the bigger/more fine-grained a simulation can run, and hence the better it can approximate reality (as defined by the scientific models which are being simulated). Examples of this are fluid dynamics, multi-particle physics, molecular dynamics, etc.
The big push with these types of workloads is to be able to get efficient parallel performance at scale - so it isn't just about the # of cores, PB of disk, or TB of DRAM, but whether the software and underlying hardware work well together at scale to exploit the available aggregate compute.
So the network matters, not just raw bandwidth but things like latency of remote memory access and the topology itself - for example, the Cray XCs going to Azure allow for a programming model (PGAS) that allows for large, scalable global memory views where a program can view the total memory of a set of nodes as a single address space. Underneath, the hardware and software work together to bound latency, do adaptive per-packet routing and ensure reliability - all at the level of 10s of thousands of nodes. In a real sense, the network is the (super)computer - the old Sun slogan.
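For anyone who hasn't met PGAS: the gist is that each node exposes its memory as a slice of one big address space, and any node can read or write a remote slice without the owner running any code to service the request. Here's a rough sketch of the same idea using MPI one-sided RMA via mpi4py - this is not Cray's actual stack (that's Chapel/UPC/SHMEM over the Aries hardware), and all the names below are illustrative:

    # PGAS-style "global array" sketch using MPI one-sided RMA (mpi4py + numpy).
    # Each rank owns a slice; any rank can fetch a remote slice directly.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local_n = 4
    local = np.full(local_n, rank, dtype='d')   # this rank's slice of the global array

    # Expose the local slice as a window into the shared global address space.
    win = MPI.Win.Create(local, disp_unit=local.itemsize, comm=comm)

    target = (rank + 1) % size                  # read the slice owned by the next rank
    remote = np.empty(local_n, dtype='d')

    win.Fence()                                 # open an RMA epoch
    win.Get([remote, MPI.DOUBLE], target)       # one-sided read, no recv on the target
    win.Fence()                                 # close the epoch; 'remote' is now valid

    print(f"rank {rank} read rank {target}'s slice: {remote}")
    win.Free()

Run with something like `mpiexec -n 4 python pgas_sketch.py`. On hardware like Aries, that Get becomes a remote memory operation handled largely by the NIC, which is exactly why bounded latency and adaptive routing matter so much.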
Where else is this useful? Well, look at deep learning - the new hotness in parallel computing these days. Everyone is realizing that it's amazing to run on GPUs, but once you have large enough problems (which the big guys do), you end up having to figure out how to get a bunch of GPUs to communicate efficiently during data-parallel training (that efficient parallelism thing). This happens to map to a relatively simple set of communication patterns (e.g. AllReduce) that is a small subset of what the HPC community has solved for - so it's interesting that many deep learning engineers are starting to see the value of things like RDMA and frameworks like MPI (Baidu, Uber, MSFT and Amazon, for starters).
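To make the AllReduce bit concrete: data-parallel training boils down to one of these per step to average gradients across workers. A minimal mpi4py sketch - the 'grad' array is just a stand-in for a real gradient tensor, and frameworks like Horovod do the same thing over NCCL/MPI with fancier ring/tree schedules:

    # Data-parallel gradient averaging is essentially one AllReduce per step.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    grad = np.random.rand(1_000_000)        # this worker's local gradients (illustrative)
    avg = np.empty_like(grad)

    comm.Allreduce(grad, avg, op=MPI.SUM)   # sum across all workers in one collective
    avg /= comm.Get_size()                  # divide to get the mean gradient

    # 'avg' is now identical on every rank; each worker applies it to its model copy.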
Interestingly though, the word supercomputing is being co-opted by the very companies that you're positioning as the alternative - the Google TPU Cloud is a specialized incarnation of a typical supercomputing architecture. Sundar Pichai refers to Google as being a 'supercomputer in your pocket'.
Alas, HN only allows me to vote your comment up once.
I really love the detailed and expansive responses that some questions generate. Hopefully, they are of interest to more than just myself. I've been retired since 2007, so it is living vicariously through you folks - as I absolutely don't have to make these choices anymore.
A part of me wants to take on some big project, just to get back into it. I actually miss working. Go figure?
> Now to find out what kind of applications are greatly benefited from this.
The common applications are usually scientific computing; here's a quick overview[1] of the kinds of algorithms run on supercomputers: PDEs, ODEs, FFTs, sparse and dense linear algebra, etc.
These are usually used for scientific applications like weather forecasting, where you need the result in time (i.e. before the hurricane reaches the coast!).
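Concretely, most of those PDE codes are stencil updates plus a neighbour ("halo") exchange every single timestep, which is exactly the latency-sensitive pattern the interconnect has to be good at. A toy 1D heat-equation sketch in mpi4py (illustrative only, nothing like a production weather code):

    # Toy 1D heat equation: each rank owns a chunk of the rod and must swap
    # boundary ("halo") cells with its neighbours every timestep.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 100                                   # interior cells per rank (illustrative)
    u = np.zeros(n + 2)                       # +2 halo cells at either end
    if rank == 0:
        u[1] = 1.0                            # a hot spot at one end of the rod
    alpha = 0.1

    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    for step in range(1000):
        # Halo exchange happens every step, so interconnect latency (not core
        # count) is often what bounds how fast the whole simulation can run.
        comm.Sendrecv(u[1:2], dest=left, recvbuf=u[n+1:n+2], source=right)
        comm.Sendrecv(u[n:n+1], dest=right, recvbuf=u[0:1], source=left)
        u[1:n+1] += alpha * (u[0:n] - 2 * u[1:n+1] + u[2:n+2])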
Yes, big hyperscalers are adopting CLOS networks as used in HPC for decades. 400ns switching, per se, is not that bad (IIRC 100G IB switches have around 100ns switching cost). Though I wonder, when they're doing ECMP at the L3 level, what the latency is when crossing multiple switches (for comparison, IB does ECMP at the L2 level)? Also, you need software to take advantage of this as well. MPI stacks tend to use RDMA with user-space networking. For ethernet, there is RoCE v1/v2, which is comparable to IB RDMA, though I'm not sure how mature that is at this point.
Certainly it's true that ethernet has stepped up the game in recent years, while at the same time IB has more or less become a single-vendor play. So it'll be interesting to see what the future holds.
Further, proprietary supercomputer networks have things like adaptive non-minimal routing, which enables efficient use of network topologies that are more affordable at scale, such as flattened butterfly, dragonfly etc. AFAIK neither IB nor ETH + IP-level ECMP support anything like that.
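If you want to put your own numbers on those latency claims, the standard smoke test is an MPI ping-pong between two ranks pinned on different nodes. A minimal mpi4py sketch (for anything serious, use the OSU microbenchmarks instead):

    # Minimal MPI ping-pong: bounce an 8-byte message between ranks 0 and 1,
    # then report half the average round-trip as the one-way latency.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # run with exactly 2 ranks, one per node
    buf = np.zeros(1, dtype='d')
    iters = 10000

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t1 = MPI.Wtime()

    if rank == 0:
        print(f"one-way latency ~ {(t1 - t0) / iters / 2 * 1e6:.2f} us")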
For special topologies, one can put multiple IB NICs in a single host. Nvidia has support for doing DMA transfers directly into GPU memory from 3rd-party PCIe devices [0].
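Related sketch: with a CUDA-aware MPI build, mpi4py will accept GPU buffers (e.g. CuPy arrays) directly, and whether the transfer actually takes the GPUDirect RDMA path depends entirely on the MPI build, NIC driver and PCIe/NVLink topology - so treat this as illustrative only:

    # Point-to-point transfer of a GPU-resident buffer. With GPUDirect RDMA the
    # NIC DMAs straight to/from GPU memory without staging through host RAM.
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    buf = cp.arange(1 << 20, dtype=cp.float32)   # ~4 MB buffer living in GPU memory

    if rank == 0:
        comm.Send(buf, dest=1)                   # same call as with NumPy arrays
    elif rank == 1:
        out = cp.empty_like(buf)
        comm.Recv(out, source=0)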
Supercomputers do optimize for different things, e.g. they have faster and lower-latency network interconnects. But over time the differentiation will diminish as the public cloud providers invest more.
You can't beat economics, and the public cloud market will grow to be much larger than the supercomputing market. This is similar to why supercomputers switched from bespoke processors to commodity x86.
Parallel processing is about network latency and bandwidth for the types of algorithms that don't divide easily into small computable/parallelizable bits. For those tasks, supercomputer buses are unmatched.
That makes some sense, so I'll just post my thanks in this one reply so I don't have to thank everyone individually.
Much appreciated.
It does lead me to one additional question - is the need for additional speed great enough to justify this? I don't know how much faster it would be and I tried Google and they are not even remotely helpful. I may just be using the wrong query phrases.
> is the need for additional speed great enough to justify this?
Yep. I did molecular dynamics simulations on the Titan supercomputer, and also tried some on Azure's cloud platform (using Infiniband). The results weren't even close.
When you say the results weren't even close, and if you have time and don't mind, could you share some numbers/elaborate on that?
My experience with HPC is fairly limited, compared to what I think you're discussing. In my case, it was things like blade servers which was a cost decision. We also didn't have the kind of connectivity and speeds that you have available today.
(I modeled traffic at rather deep/precise levels.)
So, if you have some experience numbers AND you have the free time, I'd love to learn more about the differences in the results? Were the benefits strictly time? If so, how much time are we talking about? If you had to personally pay the difference, which one would you select?
Thanks for giving me some of your time. I absolutely appreciate it.
The problem with molecular modeling/protein folding algorithms is that each molecule interacts with every other molecule, so you have an n^2 problem. Sure, heuristics get this way down to n log n, but what fundamentally slows it is not the growth in computation but the growth in the data being pushed around: each node needs the data from every node's last timestep. For these problems, doubling the nodes of a cluster might not even noticeably improve the speed of the algorithm. When I was helping people run these, they were looking at a few milliseconds of simulated time for a small number of molecules that took a few weeks of supercomputer time. So lots and lots of computing generations were/are needed before we get anywhere close to what they want to model.
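For the curious, the naive all-pairs version looks like this (toy numpy sketch, not a real MD kernel; the n log n heuristics mentioned above are things like neighbour lists, Barnes-Hut trees and PME):

    # Naive all-pairs force calculation: O(n^2) work per timestep, and in a
    # parallel run every node also needs everyone else's updated positions
    # each step - which is the communication cost described above.
    import numpy as np

    def pairwise_forces(pos):
        # pos: (n, 3) particle positions; toy inverse-square interaction
        diff = pos[None, :, :] - pos[:, None, :]              # (n, n, 3) displacements
        dist2 = (diff ** 2).sum(axis=-1) + np.eye(len(pos))   # pad diagonal to avoid 0/0
        return (diff / dist2[..., None] ** 1.5).sum(axis=1)   # (n, 3) net force per particle

    n = 1000
    pos = np.random.rand(n, 3)
    forces = pairwise_forces(pos)   # n^2 pair terms, recomputed every single timestep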
I wonder why they think I was actually wanting the oversimplified stuff? I thought I'd made it clear that I wanted the technical details from their experience. Ah well... Your response is better than mine would have been.
Mainly network, as others have said. But there are also scientific applications that are more appropriate for distributed computing (don't need the fast interconnect), but get run on supercomputers anyway because of data locality (post-processing/analysis) or grant structures.
The cool thing here (hopefully) is that it makes it easy to have both: HPC/supercomputer for jobs that need that and cheaper/easier cloud resources for jobs that don't.
I think the key is in your parent, with the focus on increasing overall performance through decreasing latency (as opposed to increasing, say, parallelism).
They could still be shared on a job-by-job basis. For jobs that need the ridiculously fast interconnects (large datasets to be worked on by thousands of cores) you go Cray; when they don't, you go commodity Azure.
It would be interesting to see what tasks require both a supercomputer and normal systems, tightly connected - unless, of course, it's consultant BS/upsell trying to cover for or work around performance issues elsewhere in the stack.
Not a lot. They sold off their interconnect technology which was their last real secret sauce. Now they really just sell turnkey x86 HPC clusters with some mainframe-esque job dispatch software.
Even though Cray sold the rights to Aries (the interconnect) to Intel, they still sell x86 Aries-based systems - XC30, XC40, XC50. They also sell more generic HPC clusters, originally from Appro, which they bought in 2012.
...turnkey HPC clusters with liquid cooling, DC power distribution, extraordinarily dense compute blades, custom HPC network fabrics, and embedded management processors throughout to gather monitoring data.
It's a shame you can't easily get a cable to link computers via PCIe (the spec supports cables over 3m [0]). Multiple x16 connections would give you some decent bandwidth while staying low-latency, without the network overhead or shelling out for high-end switches.
Using dual-port adapters, you can do a three-node Infiniband ring with no routers involved and get full speed. Dual-port FDR (56 Gbit/s) cards are about $150 on eBay nowadays. If you've bought the hardware to warrant needing such a setup (i.e. 3 dual-socket Xeon servers), that's a negligible cost.
PCIe fabrics exist but they're not as great as you might expect. Because of the slow 8 Gbps lane speed the cables are enormous and the switches support very few ports. And the DMA engine may not give you the performance you expect if it doesn't have NIC-like features.
Cray has produced (still produces?) compilers for HPC applications that are considered pretty decent. At the very least, some very large legacy codes got caught up in the vendor lock-in - in software that usually means non-portable preprocessor directives, for example.
Could be interesting for continuous integration and testing new Cray setups.
I usually have to spend a week or so to adapt our builds once we get a system upgrade. It's mostly to hack around weird Cray setups and because we dare to link C++ and Fortran code bases on GPU.
SRC Labs was started by Seymour Cray just before he died and they have started transferring patents to a Native American tribe to avoid review by the patent office's review board.
That kind of behavior really hurts the Cray name in my eyes. Microsoft should find a better partner.
Cray Research, from which Seymour departed long before, was acquired by SGI. When SGI cratered, the SGI and Cray _names_ were separately sold off. "Cray" was bought by Tera, a faltering supercomputer maker in Seattle, which adopted the name (sort of an HPC witness protection program thing). They're as much "Cray" as gadgets sold at Target are "Westinghouse".
SRC is a distant descendant of the 3rd corporate vehicle started by Seymour (Cray Research, Cray Computer, then SRC).
Tera also brought along many of the employees that were part of Cray Research. The Cray Inc that exists now isn't exactly CRI, but it started with the same code and same people, plus a few SGI defectors. It should be considered the spiritual successor.
I also found this article: https://www.reuters.com/article/us-usa-patents-nativeamerica...
It claims that the tribe SRC used for this is suing Microsoft and Amazon over patents related to supercomputing. It looks like a federal court can still invalidate patents even if that board cannot.
"Unlike most Azure compute resources, which are typically shared between customers, the Cray supercomputers will be dedicated resources. This suggests that Microsoft won't be offering a way of timesharing or temporarily borrowing a supercomputer. Rather, it's a way for existing supercomputer users to colocate their systems with Azure to get the lowest latency, highest bandwidth connection to Azure's computing capabilities rather than having to have them on premises."
Somewhat interestingly, this sounds like a bit like hybrid cloud except that it's hosted entirely in Azure datacenters rather than partially on-premises.
[1] https://arstechnica.com/gadgets/2017/10/cray-supercomputers-...