Hacker News
Cray and Microsoft Bring Supercomputing to Azure (cray.com)
148 points by rbanffy on Oct 23, 2017 | 56 comments


An article from Ars Technica [1] on this topic makes the following point:

"Unlike most Azure compute resources, which are typically shared between customers, the Cray supercomputers will be dedicated resources. This suggests that Microsoft won't be offering a way of timesharing or temporarily borrowing a supercomputer. Rather, it's a way for existing supercomputer users to colocate their systems with Azure to get the lowest latency, highest bandwidth connection to Azure's computing capabilities rather than having to have them on premises."

Somewhat interestingly, this sounds a bit like hybrid cloud, except that it's hosted entirely in Azure datacenters rather than partially on-premises.

[1] https://arstechnica.com/gadgets/2017/10/cray-supercomputers-...


With distributed computing and cloud hosting allowing for thousands of instances with obscene amounts of resources, such as terabytes of RAM, is there still a need for supercomputers? They are, effectively, the same thing, right?

I am guessing that I'm not understanding something fully. I don't really see the benefit anymore, now that you can lease thousands of cores, petabytes of disk, and multiple terabytes of RAM.

What's the benefit? What am I missing? Google is none too helpful.


TL;DR - It does come down mainly to the network, but in far more interesting ways than is apparent from some of the answers here - and also to the nature of the HPC software ecosystem that has co-evolved with supercomputing over the last 30+ years. This community has pioneered several key ideas in large-scale computing that seem to be at risk in the world of cheap, leasable compute.

In scientific computing (usually where you see them), the primary workload is simulation/modeling of natural phenomena. The nature of this workload is that the more parallelism is available, the bigger and more fine-grained a simulation can be, and hence the better it can approximate reality (as defined by the scientific models being simulated). Examples of this are fluid dynamics, multi-particle physics, molecular dynamics, etc.

The big push with these types of workloads is to be able to get efficient parallel performance at scale - so it isn't just about the number of cores, PB of disk or TB of DRAM, but whether the software and underlying hardware work well together at scale to exploit the available aggregate compute.

So the network matters - not just raw bandwidth but things like the latency of remote memory access and the topology itself. For example, the Cray XCs going to Azure support a programming model (PGAS) that provides large, scalable global memory views, where a program can treat the total memory of a set of nodes as a single address space. Underneath, the hardware and software work together to bound latency, do adaptive per-packet routing and ensure reliability - all at the level of tens of thousands of nodes. In a real sense, the network is the (super)computer - the old Sun slogan.
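To make the PGAS idea concrete, here's a minimal sketch using OpenSHMEM, one common PGAS-style library on Cray systems (the example is illustrative, not tied to the XC specifically). One PE writes directly into another PE's memory with a single one-sided put; there is no matching receive on the other side:

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: the same object exists on every PE,
           which is what gives the "single global address space" view. */
        long *slot = shmem_malloc(sizeof(long));
        *slot = -1;
        shmem_barrier_all();

        /* PE 0 writes straight into PE 1's memory - one-sided put,
           no receive call needed on PE 1. */
        if (me == 0 && npes > 1)
            shmem_long_p(slot, 42L, 1);

        shmem_barrier_all();
        if (me == 1)
            printf("PE 1 sees %ld\n", *slot);

        shmem_free(slot);
        shmem_finalize();
        return 0;
    }

You'd typically build this with the implementation's wrapper compiler and launch it like an MPI job; the point is that the remote write is a single hardware-assisted operation rather than a two-sided message exchange.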

Where else is this useful? Well, look at deep learning - the new hotness in parallel computing these days. Everyone is realizing that it's amazing to run on GPUs, but once you have large enough problems (which the big guys do), you end up having to figure out how to efficiently get a bunch of GPUs to communicate during data-parallel training (that efficient parallelism thing). This happens to map to a relatively simple set of communication patterns (e.g. AllReduce) that is a small subset of the kinds the HPC community has solved for - so it's interesting that many deep learning engineers are starting to see the value of things like RDMA and frameworks like MPI (Baidu, Uber, MSFT and Amazon for starters).
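For reference, the AllReduce pattern is a one-liner in MPI. Here's a toy sketch (the tiny array is just a stand-in for a real gradient vector) that averages a "gradient" across ranks, roughly the pattern data-parallel training frameworks use under the hood:

    #include <stdio.h>
    #include <mpi.h>

    #define NPARAMS 4   /* tiny stand-in for a model's gradient vector */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local_grad[NPARAMS], avg_grad[NPARAMS];
        for (int i = 0; i < NPARAMS; i++)
            local_grad[i] = rank + i * 0.1;   /* pretend these came from a backward pass */

        /* Sum every rank's gradients; the result is delivered to all ranks. */
        MPI_Allreduce(local_grad, avg_grad, NPARAMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        for (int i = 0; i < NPARAMS; i++)
            avg_grad[i] /= size;              /* sum -> mean */

        if (rank == 0)
            printf("averaged gradient[0] = %f\n", avg_grad[0]);

        MPI_Finalize();
        return 0;
    }

The collective is simple to express; the hard part (and where the interconnect matters) is making it fast at thousands of ranks.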

Interestingly though, the word supercomputing is being co-opted by the very companies that you're positioning as the alternative - the Google TPU Cloud is a specialized incarnation of a typical supercomputing architecture. Sundar Pichai refers to Google as being a 'supercomputer in your pocket'.


Alas, HN only allows me to vote your comment up once.

I really love the detailed and expansive responses that some questions generate. Hopefully, they are of interest to more than just me. I've been retired since 2007, so I'm living vicariously through you folks - as I absolutely don't have to make these choices anymore.

A part of me wants to take on some big project, just to get back into it. I actually miss working. Go figure?

Thanks!


Personally I think of Google as being Orwell's 1984 'telescreen in your pocket'.


Do you get MPI ping-pong latencies on the order of a microsecond on a "normal" public cloud?

No? Well, MPI applications that are sensitive to latency are one use case where a "real" supercomputer can be useful.
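For anyone wondering what "MPI ping-pong" means in practice, here's a rough two-rank microbenchmark sketch. On a good supercomputer interconnect the reported one-way latency is on the order of a microsecond; on a typical cloud VM network it is usually tens of microseconds or more:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        char byte = 0;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* round trip divided by two = one-way latency */
            printf("one-way latency: %.2f us\n", 1e6 * (t1 - t0) / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }

Run it with the two ranks placed on different nodes, otherwise you're measuring shared memory rather than the network.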


Ah! This is even better. It led me to finding this:

http://icl.cs.utk.edu/hpcc/hpcc_results_lat_band.cgi

I do believe I get it. Now to find out what kind of applications are greatly benefited from this.

Thanks HN! You always make me dig in and learn new things!

Edit: It was "MPI latency" that led me to that result, by the way.


> Now to find out what kind of applications are greatly benefited from this.

The common applications are usually scientific computing; here's a quick overview[1] of the kinds of algorithms run on supercomputers: PDEs, ODEs, FFTs, sparse and dense linear algebra, etc.

These are usually used for scientific applications like weather forecasting, where you need the result in time (i.e. before the hurricane reaches the coast!) - a toy sketch of the PDE case follows below.

[1] https://www.nap.edu/read/11148/chapter/7#125
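As a concrete (and heavily simplified) illustration of those PDE workloads: a 1D heat-equation stencil, where each MPI rank owns a slice of the grid and swaps boundary ("ghost") cells with its neighbours every time step. All sizes and step counts here are made up; the per-step exchange is where interconnect latency shows up directly in the runtime:

    #include <mpi.h>

    #define LOCAL_N 1000   /* interior cells per rank, plus 2 ghost cells */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        double u[LOCAL_N + 2] = {0}, unew[LOCAL_N + 2] = {0};
        if (rank == 0) u[1] = 100.0;   /* a heat source at one end */

        for (int step = 0; step < 1000; step++) {
            /* Exchange ghost cells with neighbours (the latency-sensitive part). */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Explicit finite-difference update on the interior cells. */
            for (int i = 1; i <= LOCAL_N; i++)
                unew[i] = u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            for (int i = 1; i <= LOCAL_N; i++)
                u[i] = unew[i];
        }
        MPI_Finalize();
        return 0;
    }

Real weather codes are 3D, implicit, and far more complex, but the communicate-then-compute rhythm every time step is the same, which is why they care so much about the network.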


I'll scrape the whole book and read it. Thanks! I know weather models still run on supercomputers, but I understood they currently have plenty.

I look forward to reading the book.


Public cloud, maybe not that far away. LinkedIn's new data centers have sub-400ns switching and 100G interconnects. https://engineering.linkedin.com/blog/2016/03/project-altair...


Yes, the big hyperscalers are adopting Clos networks, as used in HPC for decades. 400ns switching, per se, is not that bad (IIRC 100G IB switches have around 100ns switching cost). Though I wonder, since they are doing ECMP at the L3 level, what the latency is when crossing multiple switches (for comparison, IB does ECMP at the L2 level)? Also, you need software to take advantage of this as well. MPI stacks tend to use RDMA with user-space networking. For Ethernet, there is RoCE v1/v2, which is comparable to IB RDMA, though I'm not sure how mature that is at this point.

Certainly it's true that Ethernet has stepped up its game in recent years, while at the same time IB has more or less become a single-vendor play. So it'll be interesting to see what the future holds.

Further, proprietary supercomputer networks have things like adaptive non-minimal routing, which enables efficient use of network topologies that are more affordable at scale, such as flattened butterfly, dragonfly etc. AFAIK neither IB nor ETH + IP-level ECMP support anything like that.


For special topologies, one can put multiple IB NICs in a single host. Nvidia has support for doing DMA transfers directly into GPU memory from third-party PCIe devices [0].

[0] http://docs.nvidia.com/cuda/gpudirect-rdma/index.html


Supercomputers do optimize for different things, e.g. they have faster and lower latency network interconnects. But over time the differentiation will diminish as the public cloud providers invest more.

You can't beat economics, and the public cloud market will grow to be much larger than the supercomputing market. This is similar to why supercomputers switched from bespoke processors to commodity x86.

AWS already has several instance types with 25Gbit ethernet, for instance: http://www.ec2instances.info/?cost_duration=monthly&reserved...


they have faster and lower latency network interconnects

It will not be possible to replicate Aries - which is itself a moving target - for general workloads and still be competitive on price.


Just wanted to point out, Azure has several VM instances with 30Gbps ethernet that were recently announced: https://azure.microsoft.com/en-us/blog/azure-networking-anno...

These include D64v3, Ds64v3, E64v3, Es64v3, and M128ms VMs.


Parallel processing is about network latency and bandwidth for the types of algorithms that don't divide easily into small computable/parallelizable bits. For those tasks, supercomputer buses are unmatched.


That makes some sense, so I'll just post my thanks in this one reply so I don't have to thank everyone individually.

Much appreciated.

It does lead me to one additional question - is the need for additional speed great enough to justify this? I don't know how much faster it would be, and Google was not even remotely helpful. I may just be using the wrong query phrases.


> is the need for additional speed great enough to justify this?

Yep. I did molecular dynamics simulations on the Titan supercomputer, and also tried some on Azure's cloud platform (using Infiniband). The results weren't even close.


When you say the results weren't even close, and if you have time and don't mind, could you share some numbers/elaborate on that?

My experience with HPC is fairly limited, compared to what I think you're discussing. In my case, it was things like blade servers which was a cost decision. We also didn't have the kind of connectivity and speeds that you have available today.

(I modeled traffic at rather deep/precise levels.)

So, if you have some experience numbers AND you have the free time, I'd love to learn more about the differences in the results? Were the benefits strictly time? If so, how much time are we talking about? If you had to personally pay the difference, which one would you select?

Thanks for giving me some of your time. I absolutely appreciate it.


The problem with molecular modeling / protein folding algorithms is that each molecule interacts with every other molecule, so you have an n^2 problem. Sure, heuristics get this way down to n log n, but what fundamentally slows it down is not the growth in computation but the growth in the data being pushed around: each node needs the data from every node's last time step. For these problems, doubling the nodes of a cluster might not even noticeably improve the speed of the algorithm. When I was helping people run these, they were looking at a few milliseconds of simulated time for a small number of molecules, which took a few weeks of supercomputer time. So lots and lots of computing generations were/are needed before we get anywhere close to what they want to model.
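A toy sketch of that communication pattern, assuming a naive all-pairs force calculation with no cutoffs (names, sizes, and the 1D "force" are made up for illustration): every rank needs every other rank's positions each step, so each iteration starts with an all-gather whose volume grows with the machine - which is exactly why adding nodes stops helping:

    #include <stdlib.h>
    #include <mpi.h>

    #define N_LOCAL 1000          /* particles owned by this rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n_total = N_LOCAL * size;
        double local_x[N_LOCAL];                          /* this rank's positions (1D toy) */
        double *all_x = malloc(n_total * sizeof(double)); /* everyone's positions */
        double force[N_LOCAL];
        for (int i = 0; i < N_LOCAL; i++)
            local_x[i] = rank * N_LOCAL + i;

        for (int step = 0; step < 100; step++) {
            /* Everyone needs everyone's positions from the last step. */
            MPI_Allgather(local_x, N_LOCAL, MPI_DOUBLE,
                          all_x, N_LOCAL, MPI_DOUBLE, MPI_COMM_WORLD);

            /* O(n^2) pairwise interaction (toy 1/r^2 force, no cutoffs). */
            for (int i = 0; i < N_LOCAL; i++) {
                force[i] = 0.0;
                double xi = local_x[i];
                for (int j = 0; j < n_total; j++) {
                    double d = xi - all_x[j];
                    if (d != 0.0) force[i] += (d > 0 ? 1.0 : -1.0) / (d * d);
                }
            }
            /* ...integrate positions using force[] here... */
        }
        free(all_x);
        MPI_Finalize();
        return 0;
    }

The compute loop parallelizes nicely; the all-gather (or its cleverer real-world equivalents) is what ties the speedup to the interconnect.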


seriously.. take a 1/2 second to think.

(compute-chunk-time * latency * nchunks * ncomms) / n-nodes

obviously this oversimplifies things, but generally as an approximation, there you go.

then merge this in with your cost/time equation, and make the call.


seriously.. take a 1/2 second to think.

Don't do this on HN please.


I wonder why they thought I actually wanted the oversimplified stuff? I thought I'd made it clear that I wanted the technical details from their experience. Ah well... Your response is better than mine would have been.


And Azure's MPI/IB support is the best of all the cloud providers.


Mainly network, as others have said. But there are also scientific applications that are more appropriate for distributed computing (don't need the fast interconnect), but get run on supercomputers anyway because of data locality (post-processing/analysis) or grant structures.

The cool thing here (hopefully) is that it makes it easy to have both: HPC/supercomputer for jobs that need that and cheaper/easier cloud resources for jobs that don't.


Fast, low-latency interconnects, and scaled-up versions of the resources (e.g. lots of GPUs per box) to minimise communication overhead.


I think the key is in your parent, with the focus on increasing overall performance through decreasing latency (as opposed to increasing, say, parallelism).


Internode latency, when you have one big problem and communication is the bottleneck.


They could still be shared on a job-by-job basis. For jobs that need the ridiculously fast interconnects (large datasets to be worked on by thousands of cores) you go Cray; when they don't, you go commodity Azure.

Think of it as a rack-sized GPU.


It's interesting to think about which tasks require both a supercomputer and normal systems, and for them to be tightly connected. Unless, of course, it's consultant BS/upsell trying to cover for or work around performance issues elsewhere in the stack.


This sounds like the sort of thing that was developed for one specific high-value customer. I wonder who it was.


It seems like it might be a good fit for US national labs -- when a job doesn't involve ITAR or anything classified.


Azure Government has datacenters that can support ITAR and DOD workloads. A Cray system could be deployed in one of those datacenters.


Wonder what's left of old Cray besides the name?

Book tip: https://www.amazon.com/Supermen-Seymour-Technical-Wizards-Su...


Not a lot. They sold off their interconnect technology which was their last real secret sauce. Now they really just sell turnkey x86 HPC clusters with some mainframe-esque job dispatch software.


Even though Cray sold the rights to Aries (the interconnect) to Intel, they still sell x86 Aries-based systems - XC30, XC40, XC50. They also sell more generic HPC clusters, originally from Appro, which they bought in 2012.


...turnkey HPC clusters with liquid cooling, DC power distribution, extraordinarily dense compute blades, custom HPC network fabrics, and embedded management processors throughout to gather monitoring data.


It's a shame you can't just get a cable to link computers via PCIe (the spec supports cables over 3m [0]). Multiple x16 connections would give decent bandwidth while being low-latency, without the network overhead or shelling out for high-end switches.

You can skip halfway. [0]https://www.youtube.com/watch?v=q5xvwPa3r7M


Using dual-port adapters, you can do a three-node InfiniBand ring with no switch involved and get full speed. Dual-port FDR (56 Gbit/s) cards are about $150 on eBay nowadays. If you've bought the hardware to warrant needing such a setup (i.e. 3 dual-socket Xeon servers), that's a negligible cost.


PCIe fabrics exist but they're not as great as you might expect. Because of the slow 8 Gbps lane speed the cables are enormous and the switches support very few ports. And the DMA engine may not give you the performance you expect if it doesn't have NIC-like features.

https://www.broadcom.com/applications/datacenter-networking/...


Cray has produced (and still produces?) some compilers for HPC applications that are considered pretty decent. At the very least, some very large legacy codes got caught up in vendor lock-in - in software, that usually means non-portable preprocessor directives, for example.


I think I clicked through to the comments because I was wondering exactly this. Thanks for the rec.


Chapel, the programming language, really is from Cray, in the spirit of classical Cray.


Could be interesting for continuous integration and testing new Cray setups.

I usually have to spend a week or so adapting our builds once we get a system upgrade. It's mostly hacking around weird Cray setups, and because we dare to link C++ and Fortran code bases on GPU.


How much?


If you need to ask... you definitely can’t afford it.


I'd be very surprised if any company purchased this without asking how much. That would be ridiculous.


And affordability has what relevance to the question?


s'ok, I got it ;)


Bravo ryanwaite!


SRC Labs was started by Seymour Cray just before he died and they have started transferring patents to a Native American tribe to avoid review by the patent office's review board.

That kind of behavior really hurts the Cray name in my eyes. Microsoft should find a better partner.

https://www.reuters.com/article/us-allergan-patents/tech-ent...


SRC isn't Cray at all though.

Cray Research, from which Seymour departed long before, was acquired by SGI. When SGI cratered, the SGI and Cray _names_ were separately sold off. "Cray" was bought by Tera, a faltering supercomputer maker in Seattle, who adopted that name (sort of a HPC witness protection program thing). They're as much "Cray" as gadgets sold at Target are "Westinghouse".

SRC is a distant descendant of the 3rd corporate vehicle started by Seymour (Cray Research, Cray Computer, then SRC).


Tera also brought along many of the employees that were part of Cray Research. The Cray Inc that exists now isn't exactly CRI, but it started with the same code and same people, plus a few SGI defectors. It should be considered the spiritual successor.


I also found this article: https://www.reuters.com/article/us-usa-patents-nativeamerica... It claims that the tribe SRC used for this is suing Microsoft and Amazon over patents related to supercomputing. It looks like a federal court can still invalidate patents even if that board cannot.


That article is about the FDA and a melanoma drug. I don't think it has anything to do with Cray or patents.

Possibly mis-copied link?


Yep. Thanks for the heads up. I fixed it.



