Assuming you mean 99.9999%: your hyperscaler isn't giving you that either. The MTBF is comparable.
It's hardware at the end of the day: the VM hypervisor isn't buying you anything on GPU instances, because GPU instances can't be live-migrated (even normal VMs are really tricky to migrate).
In a country with a decent power grid and a UPS (or if you use a colo provider), you're going to get the same availability guarantee as a cloud machine, maybe even slightly higher because there are fewer moving parts.
I think this "cloud is god" mentality betrays the fact that server hardware is actually hugely reliable once it's working; and the cloud model literally depends on this fact. The reliability of cloud is simply the reliability of hardware; they only provided an abstraction on management not on reliability.
I think people just don't realize how big computers have gotten since 2006. A t2.micro was an OK desktop computer back then. Today you can have something 1000 times as big for a few tens of thousands of dollars. You can easily run a company that serves the whole of the US out of a closet.
It's just wild to me how seemingly nobody is exploiting this.
Our industry has really lost sight of reality and the goals we're trying to achieve.
Sufficient scalability, sufficient performance, and as much developer productivity as we can manage given the other two constraints.
That is the goal, not a bunch of cargo-culty complex infra. If you can achieve it with a single machine, fucking do it.
A monolith-ish app, running on e.g. an Epyc with 192 cores and a couple TB of RAM? Are you kidding me? That is so much computing power that for a lot of scenarios it can replace giant chunks of complex cloud infrastructure.
And for something approaching a majority of businesses it can probably replace all of it.
(Yes, I know you need at least one other "big honkin server", located elsewhere, for failover. And yes, this doesn't work for all sets of requirements, etc)
I feel this every day I talk with cloud-brained coworkers.
I manage an infrastructure with tens of thousands of VMs, and everyone is obsessed with auto-scaling, clustering, and every other thing the vendor's sales department shoved down their throats, while failing to realize that they could spend <5% of what we currently do by just using the datacenter cages we _already have_ and a big fat rack of 2S 9754 1U servers.
The kicker? These VMs are never more than 8 cores apiece, and applications never scale to more than 3 or 4 in a set, with sub-40% CPU utilization each. Most arguments against cloud abuse like this get ignored because VPs see Microsoft (Azure in this case) as some holy grail for everything, and I frankly don't have it in me to keep fighting application dev teams that don't know anything about server admin.
And that's without getting into absolutely asinine price/perf SaaS offerings like Cosmos DB.
Well, the problem nowadays is that what can be done has become what must be done, totally bypassing the question of what should be done. So now a single service serving 5 million requests in a business is replaced by 20 microservices generating 150 million requests, with distributed transactions, logging (MBs of logs per request), monitoring, metrics, and so on, all leading to massive infrastructure bloat. Do that for a dozen more applications and the future is cloudy indeed.
Once management has been convinced by salespeople or consultants, any technical argument can be brushed away as not seeing the strategic big picture of managing enterprise infrastructure.
Yes. That's a risk assessment every company must make: what's the probability (and cost) of downtime versus the development slowdown and the operating costs of a fully redundant infrastructure?
I worked for a payments company (think credit cards). We designed the system to maintain very high availability in the payment flow: multi-region, multi-AZ in AWS. But all other flows, such as user registration, customer care, and even bill settlement, had to stop during the one incident when our main datacenter lost power after a switchover test. The outage lasted three hours, and it happened exactly once in five years.
In that specific case, investing in higher availability by architecting in more redundancy would not have been worth it. We had more downtime caused by bad code and poorly thought-out deployments. But that risk equation will be different for everyone.
Very few businesses live and die by their system uptime. Sure, downtime is bad, but having a recovery plan and good backups (or modest multi-site redundancy, if you're really worried) is sufficient for most.
As someone who has done a bunch of large-scale ML on hyperscaler hardware, I will say the uptime is nowhere near 99.9999%. Given a cluster of only a few hundred GPUs, one or more failures is a near certainty, to the point where we spend a bunch of time on recovery-time optimization.
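The "near certainty" falls straight out of the math. A rough sketch, assuming independent exponential failures and illustrative (not measured) numbers for per-GPU MTBF and job length:

```python
import math

# P(at least one GPU fails during a job), assuming each GPU fails
# independently with exponentially distributed time-to-failure.
# The MTBF and job-length figures below are illustrative assumptions.
def p_any_failure(n_gpus: int, job_hours: float, mtbf_hours: float) -> float:
    p_single_ok = math.exp(-job_hours / mtbf_hours)  # one GPU survives the job
    return 1 - p_single_ok ** n_gpus                 # at least one of n fails

# e.g. 512 GPUs, a two-week training run, 50k-hour MTBF per GPU:
print(f"{p_any_failure(512, 14 * 24, 50_000):.1%}")
```

Even with a very generous per-GPU MTBF, a multi-week run on a few hundred GPUs is more likely than not to see a failure, hence the investment in fast checkpoint/restart rather than hoping for six nines.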
> The reliability of cloud is simply the reliability of hardware; they only provided an abstraction on management not on reliability.
This isn't really true. I mean, it's true in the sense that you could get the same reliability on-premise given a couple of decades of engineer-hours, but the vast majority of on-premise deployments I have seen have significantly lower reliability than the clouds, and few have plans to build out those capabilities.
E.g., if I exclude public cloud operator employers, I've never worked for a company that could mimic an AZ failover on-prem, and I've worked for a couple of F500s. As far as I can recall, none of them had even segmented their network beyond giving the management plane its own hardware. The rest of the DC network was centralized; I remember one of them specifically because an STP loop took down half of it at one point.
Part of paying for the cloud is centralizing the costs of thinking up and implementing platform-level reliability features. Some of those things are enormously expensive and not really practical for smaller economies of scale.
Just one random example is tracking hardware-level points of failure and exposing them to the scheduler. E.g., if a particular datacenter has 4 mains supplies and each rack is connected to only one of them, then when I schedule 4 jobs there, the scheduler will try to put each job in a rack on a separate supply to minimize the impact of losing a mains feed. Ditto for network, storage, fire suppression, generators, etc.
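A heavily simplified sketch of that idea (the rack/feed topology and the greedy placement policy here are assumptions for illustration, not any provider's actual scheduler):

```python
# Failure-domain-aware placement sketch: each rack is tagged with the mains
# feed it hangs off; jobs are greedily placed on the rack whose feed
# currently carries the fewest jobs, so losing one feed kills the fewest jobs.
from collections import Counter

racks = {  # rack -> mains feed (hypothetical topology)
    "r1": "feedA", "r2": "feedA",
    "r3": "feedB", "r4": "feedC",
    "r5": "feedD", "r6": "feedB",
}

def place(jobs: list[str]) -> dict[str, str]:
    feed_load: Counter = Counter()
    placement: dict[str, str] = {}
    for job in jobs:
        # pick the rack whose power feed currently has the fewest jobs on it
        rack = min(racks, key=lambda r: feed_load[racks[r]])
        placement[job] = rack
        feed_load[racks[rack]] += 1
    return placement

print(place(["job1", "job2", "job3", "job4"]))
```

With four feeds and four jobs, each job lands on a rack behind a different feed; a real scheduler does the same thing across many dimensions (network, storage, cooling) at once, which is exactly the kind of bookkeeping that only pays for itself at hyperscale.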
That kind of thing makes 0 economic sense for an individual company to implement, but it starts to make a lot of sense for a company who does basically nothing other than manage hardware failures.
Some of the cloud providers don't even do live migration. They adhere to the cloud mantra of "oh well, it's up to the customer to spin up and carry on elsewhere".
I have it on good authority that some of them don't even take A+B feeds to their DC suites, and then have the chutzpah to shout at the DC provider when their only feed goes down, but that's another story... :)
For what it's worth, GCP routinely live-migrates customer VMs to free hardware for maintenance or decommissioning when hardware sensors start indicating trouble. It's standard, everyday, basic functionality by now, but only for the vendors who built the feature in from the beginning.
meta: I'm always interested in how the votes go on comments like this. I've been watching periodically, and it seems like I get "-2" at random intervals.
This is not the first time that "low yield" karma comments have sporadic changes to their votes.
It seems unlikely, at the rate of change (roughly 3-5 point changes per hour), that two people would simultaneously (within a minute) have the same desire to flag a comment, so I can only speculate that:
A) Some people's flag is worth -2
B) Some people, passionate about this topic, have multiple accounts
C) There are bots that try to remain undetected by making only small adjustments to the conversation periodically.
I'm aware that some people's jobs depend very strongly on the cloud, but nothing I said could be considered off-topic or controversial: cloud for GPU compute relies on hardware reliability just like everything else does. This is fact. Regardless, the voting behaviour on comments of mine like this one is extremely suspicious.