I think people just don't realize how big computers have gotten since 2006. A t2...

JohnBooty · on Oct 11, 2024

It's just wild to me how seemingly nobody is exploiting this.

Our industry has really lost sight of reality and the goals we're trying to achieve.

Sufficient scalability, sufficient performance, and as much developer productivity as we can manage given the other two constraints.

That is the goal, not a bunch of cargo-culty complex infra. If you can achieve it with a single machine, fucking do it.

A monolith-ish app, running on e.g. an Epyc with 192 cores and a couple TB of RAM???? Are you kidding me? That is so much computing power, to the point where for a lot of scenarios it can replace giant chunks of complex cloud infrastructure.

And for something approaching a majority of businesses it can probably replace all of it.

(Yes, I know you need at least one other "big honkin server", located elsewhere, for failover. And yes, this doesn't work for all sets of requirements, etc)

uid65534 · on Oct 11, 2024

I feel this every day I talk with cloud-brained coworkers.

I manage an infrastructure with tens of thousands of VMs, and everyone is obsessed with auto scaling and clustering and every other thing the vendor sales dept shoved down their throats, while simultaneously failing to realize that they could spend <5% of what we currently do and just use the datacenter cages we _already have_ and a big fat rack of 2S 9754 1U servers.

The kicker? These VMs are never more than 8 cores apiece, and applications never scale to more than 3 or 4 in a set, with sub-40% CPU utilization each. Most arguments against cloud abuse like this get ignored because VPs see Microsoft (Azure in this case) as some holy grail for everything, and I frankly don't have it in me to keep fighting application dev teams that don't know anything about server admin.

And that's without getting into absolutely asinine price/perf SaaS offerings like Cosmos DB.

JohnBooty · on Oct 21, 2024

I'm going to borrow the term "cloud-brained."

_3u10 · on Oct 11, 2024

Servers that big rarely fail anyway, everything is hotswap and redundant.

latchkey · on Oct 18, 2024

The GPUs/OAM,UBB are not redundant and they do fail.

From what I hear, Nvidia has an exceptionally high failure rate.

geodel · on Oct 11, 2024

Well, the problem nowadays is that what can be done has become what must be done, totally bypassing the question of what should be done. So now a single service serving 5 million requests in a business gets replaced by 20 microservices generating 150 million requests of traffic, with distributed transactions, logging (MBs of logs per request), monitoring, metrics, and so on. All of it leads to massive infrastructure bloat. Do it for a dozen more applications and the future is cloudy now.

Once management is convinced by salespeople or consultants, any technical argument can be brushed away as not seeing the strategic big picture of managing enterprise infrastructure.

dartos · on Oct 11, 2024

Well, you’d probably also want at least a CDN in each region, so like 3 closets.

jgalt212 · on Oct 11, 2024

Cloudflare caching of static resources is cheap, so back to one closet. But three if you want to be pure and totally cloudless.

felixgallo · on Oct 11, 2024

With one closet you can also lose the entire business if one water pipe breaks or one wire goes bad in drywall. Back to three closets.

llm_trw · on Oct 11, 2024

If only it were possible to make backups. Alas no such technology exists.

LoganDark · on Oct 11, 2024

Yes. That's what the other closets are for. Redundancy.

dartos · on Oct 11, 2024

Backups do not prevent downtime.

krab · on Oct 11, 2024

Yes. That's a risk assessment every company must make. What's the probability of downtime vs the development slowdown and the operating costs of a fully redundant infrastructure?

I worked for a payments company (think credit cards). We designed the system to maintain very high availability in the payment flow. Multi-region, multi-AZ in AWS. But all other flows such as user registration, customer care or even bill settlement had to stop during that one incident when our main datacenter lost power after a testing switch. The outage lasted for three hours and it happened exactly once in five years.

In that specific case, investing in higher availability by architecting in more redundancy would not have been worth it. We had more downtime caused by bad code and poorly thought-out deployments. But that risk equation will be different for everyone.
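To put a rough number on that one incident: a minimal sketch of the arithmetic, assuming 365-day years and counting only the single three-hour outage (the function name and constant are just for illustration, and the downtime from bad deploys is not included).

```python
# Rough availability estimate for the flows affected by the one outage.
# Assumption: 365-day years, and only the single 3-hour outage is counted.
HOURS_PER_YEAR = 365 * 24


def availability(downtime_hours: float, years: float) -> float:
    """Fraction of time the system was up over the given period."""
    total_hours = years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


a = availability(downtime_hours=3, years=5)
print(f"{a:.5%}")  # 99.99315%
```

That is a bit better than "four nines" for those flows, which is the kind of figure to weigh against the cost of building and operating full redundancy.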

llm_trw · on Oct 11, 2024

The 2018 server in my garage has had better uptime than AWS over the last 6 years.

packetlost · on Oct 11, 2024

Very few businesses are living and breathing by their system uptime. Sure, it's bad, but having a recovery plan and good backups (or modest multi-site redundancy, if you're really worried) is sufficient for most.

dartos · on Oct 11, 2024

Let’s call it a day and just go for a single colo.