
„Open Access APIs are like a subway. You use them to capture a market and then you get out.“

— Erdogan, probably.


G is tame. Wait until you hear of Databricks’ Series K…

https://www.thesaasnews.com/news/databricks-raises-1b-series...


A few years back (OK, maybe almost ten now) a recruiter reached out to me about a role at a "series G" company like it was a selling point. I was just kinda like: OK, maybe that's signaling it's relatively stable and can raise money, but at the same time, that's a lot of rounds of preferences ensuring unprivileged shareholders get nothing, and most of the hockey-stick growth is already tapped out.

This was in the middle of the boom when companies were fighting over talent, so I found it odd.


One. Trillion. Even on native int4 that’s… half a terabyte of vram?!

Technical awe aside at this marvel that cracks the 50th percentile of HLE, the snarky part of me says there's only half the danger in giving away something nobody can run at home anyway…


The model absolutely can be run at home. There even is a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/

The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).

The next step up is an old server with lots of RAM and many memory channels, with maybe a GPU thrown in for faster prompt processing (low double-digit tokens/second).

At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.

The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16GB at 4 bit per parameter. This only leaves the question of how to get those 16GB to the processor as fast as possible.
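A back-of-envelope sketch of that last point: at batch size 1, every generated token has to read the ~16GB of active weights once, so memory bandwidth sets a hard ceiling on tokens/second. The bandwidth figures below are rough assumptions for illustration, not benchmarks.

    def tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
        # Decode speed ceiling: bandwidth divided by the bytes of active weights
        active_gb = active_params_billion * bits_per_weight / 8   # 32B at 4 bit ~= 16 GB
        return bandwidth_gb_s / active_gb

    for name, bw in [("NVMe SSD (~7 GB/s)", 7),
                     ("8-channel DDR4 server (~200 GB/s)", 200),
                     ("high-end GPU VRAM (~1800 GB/s)", 1800)]:
        print(f"{name}: ~{tokens_per_second(bw, 32, 4):.1f} tok/s upper bound")

Which roughly matches the tiers above: well under one token per second from an SSD, low double digits from a many-channel RAM server, and comfortable speeds once the active weights sit in VRAM.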


It's often pointed out in the first sentence of a comment how a model can be run at home; then (maybe) towards the end of the comment it's mentioned how it's quantized.

Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible.

The degree of quality loss is not often characterized. Which makes sense because it’s not easy to fully quantify quality loss with a few simple benchmarks.

By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?


> ...Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible. ...

int4 quantization is the original release in this case; it's not been quantized after the fact. It's a bit of a nuisance when running on hardware that doesn't natively support the format (might waste some fraction of memory throughput on padding, specifically on NPU hw that can't do the unpacking on its own) but no one here is reducing quality to make the model fit.


Good point, thanks for the clarification.

The broader point remains, though, which is: "you can run this model at home…" when actually the caveats are potentially substantial.

It would be so incredibly slow…


From my own usage, the former is almost always better than the latter. Because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that seem still drunk.

Any model that I can run in 128 GB at full precision is far inferior to the models that I can just barely get to run after REAP + quantization for actually useful work.

I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to force the model toward a smoother loss landscape. It made me wonder if something similar might work for LLMs, which I think might be what the people over at MiniMax are doing with M2.1, since they released it in FP8.

In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.


Interesting.

If this were the case however, why would labs go through the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?


You can't quantize a 1T model down to "flash" model speed/token price. 4 bpw is about the limit of reasonable quantization, so a 2-4x (fp8/fp16 -> 4 bpw) weight size reduction. Easier to serve, sure, but maybe not cheap enough to offer as a free tier.

With distillation you're training a new model, so its size is arbitrary: say 1T -> 20B, a 50x reduction, and the result can also be quantized. AFAIK distillation is also simply faster/cheaper than training from scratch.
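Putting rough numbers on that (the 20B student size is just an illustrative assumption):

    def weights_gb(params_billion, bits_per_weight):
        # params (in billions) * bits / 8 gives the weight file size in GB
        return params_billion * bits_per_weight / 8

    print(weights_gb(1000, 16))   # ~2000 GB: a 1T-parameter model served at fp16
    print(weights_gb(1000, 4))    # ~500 GB: the same model at 4 bpw, only a 4x saving
    print(weights_gb(20, 4))      # ~10 GB: a hypothetical 20B distilled student at 4 bpw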


Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity."

Yes, I'm calling labs that don't distill smaller sized models stupid for not doing so.


Didn't this paper demonstrate that you only need 1.58 bits to be equivalent to 16 bits in performance?

https://arxiv.org/abs/2402.17764


This technique showed that there are ways, during training, to optimize the weights so that they quantize neatly while remaining performant. This isn't a post-training quantization like int4.


For Kimi, quantization is part of the training too. Specifically, they say they use QAT (quantization-aware training).

That doesn't mean training with all-integer math; rather, certain tricks are used to plan for the final weight size. I.e. fake quantization nodes are inserted to simulate int4.
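A minimal sketch of the fake-quantization trick (symmetric per-tensor int4 with a straight-through estimator, in PyTorch; this shows the general idea, not Kimi's actual training code):

    import torch

    def fake_quant_int4(w):
        # Simulate int4 on the forward pass: scale onto the [-8, 7] integer grid,
        # round, then dequantize back to float.
        scale = w.abs().max() / 7
        w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
        # Straight-through estimator: use the quantized values in the forward pass,
        # but let gradients flow as if no rounding had happened.
        return w + (w_q - w).detach()

    w = torch.randn(16, 16, requires_grad=True)
    y = (fake_quant_int4(w) @ torch.randn(16, 4)).sum()
    y.backward()                # gradients reach w despite the rounding
    print(w.grad is not None)   # True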


IIRC the paper was solid, but it still hasn't been adopted/proven out at large scale. It's harder to adapt hardware and kernels to something like this compared to int4.


just call it one trit


The level of deceit you're describing is kind of ridiculous. Anybody talking about their specific setup is going to be happy to tell you the model and quant they're running and the speeds they're getting, and if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around.


> if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around

Fwiw, not necessarily. I've noticed quantized models have strange and surprising failure modes, where everything seems to be working well and then it death-spirals, repeating a specific word, or completely fails on one task out of a handful of similar tasks.

8-bit vs 4-bit can be almost imperceptible or night and day.

This isn't something you'd necessarily see just playing around, but rather when trying to do something specific.


Except the parent comment said you can stream the weights from an SSD. The full weights, uncompressed. It takes a little longer (a lot longer), but the model at least works without lossy pre-processing.


> The model absolutely can be run at home. There even is a big community around running large models locally

IMO 1T parameters and 32B active is a different scale from what most people are talking about when they say local LLMs. Totally agree there will be people messing with this, but the real value in local LLMs is that you can actually use them and get value from them with standard consumer hardware. I don't think that's really possible with this model.


Local LLMs are just LLMs people run locally. It's not a definition of size, feature set, or what's most popular. What the "real" value is for local LLMs will depend on each person you ask. The person who runs small local LLMs will tell you the real value is in small models, the person who runs large local LLMs will tell you it's large ones, those who use cloud will say the value is in shared compute, and those who don't like AI will say there is no value in any.

LLMs whose weights aren't available are an example of something that isn't a local LLM; a model merely being large isn't.


> LLMs which the weights aren't available are an example of when it's not local LLMs, not when the model happens to be large.

I agree. My point was that most aren't thinking of models this large when they're talking about local LLMs. That's what I said, right? This is supported by the download counts on HF: the most downloaded local models are significantly smaller than 1T, normally 1-12B.

I'm not sure I understand what point you're trying to make here?


Mostly a "we know local LLMs as being this, and all of the mentioned variants can provide real value regardless of which is most commonly referenced" point. I.e. large local LLMs aren't only something people mess with; they often provide a lot of value for relatively few people, rather than a little value for relatively many people, as small local LLMs do. Which modality and type brings the most value is largely a matter of the opinion of the user getting the value, not just whichever option runs on consumer hardware.

You're of course accurate that smaller LLMs are more commonly deployed, it's just not the part I was really responding to.


32B active is nothing special, there's local setups that will easily support that. 1T total parameters ultimately requires keeping the bulk of them on SSD. This need not be an issue if there's enough locality in expert choice for any given workload; the "hot" experts will simply be cached in available spare RAM.


When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).

Over any task that has enough prefill input diversity and a decode phase that's more than a few tokens, it's at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why, when you do something more than bs=1, you see forward passes light up the whole network.


> hitting the SSD constantly to swap layers

Thing is, people in the local llm community are already doing that to run the largest MoE models, using mmap such that spare-RAM-as-cache is managed automatically by the OS. It's a drag on performance to be sure but still somewhat usable, if you're willing to wait for results. And it unlocks these larger models on what's effectively semi-pro if not true consumer hardware. On the enterprise side, high bandwidth NAND Flash is just around the corner and perfectly suited for storing these large read-only model parameters (no wear and tear issues with the NAND storage) while preserving RAM-like throughput.
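A toy illustration of that pattern in Python (the filename is a placeholder; as far as I know, llama.cpp does essentially this internally when it memory-maps GGUF files):

    import mmap

    # Map a (hypothetical) weights file without reading it into RAM up front.
    # Pages are faulted in on first access and kept in the OS page cache, so the
    # "hot" regions effectively live in spare RAM while cold ones stay on the SSD.
    with open("model-weights.gguf", "rb") as f:        # placeholder path
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = mm[0:16 * 1024 * 1024]                 # touching ~16 MB faults those pages in
        print(len(chunk))
        mm.close()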


I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.

I can run Minimax 2.1 in 5bpw at 200k context fully offloaded to GPU. The 30-40 tk/s feels like a lifetime for long horizon tasks, especially with subagent delegation etc, but it's still fast enough to be a daily driver.

But that's more or less my cutoff. Whenever I've tested other setups that dip into the single and sub-single digit throughput rates, it becomes maddening and entirely unusable (for me).


What is bpw?


Bits per weight; it's an average precision across all the weights. When you quantize these models, they don't just use a fixed precision across all model layers/weights. There's a mix, and it varies per quant method. This is why you can get bit precisions that aren't "real" in a strict computing sense.

e.g. a 4-bit quant can have half the attention and feed-forward tensors in Q6, and the rest in Q4. Due to how block scaling works, those k-quant dtypes (specifically for llama.cpp/GGUF) have a larger bpw than their names suggest: Q4 is around ~4.5 bpw, and Q6 is ~6.5.
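As a toy example of the weighted average (the split and the per-tensor bpw figures below are made up purely to show the arithmetic):

    tensors = [
        # (params, effective bpw incl. block-scale overhead) -- illustrative numbers
        (6e9, 6.5),    # attention + some feed-forward tensors kept around Q6
        (26e9, 4.5),   # the remaining tensors around Q4
    ]
    total_bits = sum(n * b for n, b in tensors)
    total_params = sum(n for n, _ in tensors)
    print(f"effective bpw: {total_bits / total_params:.2f}")   # ~4.88 for this mix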


I never said it was special.

I was trying to push back on the idea that a lot of people will be running models of this size locally just because the local LLM community exists.

The most commonly downloaded local LLMs are normally <30b (e.g. https://huggingface.co/unsloth/models?sort=downloads). The things you're saying, especially when combined together, make it not usable by a lot of people in the local LLM community at the moment.


do you guys understand that different experts are loaded PER TOKEN?


You can run AI models on unified/shared memory specifically on Windows, not Linux (unfortunately). It uses the same memory-sharing system Microsoft originally built for gaming, for when a game runs out of VRAM. If you:

- have an i5 or better or equivalent manufactured within the last 5-7 years

- have an nvidia consumer gaming GPU (RTX 3000 series or better) with at least 8 GB vram

- have at least 32 GB system ram (tested with DDR4 on my end)

- build llama-cpp yourself with every compiler optimization flag possible

- pair it with a MoE model compatible with your unified memory amount

- and configure MoE offload to the CPU to reduce memory pressure on the GPU

then you can honestly get to about 85-90% of cloud AI capability totally on-device, depending on what program you use to interface with the model.

And here's the shocking idea: those system specs can be met by an off the shelf gaming computer from, for example, Best Buy or Costco today and right now. You can literally buy a CyberPower or iBuyPower model, again for example, download the source, run the compilation, and have that level of AI inference available to you.

Now, the reason why it won't work on Linux is that the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement. Which Nvidia hasn't done yet. You can code it somewhat into source code, but it's still super unstable and flaky from what I've read.

(In fact, that lack of unified memory tech on Linux is probably why everyone feels the need to build all these data centers everywhere.)


> Now, the reason why it won't work on Linux is that the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement. Which Nvidia hasn't done yet. You can code it somewhat into source code, but it's still super unstable and flaky from what I've read.

So it should work with an AMD GPU?


> the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement

Depends on whether AMD (or Intel, since Arc drivers are supposedly OSS as well) took the time to implement that. Or whether a Linux-based OS/distro implements a Linux equivalent of the Windows Display Driver Model (which needs code outside of the kernel, specific to the OS/distro).

So far, though, it seems like people are more interested in pointing fingers and sucking up the water of small town America than actually building efficient AI/graphics tech.


>The model absolutely can be run at home.

There is a huge difference between "look I got it to answer the prompt: '1+1='"

and actually using it for anything of value.

I remember early on when people bought Macs (or some marketing team was shoveling it) and proposed that people could reasonably run the 70B+ models on them.

They were talking about 'look it gave an answer', not 'look this is useful'.

While it was a bit obvious that 'integrated GPU' is not Nvidia VRAM, we did have 1 mac laptop at work that validated this.

It's cool these models are out in the open, but it's going to be a decade before people are running them at a useful level locally.


Hear, hear. Even if the model fits, a few tokens per second make no sense. Time is money too.


If I can start an agent and be able to walk away for 8 hours, and be confident it's 'smart' enough to complete a task unattended, that's still useful.

At 3 tk/s, that's still 100-150 pages of a book, give or take.


True, that's still faster than a human, but they're not nearly that reliable yet.


Maybe for a coding agent, but a daily/weekly report on sensitive info?

If it were 2016 and this technology existed but only in 1 t/s, every company would find a way to extract the most leverage out of it.


If they figured out it can be this useful in 2016 running 1 t/s, they would make it run at least 20 t/s by 2019


But it's 2026 and 'secure' (by executive standards) hosted options exist.


> 'secure' (by executive standards)

"Secure" in the sense that they can sue someone after the fact, instead of preventing data from leaking in the first place.


I'd take "running at home" to mean running on reasonably available consumer hardware, which your setup is not. You can obviously build custom, but who's actually going to do that? OP's point is valid


How do you split the model between multiple GPUs?


With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000.

But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo...
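For the vLLM route, a minimal sketch looks something like this (the model name is a placeholder and 8 GPUs is just an example count; tensor_parallel_size shards each layer's weights across the GPUs):

    from vllm import LLM, SamplingParams

    # Shard the model across 8 GPUs via tensor parallelism.
    # The model identifier below is a placeholder for whatever checkpoint you serve.
    llm = LLM(model="some-org/some-large-moe-model", tensor_parallel_size=8)

    outputs = llm.generate(["Explain mixture-of-experts in one sentence."],
                           SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)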


Which conveniently fits on one 8xH100 machine. With 100-200 GB left over for overhead, kv-cache, etc.


The unit economics seem pretty rough though. You're locking up 8xH100s for the compute of ~32B active parameters. I guess memory is the bottleneck but hard to see how the margins work on that.


Yes, it only makes sense economically if you have batching over many users.


VRAM is the new moat, and controlling pricing and access to VRAM is part of it. There will be very few hobbyists who can run models of this size. I appreciate the spirit of making the weights open, but realistically, it is impractical for >99.999% of users to run locally.


I run Kimi K2 at home, most of it in system RAM with a few layers offloaded to old 3090s. This is a cheap budget build.

Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf: generation of 5,231 tokens in 604.63 s (8.65 tokens/s)


Could I trouble you for the specifics of your build? I'd love to see if it would be a viable upgrade for me.

I currently have a 3970x with a bunch of 3090s.


4x 3090s, an Epyc motherboard with 8-channel memory, a 7352 CPU, and slow 2400 MHz DDR4 RAM.


That's what intelligence takes. Most of intelligence is just compute.


3,998.99 for 500 GB of RAM on Amazon.

"Good Luck" - Kimi <Taken voice>


There's NOTHING I hate quite the way I hate how Claude jovially and endlessly raves about the 9/10 tasks it "succeeded" at after making them up, while conveniently forgetting to mention that it completely and utterly failed at the main task I asked it to do.


That reminds me of the West Wing scene s2e12 "The Drop In" between Leo McGarry (White House Chief of Staff) and President Bartlet discussing a missile defense test:

LEO [hands him some papers] I really think you should know...

BARTLET Yes?

LEO That nine out of ten criterion that the DOD lays down for success in these tests were met.

BARTLET The tenth being?

LEO They missed the target.

BARTLET [with sarcasm] Damn!

LEO Sir!

BARTLET So close.

LEO Mr. President.

BARTLET That tenth one! See, if there were just nine...


An old adage comes to mind: if you want something done the way you like, do it yourself.


But it's a tool? Would you suggest driving a nail in by hand if someone complained about a faulty hammer?


AI is not a hammer. It's a thing you stick to the wall and push a button, and it drives tons of nails into the wall the way you wanted.

A better analogy would be a robot vacuum which does a lousy job.

In either case, I'd recommend using a more manual method: a manual or air hammer, or a hand-driven wet/dry vacuum.


Meanwhile, core functionality like “Find My” is completely and utterly broken. Leaving my stuff behind at a new place gets me at least two different messages on my Apple Watch, at different times: one for my devices, one for my Apple tags. One only has a “dismiss” button; the other has a “trust location” button that, when I click it, says “content unavailable”, and if it works (which is only over Wi-Fi!), then it only works for that one device. I always need to go through the Find My app at every new place, since it’s an absolute UX disaster.

That’s what I get for carrying only Apple gear in the thousands of euros with me.


You should try it if you're only a step into their ecosystem. Kiddo got some AirPods for use with his Android phone. We registered them with 'Find My' on an iPhone SE I have for work that sits on my desk, so that when they get lost we have a chance of finding them. Now we get alerted to potential trackers every time we travel with him... even he gets the alerts, because he travels with his AirPods frequently.

I've heard good things about Apple TV devices, but given what a pain AirPods are without the rest of the ecosystem, there's no way I'm going to try one.


I just don't understand how more people don't complain about this all the time. My wife has an apple tracker on her keys and I get a notification about an unknown tracker travelling with me on my phone every time we go anywhere. Why isn't there an "I know this tracker, leave me the fuck alone" option anywhere???


Yeah, seems like we should be able to acknowledge known trackers, or something. Or be able to enroll multiple phones or other devices (of multiple OSes) as owners, so if the keys move with your phone instead of hers, that counts as 'not being separated from the owner', rather than throwing another BS alert.


> I just don't understand how more people don't complain about this all the time.

I'd assume because they're using the share feature: https://support.apple.com/guide/iphone/share-an-airtag-iph41...

You can also turn them off altogether in Settings > Notifications > Tracking Notifications.


I should have mentioned - I'm on android, she's an iPhone user. Afaik there is no way to do this without also having an iPhone.


That shouldn't be happening.

The way it should work, is that if the keys are nearby the user that owns them, the tracker shouldn't be giving unknown tracker notifications to other nearby users.


> nearby the user

You mean nearby the user's phone.. nowadays the 2 are almost interchangeable..

Maybe the wife doesn't take her phone when they go somewhere (yeah, implausible, but, maybe). Or she has several phones..


Talking about core functionality: the iOS `UISlider` API is broken in iOS 26. A lot of my users have emailed me in the last few days about how the font-sizing menu in my Hacker News app is broken because the sliders don't do anything. Turns out it's a bug many developers are facing.

This bug somehow went through the beta releases and still exists:

https://developer.apple.com/forums/thread/797468?login=true


I agree, but the German one is also pretty dodgy: basically, you pay the pensions of the old generation from a share of what the working one earns. It's already coming apart due to demographic change, and the next few years will be disastrous.

If there's an approach to model imho it's the Norwegian one: Actually backed by stocks, but managed and distributed by a central investment fund. It's far easier if the country is smart enough to centralize oil profits as well though...


> If there's an approach to model imho it's the Norwegian one

You just need to have a tiny population and be the 3rd biggest gas and 5th biggest oil exporter, easy peasy


Not to forget the huge amount of hydro power relative to the population...


The pension system still works, as long as the pensioners accept that at some point there will be less money to divide between them. Demographic changes don't necessarily cause the pension system to collapse as long as the pensions are adjusted to the income generated by the working generations.

That's the part that causes friction, because (soon to be) pensioners don't want to accept lower pensions, and they still have voting rights, voting in politicians that cater to pensioners over workers.


> pay the rents of the old generation from a share of the working one.

That's what Norway does too, just less directly. The $1.75T that Norway has in their sovereign wealth fund is just a claim on future output. Germany's taxes are a claim on future output.

Or to look at it at a micro perspective. Suppose you have $10M saved for retirement, but need to hire a personal care nurse. But there aren't any and you get in a bidding war with someone who has saved $100M.


The failure in most of these systems was that the original contributions were never high enough to actually cover the future outgoings, whatever those boomers who think they paid enough may say...


More fundamentally the ratio of workers to retirees has shifted. No matter how you fund pensions this will create challenges.


When these schemes were implemented, their goal was to make sure people who could no longer work would have a decent end of life.

The problem is that the baby boomer generation transformed it to the point where retirement became "when you start living your real life". For that generation, being retired and perfectly healthy for longer than they worked is not the exception.

And that generation managed to pile more shit onto the next ones: regulations that make housing more expensive, requiring more education for every job (meaning people start working later), outsourcing as much as possible. And not having replacement-level numbers of children.

So now people have to pay a lot more for those privileged generations, meaning they have less money available to build a stable situation, meaning even fewer children, so the problem gets worse and worse. At least one generation will have to be sacrificed to give the ones after it a chance. Will the Millennials accept being it, or will they pull a boomer?


Yeah, like the other model, where you pay people to invest money for you, which makes everything more expensive (housing, healthcare) and shittier (jobs, concerts, games), or makes it disappear, all for the sake of ROI that you don't even get the majority of.

PAYGO-style is definitely the more sustainable approach.


So how evolution works is that a feature needs to have an evolutionary advantage, but the specimen must also not die. So there are two adversarial pressures here, carefully balancing each other in a mammal species that already has one of the highest birth mortality rates of both mother and child. If heads were any larger, it would create a proportional amount of negative evolutionary pressure by both direct and indirect death (of the mother) at birth.

Interestingly, there seem to be some indications showing that human interventions by modern technology already show clear evolutionary trends: https://pmc.ncbi.nlm.nih.gov/articles/PMC5338417/

Humans might eventually evolve to the point of not even being able to be born naturally anymore.


That's a fascinating thought. As people with larger brains are more successful in life and more likely to have children*, mortality rates for natural births would increase, and over time we would evolve to become dependent upon modern technology.

The continued existence of our species would become dependent upon continued civilisation. A dark age could kill us, or at least cripple the population.

*how true is this? Uni-educated people tend to have lower fertility rates.


> As people with larger brains are more successful in life and more likely to have children*

> *how true is this? Uni-educated people tend to have lower fertility rates.

In the U.S. university education depends mostly on mommy and daddy's wallet size, not brain size.


I don't have data, but I'd assume that wallet size is correlated with brain size.


That would be quite a claim and I certainly wouldn't assume it!


"Children? With these economic conditions?"


"There's no way we could have a child now. Not with the market the way it is, no."

https://www.youtube.com/watch?v=sP2tUW0HDHA


If maternal mortality were the only issue, evolutionary pressure would also favor women with wider hips/birth canals. After all, we see hyper-intelligent individuals at the current brain size; it's clearly possible to get more processing power in there, but there doesn't seem to be much reproductive benefit.


You folks might like the work being done at turso :D https://turso.tech/blog/introducing-limbo-a-complete-rewrite...


Haha, thanks for the tip! libSQL and Limbo do sound pretty interesting.


Had the same thought from the headline, but the punchline is that he's using the VPN he completely built himself and can't even trust that one.


He most likely wouldn't have this problem if he used a VPN product for clueless users. It has nothing to do with trust towards a class of technology and all to do with the fact that computers are hard.


... which is entirely a PEBKAC-type error in this case, as he never tested whether his configuration worked as expected.


Which is not surprising, because according to him, all it takes is running a simple bash script.


This looks interesting for sure:

"ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU" https://arxiv.org/abs/2406.02075#

