"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, it's prompt processing (PP). So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a Mac Studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU-based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
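To put rough numbers on why prompt processing dominates, here is a minimal back-of-the-envelope sketch. The prompt size, output size and the GPU prefill speed are hypothetical values picked for illustration, not measurements; only the 10 tok/s decode figure and the 20-50X slowdown range come from the comment above (30X is used as a midpoint).

```python
def request_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Wall time for one request: prefill the whole prompt, then decode."""
    prefill = prompt_tokens / pp_tok_s   # prompt processing (PP)
    decode = output_tokens / tg_tok_s    # token generation (TG)
    return prefill + decode

# Hypothetical workload: a coding-agent style request with a long context.
PROMPT_TOKENS = 30_000
OUTPUT_TOKENS = 500

# Hypothetical all-GPU setup: 3,000 tok/s prefill, 25 tok/s decode.
gpu_seconds = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS,
                              pp_tok_s=3_000, tg_tok_s=25)

# Same workload with prefill 30X slower and decode at 10 tok/s.
offload_seconds = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS,
                                  pp_tok_s=3_000 / 30, tg_tok_s=10)

print(f"all-GPU: {gpu_seconds:.0f}s, offloaded: {offload_seconds:.0f}s")
```

With these assumed inputs the decode gap is only 2.5X, but the total wait is more than 10X longer because almost all of it is prefill.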
An Nvidia Spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... i.e. 2 x 100Gb/s ports, they may even be 2 x 200Gb/s. Once I've got my paws on one, I'll know more.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
I have one, and I love it. That said, my buddy's Mac smokes it for inference workloads in terms of tokens per second AND it's more usable for other things.
If you are training and doing research it's great, if you want to cluster them it can't be beat, but if you just want local inference on a single box buy a Mac or even a Strix Halo device.
Get your buddy with his smoking Mac to allow multiple concurrent connections and see how it gets on compared to your Spark. Don't ever use a single "chat" test to derive performance - try running say 10 or more.
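A minimal sketch of what "run 10 or more at once" could look like. `fake_generate` is a hypothetical stand-in that just sleeps and returns a token count; swap it for a real call to whatever server is being tested. The point is only that the number to compare is aggregate tokens per second across all concurrent clients, not the speed of a single chat.

```python
import asyncio
import time

async def fake_generate(prompt: str) -> int:
    """Hypothetical stand-in for a real inference request.

    Returns the number of tokens "generated" for the prompt.
    """
    await asyncio.sleep(0.01)  # pretend the server is working
    return 100

async def bench(generate, prompts):
    """Fire every prompt concurrently and report aggregate throughput."""
    start = time.perf_counter()
    counts = await asyncio.gather(*(generate(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, total / elapsed

def run(n_clients: int = 10):
    prompts = [f"request {i}" for i in range(n_clients)]
    return asyncio.run(bench(fake_generate, prompts))

total_tokens, aggregate_tok_s = run(10)
print(f"{total_tokens} tokens, {aggregate_tok_s:.0f} tok/s aggregate")
```

Running the same harness with 1, 10 and more clients against each machine shows how well it batches, which a single "chat" test hides.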
You might also notice that your Spark has a pair of QSFP28 or DD (not sure yet) type interfaces as well as the 10Gb/s Ethernet - that network card is a right old beast and adds quite a lot to the cost. It is capable of either 200 or 400Gb/s and can be split into two lots of four. Your mate's Mac probably has a wifi connection and is too cool for Ethernet 8)
That NIC is there for a good reason - the Spark wants some friends to cluster with and you will absolutely spank any Mac when you spaff Mac style money on say three of these beasts and some cables and cluster them up. If you want four or more, you will need a switch and Mikrotik and others have them.
Casual "tokens per second" in AI is a bit like gamers whittering on about "ping" when they are using TCP and UDP for their games. ICMP request/response is a handy way of testing network paths and can give some indications towards potential performance limitations.
can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...
I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the Spark really shines for me.
I'm a Linux guy, but also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the Nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
Not the new ones. Only the M1 and M2 have good support for Asahi. But you really don't need it. If you need Linux, use a VM (UTM is free and is equivalent to KVM/QEMU in speed, despite being a Type-2 Hypervisor.)
pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.
The DGX Spark has basically the same memory bandwidth as an M5 Pro, and far more than an M5.
Only the M3 Ultra really beats it, and once you start scoping out the cost of an M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.
> The DGX Spark has basically the same memory bandwidth as an M5 Pro, and far more than an M5.
I see ~274 GB/sec for the DGX Spark[1], versus 307 GB/sec for M5 Pro and 460 or 614 GB/sec for M5 Max[2]. One might call 90% "basically the same", but there are nominally two tiers above "Pro".
Yes, a MacBook Pro with 128 GB and M5 Max costs $5100 (14") or $5400 (16") versus currently $4700 for the DGX Spark, but the MBP includes keyboard, mouse, battery and portability. I believe its prefill is slower and you get 2 TB vs 4 TB SSD, but overall one gives up a lot to save 10% of the cost.
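For anyone who wants to check the ratios, a quick sketch using only the figures quoted in this comment (bandwidth in GB/s, prices in USD). Pairing the 128 GB MacBook Pro with the 614 GB/s figure rather than 460 GB/s is an assumption here.

```python
# Figures as quoted above; nothing here is independently measured.
SPARK_BW, M5_PRO_BW, M5_MAX_BW = 274, 307, 614   # GB/s (614 assumed for 128 GB)
SPARK_PRICE, MBP14_PRICE, MBP16_PRICE = 4700, 5100, 5400  # USD

# How close is the Spark to the M5 Pro and M5 Max on bandwidth?
spark_vs_pro = SPARK_BW / M5_PRO_BW
spark_vs_max = SPARK_BW / M5_MAX_BW

# How much cheaper is the Spark than each MacBook Pro?
saving_vs_14 = 1 - SPARK_PRICE / MBP14_PRICE
saving_vs_16 = 1 - SPARK_PRICE / MBP16_PRICE

print(f"Spark bandwidth: {spark_vs_pro:.0%} of M5 Pro, {spark_vs_max:.0%} of M5 Max")
print(f"Spark saves {saving_vs_14:.0%} vs 14-inch, {saving_vs_16:.0%} vs 16-inch")
```

On these numbers the saving lands between roughly 8% and 13%, which is where the "10% of the cost" figure comes from, while the M5 Max has more than twice the bandwidth.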
I looked, but a sibling comment just provided the links. ~274 GB/sec for the DGX Spark, vs. 307 GB/sec for M5 Pro, and max 614 GB/sec (!!!) for M5 Max? Why would you completely friggin’ lie about this, or at minimum, not double-check your facts before bullshitting? Plus, you get a full-fledged computer along with it!
Apple could actually be a good deal and you folks would still make up something to not justify it. In a way, it’s amazing what Apple has accomplished: a baseless, negatively tainted perception in certain influential tech circles.
(To be fair, they’re kind of earning it. I’m glad Tim “Sweet T” Cook is departing.)
Plus, my original comment got downvoted despite being factually correct. Thanks, Reddit. Oh, wait…
Yep. Memory bandwidth is what decides how fast LLMs generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the M5 Ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.
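The "theoretically double" claim follows from a simple ceiling: during decode, each new token needs roughly one full read of the active weights, so tokens per second cannot exceed bandwidth divided by weight size. A rough sketch, assuming a hypothetical 27 GB of weights (about a 27B model at 8 bits) and ignoring KV cache traffic, compute limits and software overhead:

```python
def max_decode_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb

WEIGHTS_GB = 27.0  # hypothetical model size, for illustration only

spark_ceiling = max_decode_tok_s(270, WEIGHTS_GB)  # ~270 GB/s quoted above
mac_ceiling = max_decode_tok_s(615, WEIGHTS_GB)    # ~615 GB/s quoted above
ratio = mac_ceiling / spark_ceiling

print(f"Spark ceiling {spark_ceiling:.1f} tok/s, Mac ceiling {mac_ceiling:.1f} tok/s")
print(f"theoretical ratio {ratio:.2f}x")
```

The ratio does not depend on the assumed model size. The gap between that ceiling and the ~25% observed is the "(mostly)": real decode is not purely bandwidth-bound.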
The Spark can fine-tune models in 1/4 the time and excels at other compute tasks in ways that the Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
In case anyone was wondering, my Spark is basically silent as well. It's great at being ignored, if that's really important to you. I've run mine completely headless since I bought it, including setup.
It is 2x200Gb/s physically but the PCIe bandwidth is basically only 200Gb/s so it may as well be one, and actually it's a weird 2xPCIe4 not 1xPCIe8 so it appears in software as dual 100Gb/s. It's a bit odd.
DGX Spark is ~273GB/s. That’s about M5 Pro territory, and twice as fast as the M5. You’d have to go to the M5 Max, or M3 Ultra, to get higher memory bandwidth than the Spark.
>> "Unauthorized Immigration. As noted earlier, we use unauthorized immigrants to refer to individuals who enter the U.S. without formal admission under immigration law. A large share of these individuals are encountered by federal authorities at ports of entry, along the border, or in the interior and are subsequently issued an NTA in immigration court, allowing them to seek asylum or otherwise challenge removal. [...]"
Probably, and it seems a good compromise to me. Under asylum law, "illegal" is technically wrong until the final judgement is rendered. And "undocumented" is IMO an obvious manipulation of language (you would not call a doctor practicing without a medical license "undocumented"). Pending a decision on legality, "unauthorized" seems both neutral and correct.
Maybe some (or many) people believe that more people will make it less "lovely". I think this is a popular stance and I think many people are more than satisfied with the current population density of their area.
I’m worried about giving a foreign hosted service access to my machine for a coding agent that can run arbitrary commands and read arbitrary files. Coding agents are much less useful if you have to sit there clicking approve on everything.
I guess I was speaking as an American: we have good domestically hosted options, so although it’s probably not ideal to send this kind of data/control anywhere at all, it’s definitely a worse option for us to send it to China vs to an American company. Every user of this service has left their machine trivially exposed to becoming part of a botnet. I’m wondering why I don’t see this angle discussed more in here.
Again I’m not saying you should trust an American company necessarily more than a Chinese one, but as an American, I probably can.
On the other hand, an American company can sell your chats to adtech/insurance/your government in ways that can harm you quite directly. Something worth considering.
As an American you should probably not trust them, because the giant American company has way more (legal or semi-legal) opportunities to manipulate your life than a Chinese one.
Haven't they already proven to be extremely useful? In some areas they are definitely here to stay, coding/software and search (retrieve and summarize information). There's a bunch of places where they are surely shoehorned in, overhyped, and don't belong, but there's also equally many places where they might still be transformative but aren't used yet.
But overall I think the technology is well proven.
I always leave room open for failure, and that approach has generally served me well personally and professionally. I have never been punished for having an exit strategy.
Besides, the marketplace is still in its infancy for LLMs, with a lot of unanswered questions. A lot of those questions surround the commercial viability of frontier models running on bespoke hyperscaler data centers, which have limited use outside of LLMs should those economics prove non-viable. Since that's where the memory is being tied up, it's a critical question to answer in order to determine long-term investment needs for further memory fabrication.
I got an RTX 6000 Pro too. I like running locally, I've learned a lot more than if I had used an API and there's less worry about overspending tokens. I accidentally spent $100 on the Claude API in like 2 days because I didn't know what I was doing.
The problem is that while one of these GPUs is a huge improvement over a laptop or a single 3090, you very quickly wish you had more. I would buy a second one, but I did the math and realized that with the current crop of models, 2 Blackwells don't buy me any new capability that I didn't have with one. So I would need a 3rd one. And when I buy a 3rd one I will feel like I want to run a higher quant, so then I will want a 4th.
A pair of RTX6000 cards will give you a good performance boost due to tensor parallelism, though. I haven't tried the newest predictive quants but I see about 35 tps when running the 8-bit Qwen 3.6 27B model on one board and about 50 tps on two. Probably could come close to 100 tps on an optimized setup with the latest GGUFs.
Also, the 4-bit quants of MiniMax 2.7 will run at 100 tps or so with two cards, which is pretty decent. It doesn't go any faster at all with 4 GPUs from what I've seen, so if you don't actively need 384 GB of VRAM, 2x RTX6000 is a good place to be.
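As a sanity check on what the second card buys, a small sketch of speedup and parallel efficiency from the tokens-per-second figures quoted above (35 tps on one board, 50 tps on two). The helper name is made up for illustration.

```python
def scaling(tps_single: float, tps_multi: float, n_gpus: int):
    """Return (speedup, efficiency) relative to perfect linear scaling."""
    speedup = tps_multi / tps_single
    efficiency = speedup / n_gpus
    return speedup, efficiency

# Figures quoted above for the 8-bit 27B model.
speedup_2x, efficiency_2x = scaling(35, 50, n_gpus=2)

print(f"2 GPUs: {speedup_2x:.2f}x speedup, {efficiency_2x:.0%} efficiency")
```

That is about a 1.4x speedup at roughly 71% efficiency, which fits the observation that going from 2 to 4 GPUs adds capacity rather than speed.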
Using an Epyc platform to get plenty of PCIe lanes and memory channels. I have a couple of extra 3090s plugged in which get some offload and help with larger models that don't fit entirely on the Blackwell.
At this point IPOs are mainly for unloading bags onto retail. Every institution who wanted a piece of these labs got in years ago and captured all the value.
Well, sad to say this is simply untrue for a few reasons.
1. "Retail" does not have enough purchasing power to have all of these "bags" unloaded on to.
2. Institutions buy shares in public firms post-IPO all the time even when they're "unloading bags onto retail". Take Uber (random example): ~83% is owned by institutions.
3. General factual history of the stock market shows that you are incorrect. Successful companies that IPO and continue to do business still have quite a lot of room left to grow. What was Google's market capitalization at IPO? What is it now? Is it possible some early investors made higher multiples than the IPO -> May 20th valuation? Yea for sure. That doesn't mean that all the value was captured. It also doesn't take into account the early stage risk for investing. Is Google an "at this point IPO"? No, but the principle is the same.
It's also worth mentioning however that the number of IPOs is going down over time. You could maybe argue that the only ones that actually IPO are all the bags, but that seems like a stretch.
These cynical comments "IPOs are mainly for unloading bags on to retail" lack explanatory power and data.
It's absolutely true. Just look at how private equity is now getting access to public markets and retirement accounts[0]. You think PE is letting the little guys in out of the goodness of their hearts? No, they've extracted as much as they can and the market is starting to question the absurd valuation of private assets.
A wise man once said: "if you're given an opportunity to cut an amazing deal and you can't tell who's getting screwed, then it's probably you"
So I take it you're going to buy shares of OpenAI on opening day then? ;)
Institutions merely owning a newly-IPO'd stock means nothing. They get access to shares at a reasonable price before opening while retail is buying at insane prices after open. See Figma as an example where institutional investors got it at $33/share and it ended the IPO day at $115/share with retail buying all the way up (including pops above that at like $127)
I thought it was common knowledge that IPOs are a way for insiders and early investors (not IPO flippers) to get a nice exit during the frenzy.
> So I take it you're going to buy shares of OpenAI on opening day then? ;)
Probably not. Do you understand however that your comment does not make sense in the context of my comment?
> Institutions merely owning a newly-IPO'd stock means nothing. They get access to shares at a reasonable price before opening while retail is buying at insane prices after open. See Figma as an example where institutional investors got it at $33/share and it ended the IPO day at $115/share with retail buying all the way up (including pops above that at like $127)
It also doesn't mean nothing - you have to go and analyze any given stock to make these kinds of claims on a per-IPO/equity basis. You also are ignoring traders and trading algorithms run by... big institutions and trading firms, and you're not accounting for volume or accounting for post-IPO purchases nor breaking those down by segment. In other words, you're just making stuff up.
I understand that, but what I'm not understanding is why this seems to be a concern. I suppose equity given to early employees is a problem too and they're just "dumping their bags on retail" after their lockup period expires?
Earlier stage investors take risk and are rewarded for that. Most companies go bankrupt and folks lose their principal. For the companies that are successful yea some go bust after IPO - so what? Are you against public markets or something? That would at least be an interesting discussion.
Google IPO'd in 2004 and returned from what I'm reading about 6,500% after IPO (and this was in 2024, so the gains have gone up much higher since then) and all of that was the bags dumped on retail. If someone wants to dump their 6,500% return on me I'll take them up on that all day every day and twice on Sunday.
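To make that 6,500% concrete, a quick sketch converting it into a multiple and an annualized rate. It assumes the figure is a total return over roughly 20 years (2004 to 2024) and ignores anything paid out along the way.

```python
def annualized_return(total_return_pct: float, years: float) -> float:
    """Convert a total percentage return into a compound annual growth rate."""
    multiple = 1 + total_return_pct / 100   # 6,500% gain means 66x the stake
    return multiple ** (1 / years) - 1

multiple = 1 + 6500 / 100
cagr = annualized_return(6500, years=20)   # 20 years is an assumption

print(f"{multiple:.0f}x overall, about {cagr:.1%} per year")
```

That works out to roughly 23% a year compounded, all of it earned by whoever held the shares after the IPO.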
I use Claude/Gemini as my homepage now (I have to keep switching as these companies make "updates" that periodically render their models useless). Even if I want to search for simple things, I would rather have an LLM wade through the results and extract just the information I asked for. SEO, and now mountains of slop content, have made this necessary. Only a matter of time before the SEO industry at large figures out how to game LLMs too, making them equally useless.
I already saw an article recently about how to set up a business domain which can reliably show up in a search result and dump overly positive reviews into anyone's context.