
The site says 14x less memory usage, which confuses me a bit: the model file is indeed very small, but on my machine it used roughly the same RAM as 4-bit quants (on CPU).

Then again, I couldn't get actual English output from it, so maybe something went wrong while running it.


But brother, he is shipping at inference speed!

Does anyone know how to run this on CPU?

Do I need to build their llama.cpp fork from source?

Looks like they only offer CUDA builds on the releases page, which I think might support CPU mode but refuse to even run without CUDA installed. Seems a bit odd to me; I thought the whole point was supporting low-end devices!

Edit: 30 minutes of C++ compile time later, I got it built, although it uses 7GB of RAM and then hangs at "Loading model". I thought this thing was less memory-hungry than 4-bit quants?

Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output was nonsensical. For comparison, on the same machine I can run the Qwen 3.5 4B model (at a 4-bit quant) correctly and about 50x faster.
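
For reference, the CPU-only build boiled down to something like this (a sketch, assuming their fork follows upstream llama.cpp's CMake flow; the model path is just an example):

  git clone https://github.com/PrismML-Eng/llama.cpp.git
  cd llama.cpp
  cmake -B build                            # no -DGGML_CUDA / -DGGML_METAL: CPU-only build
  cmake --build build --config Release -j$(nproc)
  ./build/bin/llama-cli -m ~/Downloads/Bonsai-4B.gguf -p "hi"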


Thanks. Did you need to use Prism's llama.cpp fork to run this?

Yep.

Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.

Sample output below (the model's response to "hi" in the forked llama-cli):

X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in


I have an older M1 Air with 8GB, but I'm still getting over 23 t/s on the 4B model, and the quality of the output is on par with top models of similar size.

1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`

2. Then (assuming you already have the Xcode build tools installed):

  cd llama.cpp
  cmake -B build -DGGML_METAL=ON
  cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
3. Finally, run it (adjust the arguments as needed):

  ./build/bin/llama-server -m ~/Downloads/Bonsai-8B.gguf --port 80 --host 0.0.0.0 --ctx-size 0 --parallel 4 --flash-attn on --no-perf --log-colors on --api-key some_api_key_string
Model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
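
Once it's up, a quick sanity check from the command line (assuming the fork keeps upstream llama-server's OpenAI-compatible endpoint; the port and API key match the command above):

  curl http://localhost:80/v1/chat/completions \
    -H "Authorization: Bearer some_api_key_string" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hi"}]}'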

To the author: why is this taking 4.56GB? I was expecting it to be under 1GB for a 4B model. https://ibb.co/CprTGZ1c

And this is while I'm serving zero prompts; I've just loaded the model (using llama-server).


I did this: https://image.non.io/2093de83-97f6-43e1-a95e-3667b6d89b3f.we...

Literally just downloaded the model into a folder, opened Cursor in that folder, and told it to get it running.

Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main

I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just pointed ngrok at the port I was serving on, and voilà.
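
For reference, the ngrok step is a single command (the port below is just an example; use whatever port llama-server ended up listening on):

  ngrok http 8080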


Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?

My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.

https://www.anthropic.com/engineering/contextual-retrieval

Especially since what Anthropic describes here is a bit of a Rube Goldberg machine, which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.


Yes, hybrid search is one of the main current use cases we had in mind when developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies, like Cohere, devoted to just this step!). We're leaving this part, at least for now, up to application developers.

I want to say "we structured the system like that, right?", i.e. to maximize profit at all costs.

But it seems to be the natural outcome of the incentives, of an organization made of organisms in an entropy-based simulation.

i.e. the problem might be slightly deeper than an economic or political model. That being said, we might see something approximating post-scarcity economics in our lifetimes, which will be very interesting.

In the meantime... we might fiddle with the incentives a bit ;)


> we might see something approximating post-scarcity economics in our lifetimes

Can you elaborate more on this? All I see is growing inequality.


The upper arm of the K-shaped economy uses its capital to invent and control the replicator, and the lower arm dies off? Seems like the most realistic path to "post-scarcity" from where we're standing now.

How can I learn more about this? I looked into it recently but didn't get very far.

This seems like the kind of thing that should be more widely known, and have some good tutorials written for it :)


The Wikidata documentation is good:

https://www.wikidata.org/wiki/Wikidata:Introduction

And you can find lots of SPARQL examples here:

https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
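
For a quick first query, you can hit the public endpoint straight from the command line. A minimal sketch (wdt:P31 is "instance of" and wd:Q146 is "house cat", the canonical introductory example):

  curl -G 'https://query.wikidata.org/sparql' \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode 'query=SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    } LIMIT 5'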


Very interesting. What are goals, fitness and mutation in this context?

I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)


Have you tried asking it to remove vowels?

Not sure that would help due to how tokenization works, but I remember from the early GPT-4 days that LLMs have the ability to "compress" a message into an incomprehensible string of Unicode, which the LLM itself understands perfectly, and which is 5-10x shorter than the English text.

That was a big deal when the context size was 8K; now that tokens are cheap and context is huge, nobody seems to be investigating that anymore.


I'm a fan of Dr. Seuss mimicry; the extra tokens are worth the entertainment.

"I told it don't make mistakes, and don't use a lot of tokens! I'm a 10x Engineer now!" (=

I noticed that when speaking on a subject I tend to explain it in simple terms, but when writing, I tend to get bogged down in details, pedantry and technical language.

I started publishing my writing recently, and I too often fall back into "debugging my mental model" mode, which, while extremely valuable for me, doesn't make for very good reading.

I guess the optimal sequence would be to spend a few sessions writing privately on a subject, to build a solid mental model, then record a few talks to learn to communicate it well.

Similarly, journaling on paper and with voice memos seems to give me a different perspective on the same problem.


I do this too, and I find it a great use for my LLMs. I write out the full detail as part of integrating it; Claude or Copilot helps me craft the communication versions.
