The site claims 14x lower memory usage, and I'm a bit confused by that. The model file is indeed very small, but on my machine it used roughly the same RAM as 4-bit quants (on CPU).
Though I couldn't get actual English output from it, so maybe something went wrong while running it.
Do I need to build their llama.cpp fork from source?
Looks like they only offer CUDA builds on the releases page. Those might support a CPU mode in principle, but they refuse to even run without CUDA installed. Seems a bit odd to me, since I thought the whole point was supporting low-end devices!
Edit: 30 minutes of C++ compile time later, I got it built. It uses 7 GB of RAM and then hangs at "Loading model". I thought this thing was less memory hungry than 4-bit quants?
Edit 2: Got the 4B version running, but at 0.1 tok/s, and the output seemed nonsensical. For comparison, on the same machine I can run the qwen 3.5 4B model (at 4-bit quant) correctly and about 50x faster.
Could you elaborate on what you did to get it working? I built it from source, but couldn't get the 4B model to produce coherent English.
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (..
Each. ( the or,./, and, can the Altern for few the as ( (.
.
( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern.
. That, on, and similar, and, similar,, and, or in
Literally just downloaded the model into a folder, opened cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just told ngrok the port I was serving it on and voila.
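For anyone who wants to skip the agent step, the setup it landed on amounts to something like this (the model filename and ports here are guesses, not what Cursor actually generated; `-c` is llama.cpp's context-size flag and `-ngl` offloads layers to the GPU):

```shell
# Serve the GGUF locally with a 64k context window (-c) and all layers
# offloaded to the GPU (-ngl). Model filename is an assumption; adjust
# to whichever quant you downloaded from the HF repo.
./llama-server -m ./Bonsai-8B-Q4_K_M.gguf -c 65536 --port 8080 -ngl 99

# Then expose the local server publicly:
ngrok http 8080
```

`llama-server` gives you both a small built-in chat UI and an OpenAI-compatible API on that port, which is why "I don't care through what interface" works out fine.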
Especially since what Anthropic describes here is a bit of a Rube Goldberg machine, which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering whether there are any "good enough" out-of-the-box solutions for it.
Yes, hybrid search is one of the main current use cases we had in mind developing the extension, but it works for old-fashioned standalone keyword-only search as well. There is a lot of art to how you combine keyword and semantic search (there are entire companies like Cohere devoted to just this step!). We're leaving this part, at least for now, up to application developers.
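As a concrete example of that combining step, one common "good enough" baseline is reciprocal rank fusion (RRF), which merges a keyword ranking and a semantic ranking using only ranks, so you never have to calibrate BM25 scores against cosine similarities. A sketch (the constant k=60 is the conventional default from the RRF literature, not anything specific to this extension):

```python
def rrf_fuse(keyword_ranking, semantic_ranking, k=60):
    """Merge two ranked lists of doc ids via reciprocal rank fusion.

    Each document's fused score is the sum of 1 / (k + rank) over the
    rankings it appears in; k dampens the dominance of top positions.
    """
    scores = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked decently by both lists beats one ranked first in only one:
fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"])
```

The "art" the parent mentions is mostly in what you do beyond this: weighting the two sources, filtering before fusing, or handing the top candidates to a cross-encoder reranker like Cohere's.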
I want to say "we structured the system like that, right?", i.e. maximize profit at all costs.
But it seems to be the natural outcome of the incentives of an organization made of organisms in an entropy-based simulation.
i.e. the problem might be slightly deeper than an economic or political model. That being said, we might see something approximating post-scarcity economics in our lifetimes, which will be very interesting.
In the meantime... we might fiddle with the incentives a bit ;)
The upper arm of the K shaped economy uses their capital to invent and control the replicator and the lower arm dies off? Seems like the most realistic path to "post-scarcity" from where we're standing now.
Not sure that would help due to how tokenization works, but I remember from the early GPT-4 days that LLMs have the ability to "compress" a message into an incomprehensible string of Unicode, which the LLM itself understands perfectly, and which is 5-10x shorter than the English text.
That was a big deal when the context size was 8K; now that tokens are cheap and context is huge, nobody seems to be investigating that anymore.
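On the tokenization caveat: most current tokenizers are byte-level BPE, so text that looks short in characters can still cost many tokens, because rare Unicode characters fall outside the learned merges and decompose into several UTF-8 bytes each. A quick illustration of that byte-level gap (pure Python, no tokenizer dependency; the strings are just examples):

```python
# Rare Unicode characters look compact on screen but expand at the byte
# level, which is what byte-level BPE tokenizers actually operate on.
english = "the quick brown fox"  # common words: often ~1 token each
exotic = "𓂀𐍈𑀓⨕"                # 4 characters, but many more UTF-8 bytes

def utf8_bytes(s):
    """Length of a string in UTF-8 bytes rather than characters."""
    return len(s.encode("utf-8"))

print(len(english), utf8_bytes(english))  # ASCII: bytes == characters
print(len(exotic), utf8_bytes(exotic))    # far more bytes than characters
```

So a "compressed" message that is 5-10x shorter in characters is not necessarily 5-10x shorter in tokens, which may be part of why the trick quietly fell out of fashion.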
I noticed that when speaking on a subject I tend to explain it in simple terms, but when writing, I tend to get bogged down in details, pedantry and technical language.
I started publishing my writing recently and I too often fall back into "debugging my mental model" mode, which while extremely valuable for me, doesn't make for very good reading.
I guess the optimal sequence would be to spend a few sessions writing privately on a subject, to build a solid mental model, then record a few talks to learn to communicate it well.
-- Similarly, journaling on paper and with voice memos seems to give me a different perspective on the same problem.
I do this too, and I find it a great use for my LLMs. I write out the full detail as part of integrating it, then Claude or Copilot helps me craft the communication versions.