Also, there is zero reason to think that the big labs haven't had something similar to TurboQuant for a long time already.

The recent blog post from Google announcing TurboQuant does not change anything regarding RAM planning for the big labs.

TurboQuant itself is already a year old! So even smaller labs have probably seen and implemented it.


TurboQuant's specific benefit is compressing the KV cache at a negligible cost to quality. That mainly means context lengths can go up for the same amount of memory; however, the KV cache only accounts for something like 20% of the overall memory footprint, so this will not dramatically decrease memory demands in the way that some of the more sensationalist reporting has stated.
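
Back-of-envelope for the sizes involved (all dimensions below are made up, roughly shaped like a 70B-class model with GQA, not taken from the paper):

    # Rough KV cache sizing; every dimension here is hypothetical.
    layers, kv_heads, head_dim = 80, 8, 128
    bytes_fp16 = 2

    # K and V are each cached once per layer, per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # ~320 KiB
    ctx = 32_768
    cache_gb = per_token * ctx / 1e9          # ~10.7 GB per sequence
    weights_gb = 70e9 * bytes_fp16 / 1e9      # ~140 GB of weights in fp16
    print(f"{cache_gb:.1f} GB cache vs {weights_gb:.0f} GB weights")

The cache grows with context length and with every concurrent sequence in the batch, while the weights are paid once, so its share gets much larger in serving.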

For large providers, KV caches are the main bottleneck, no?

The open-source tooling got quantization support 3 years ago! It was a lesser type of quantization, but more than enough to prove that the savings just go to bigger models.

Cheapest I know of is around $96k.

The memory bandwidth on the M4 Max is 546 GB/s and on the M5 Max 614 GB/s, so not a huge jump.

The new tensor cores, sorry, "Neural Accelerators", only really help with prompt preprocessing, aka prefill, and not with token generation. Token generation is memory bound.
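
Back-of-envelope for why (the active parameter count and quantization below are hypothetical, just to show the shape of the math):

    # Decode is memory bound: every generated token streams (roughly)
    # all active weights from memory once, so bandwidth sets a hard
    # ceiling on tokens/s no matter how fast the matmuls are.
    bandwidth = 614e9        # M5 Max, bytes/s
    active_params = 35e9     # hypothetical ~35B active parameters
    bytes_per_param = 0.5    # 4-bit quantization

    weight_bytes = active_params * bytes_per_param
    print(f"<= {bandwidth / weight_bytes:.0f} tok/s ceiling")  # ~35 tok/s

Bigger tensor cores don't move that ceiling; only more bandwidth does.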

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.


Do any frameworks manage to use the neural engine cores for that?

Most stuff ends up running Metal -> GPU, I thought.


It's referring to the neural cores (for matrix multiplication) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...



I noticed that even on my M3, MLX tends to do prefill a lot faster than llama.cpp with GGML models. Does anyone know how they do it?


Related to this: how do you get the comments you add in the review back into your agent (Claude Code, Cursor, Codex, etc.)? Everybody talks about AI doing the code review, but I want a solution for the inverse: I review the AI's code, and it should then go away, fix all the comments, and update the PR.


What you do is actually read the comments, think about how you can improve the code, and then improve it, whether by telling the agent to do that or by doing it yourself.


There’s a bunch of versions of this out there. This one’s mine, but it’s based on other ones. It works really well. It assesses the validity and importance of each comment, then handles it appropriately, creating issues, fixing the code, adding comments, updating the GH Copilot instructions file, etc.

https://github.com/cboone/cboone-cc-plugins/blob/main/plugin...


I tell claude code “review the comments on this PR” and give it the url, and that’s enough. It then uses the gh cli tool and fetches the PR and individual comments.


I suspect you don't need anything special for this. The GH API has support for reading comments from PRs. Maybe have it maintain a small local store to remember the IDs of the comments it's already read so it doesn't try to re-implement already-implemented fixes. Another similar thing you can do is a hook that reminds it to start a subagent to monitor the CI/autofix errors after it creates/updates a PR.


The GitHub API is actually quite tricky here because there is a difference between "comment" and "review" and "review comment" (paraphrasing, I don't remember the details). So it's not as simple as one API call that grabs the markdown. Of course you can write a creative one-liner to extract what you need, though.
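
For reference, the distinction is real, and the three flavors live at three different REST endpoints. A minimal sketch using the documented REST API (owner/repo/PR number and the token env var are placeholders):

    import os, requests

    owner, repo, pr = "OWNER", "REPO", 1  # placeholders
    h = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    base = f"https://api.github.com/repos/{owner}/{repo}"

    # 1. Top-level conversation comments (a PR is an issue underneath):
    conversation = requests.get(f"{base}/issues/{pr}/comments", headers=h).json()
    # 2. Reviews (approve / request changes, with an optional summary body):
    reviews = requests.get(f"{base}/pulls/{pr}/reviews", headers=h).json()
    # 3. Review comments (the inline, line-anchored diff comments):
    inline = requests.get(f"{base}/pulls/{pr}/comments", headers=h).json()

    for c in conversation + inline:
        print(c["user"]["login"], ":", c["body"][:80])

The gh CLI can hit the same paths, e.g. gh api repos/OWNER/REPO/pulls/1/comments.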


I don't use it, but you can tag @copilot on GitHub comments and it will do so.

I don't do it because the chances of me reviewing vomited code are close to 0.


I used Claude Opus 4.5 inside Cursor to write RISC-V Vector/SIMD code. Specifically Depthwise Convolution and normal Convolution layers for a CNN.

I started out by letting it write a naive C version without intrinsics, and validated it against the PyTorch version.
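
A toy version of that validation setup (not my actual code, just the shape of it): depthwise convolution applies one filter per channel, and groups=channels makes PyTorch's conv2d depthwise, so the reference check is only a few lines.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def depthwise_naive(x, w):
        """x: (C, H, W), w: (C, kH, kW); stride 1, no padding."""
        C, H, W = x.shape
        _, kH, kW = w.shape
        out = np.zeros((C, H - kH + 1, W - kW + 1), dtype=x.dtype)
        for c in range(C):                       # one filter per channel
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    out[c, i, j] = np.sum(x[c, i:i+kH, j:j+kW] * w[c])
        return out

    x = np.random.randn(16, 32, 32).astype(np.float32)
    w = np.random.randn(16, 3, 3).astype(np.float32)
    ref = F.conv2d(torch.from_numpy(x)[None], torch.from_numpy(w)[:, None],
                   groups=16)[0].numpy()
    assert np.allclose(depthwise_naive(x, w), ref, atol=1e-4)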

Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.

Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.

It then tested a few different variants, ran them on the target (a RISC-V SBC, the OrangePI RV2) to check if they improved runtime, and continued from there. It did this 10 times, until it arrived at the final version.

The final code is very readable, and faster than any other library or compiler output that I have found so far. I think the clear guardrails (output has to exactly match the reference output from PyTorch, performance must be better than before) make this work very well.


I am really surprised by this. While I know it can generate correct SIMD code, getting a performant version is non-trivial, especially for RVV, where the instruction choices and the underlying microarchitecture significantly impact performance.

IIRC, depthwise convolution is memory-bound, so the bar might be lower. Perhaps you can try something with higher compute intensity, like a matrix multiply. I have observed that it trips up on columnar accesses with SIMD.


I think the ability to actually run the code on the target helped a lot with understanding and optimizing for the specific microarchitecture. Quite a few of the ideas turned out not to be optimal and were discarded.

It's also important to have a few test cases the agent can quickly check against. It will often generate wrong code, but if that is easily detectable, the agent can fix it and continue quickly.


Can you share the code?


Seems like the hiker at the bottom of the article was introduced in 1997 and removed only in 2017: https://s.geo.admin.ch/be66brq5oby9


I have the PCIe version of NanoKVM, and I am also happy with it.

The big advantage of the PCIe version is that it does not take up space on the desk, and all the cables for ATX power control are inside the PC case.

Full-sized HDMI is nice; the only limitation here is the 1080p resolution. 1440p or higher would allow mirroring the main monitor's output to the NanoKVM, but this is probably a weird use case anyway.


It says in the second sentence of the Hugging Face page that there are multiple sizes: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct

You won't be out of work creating ggufs anytime soon :)


:)


The VPN product is very good; it's basically a thin wrapper around Mullvad, arguably the best VPN on the planet right now, at least from a privacy standpoint.


It's not anymore. They blocked port forwarding, which interferes with torrenting, and are moving away from OpenVPN, which I need.

In my opinion they are well on their way to enshittification, and I moved to ProtonVPN.


Why do you think it's a negative result? The table on page 9 shows great results.


I think it's a pun. AlphaZero? AlphaNegative.


-273°C, isn't it?

