genpfault's comments (Hacker News)

> "A system's purpose is what it does"

POSIWID: https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...



Yes, I don't even know how I missed this at the time of writing the article. But a must-read for sure!


Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
    --ctx-size 32768


Super cool! Also with `--fit on` you don't need `--ctx-size 32768` technically anymore - llama-server will auto determine the max context size!


Nifty, thanks for the heads-up!


What am I missing here? I thought this model needs 46 GB of unified memory for a 4-bit quant. The Radeon RX 7900 XTX has 24 GB of memory, right? Hoping to get some insight, thanks in advance!


MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

Not as good as running the entire thing on the GPU, of course.
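In llama.cpp this split is typically done with tensor overrides. A hedged sketch (exact flags vary by build, `model.gguf` is a placeholder, and the regex assumes the usual `ffn_*_exps` naming for MoE expert tensors in GGUF files):

```shell
# Offload everything to the GPU by default (-ngl 99), then override:
# route the sparse expert FFN tensors to CPU RAM so only the dense
# weights and KV cache occupy VRAM.
./llama-server -m model.gguf -ngl 99 \
    -ot "\.ffn_(up|down|gate)_exps\.=CPU"
```

Recent builds also offer `--n-cpu-moe N` as a shorthand for keeping the first N layers' expert tensors on the CPU.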


Thanks to you I decided to give it a go as well (didn't think I'd be able to run it on a 7900 XTX) and I must say it's awesome for a local model. More than capable for more straightforward stuff. It uses the full VRAM and about 60 GB of RAM, but runs at about 10 tok/s and is *very* usable.



Because TFA never bothered to define it:

Broadband Network Gateway (BNG)[1]

[1]: https://github.com/codelaboratoryltd/bng#bng-broadband-netwo...


Thanks! "OLT" was also new to me. In case others find it helpful:

> OLT = Optical Line Terminal.

> In ISP fiber (typically GPON/EPON) infrastructure, it’s the provider-side device at the central office/headend that terminates and controls the passive optical network: it connects upstream into the ISP’s aggregation/core network and downstream via fiber (through splitters) to many customers’ ONTs/ONUs, handling PON line control, provisioning, QoS, and traffic aggregation.


Thanks... I was reading the article like, WTF is "BNG"?


Is it the FTTX equivalent of a BRAS?


Yes, exactly. BRAS is functionally the same as BNG.


So what is BRAS?



tap tap tap tap tap


tap tap tap tap tap tap tap


tap tap tap tap tap tap tap tap tap tap tap


Getting ~150 tok/s on an empty context with a 24 GB 7900 XTX via llama.cpp's Vulkan backend.


Again, you're using third-party quantisations, not the weights supplied by Nvidia (which don't fit in 24 GB).


It was right there[1] in the assembly video.

[1]: https://youtu.be/pcAEqbYwixU?t=1038


> Ctrl+S to save

XOFF ignored, mumble mumble
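For anyone who hits this: the terminal driver swallows Ctrl+S as XOFF when software flow control is enabled, so it never reaches the editor. A common per-session fix (assuming a POSIX-ish shell):

```shell
# Disable XON/XOFF software flow control so Ctrl+S passes through
stty -ixon
```

And if your terminal has already frozen, Ctrl+Q (XON) resumes output.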


> triple-backtick code blocks

If only :(

