What am I missing here? I thought this model needs 46GB of unified memory for 4-bit quant. The Radeon RX 7900 XTX has 24GB of memory, right? Hoping to get some insight, thanks in advance!
MoEs can be efficiently split between dense weights (attention/KV/etc.) and sparse expert (MoE) weights. By keeping the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs, since only a few experts are activated per token.
Not as good as running the entire thing on the GPU, of course.
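To make the arithmetic concrete, here's a rough Python sketch of the split. The parameter count, dense fraction, and KV-cache budget are illustrative assumptions, not figures for any specific model; the point is that only the dense slice has to fit in VRAM, while the much larger expert slice can sit in system RAM:

```python
# Back-of-the-envelope memory split for a MoE model: dense weights
# (attention, embeddings) plus the KV cache stay in VRAM, while the
# sparse expert weights are offloaded to system RAM.
# All numbers here are illustrative assumptions, not measured values.

BYTES_PER_PARAM = 0.5  # ~4-bit quantization

def split_estimate(total_params_b: float, dense_fraction: float, kv_cache_gb: float):
    """Return (vram_gb, ram_gb) for the dense/sparse split."""
    dense_gb = total_params_b * dense_fraction * BYTES_PER_PARAM
    expert_gb = total_params_b * (1 - dense_fraction) * BYTES_PER_PARAM
    return dense_gb + kv_cache_gb, expert_gb  # (GPU, CPU)

# Hypothetical ~100B-parameter MoE where ~15% of the weights are dense:
vram, ram = split_estimate(total_params_b=100, dense_fraction=0.15, kv_cache_gb=4)
print(f"GPU: ~{vram:.0f} GB VRAM, CPU: ~{ram:.0f} GB RAM")
# -> GPU: ~12 GB VRAM, CPU: ~42 GB RAM (fits a 24 GB card plus a 64 GB RAM box)
```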
Thanks to you I decided to give it a go as well (I didn't think I'd be able to run it on a 7900 XTX), and I must say it's awesome for a local model. More than capable for more straightforward stuff. It uses the full VRAM and about 60GB of RAM, but runs at about 10 tok/s and is *very* usable.
Thanks! "OLT" was also new to me. In case others find it helpful:
> OLT = Optical Line Terminal.
> In ISP fiber (typically GPON/EPON) infrastructure, it’s the provider-side device at the central office/headend that terminates and controls the passive optical network. It connects upstream into the ISP’s aggregation/core network and downstream via fiber (through splitters) to many customers’ ONTs/ONUs, handling PON line control, provisioning, QoS, and traffic aggregation.
POSIWID: https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_what_it_does