> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.
No, 4bit quantization is the typical case.
At 4bit you can fit twice the parameters of 8bit in the same space for far better performance/perplexity/quality.
Running LLMs at higher than 4bit is atypical and almost always sub-optimal: for the same memory, a model twice the size in 4bit beats a model in 8bit.
Even pretraining and finetuning in 4bit is likely to become the norm soon as fp4 becomes better understood.
You're just wrong. You're looking at the wrong numbers. The perplexity score of a model with twice the parameters in half the bits (4bit) is FAR LOWER (ie better).
If you are limited to X RAM and have two 16bit models of size 4X and 2X then the 4X model in 4bit will always be far superior to the 2X model in 8bit, with far lower perplexity.
Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit perplexity of 5.9069. That is over 0.54 lower perplexity for the same RAM amount by using 4bit! That is MASSIVE!
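(Rough back-of-the-envelope check on the "same RAM amount" part, counting weights only and ignoring per-block scale overhead and activation/KV-cache memory; a minimal sketch, not llama.cpp's actual accounting:)

```c
#include <stdio.h>

/* Approximate weight-only memory footprint: params * bits / 8 bytes.
   Ignores per-block scales/zero-points and runtime buffers. */
static double weight_gb(double params, double bits) {
    return params * bits / 8.0 / 1e9;
}

int main(void) {
    printf("13B @ 4-bit: %.1f GB\n", weight_gb(13e9, 4)); /* ~6.5 GB */
    printf(" 7B @ 8-bit: %.1f GB\n", weight_gb(7e9, 8));  /* ~7.0 GB */
    return 0;
}
```

So the two configurations land within about half a gigabyte of each other, which is why the perplexity comparison above is roughly apples-to-apples.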
Another factor is that larger models degrade less when quantized.
You have to wonder if running a huge model, say, 300B parameters at 2-bit quantization might be "optimal" in that it would fit into a single A100 or H100 GPU and likely outperform an 80B parameter 8-bit model...
Not sure here. The LLaMA models, yes - all their weights fit in the small range between -2.0 and 2.0.
But some other models have wilder distributions with even crazier outliers, like a weight of 12.00 sitting in a long array of typical small numbers around 0.00.
I've read about an attempt to quantize an RWKV model to 4/5 bits that fell short due to the presence of outlier weights.
The author mentioned somewhere that the bigger models had worse perplexity because of this.
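To see why a single outlier is so damaging, here's a minimal sketch using naive per-tensor absmax quantization to signed 4 bits (real quantizers typically use per-block scales, which softens but doesn't eliminate the effect):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Mostly small weights plus one large outlier. */
    float w[6] = {0.01f, -0.02f, 0.015f, -0.01f, 0.02f, 12.0f};

    /* Naive per-tensor absmax scaling into the signed 4-bit range [-7, 7]. */
    float absmax = 0.0f;
    for (int i = 0; i < 6; i++)
        if (fabsf(w[i]) > absmax) absmax = fabsf(w[i]);
    float scale = absmax / 7.0f; /* ~1.71 here */

    for (int i = 0; i < 6; i++) {
        int q = (int)roundf(w[i] / scale); /* quantized value    */
        float back = q * scale;            /* dequantized weight */
        printf("w=% .3f  q=%2d  back=% .3f\n", w[i], q, back);
    }
    /* Every small weight quantizes to 0; only the outlier survives. */
    return 0;
}
```

Because the 12.0 outlier sets the scale, all the typical ~0.01 weights collapse to zero, which is exactly the failure mode described above.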
There's also research showing that the perplexity hit from quantization shrinks at higher parameter counts. E.g. a 65B parameter model barely shows any impact at all when going from 16bit to 4bit.
Well, if you have a fixed RAM size, you're better off with the largest model you can fit at 4 bits (13B at 4bit is way better than 7B at 16bit despite taking less than half the memory).
No it isn't, quantization is not free. You lose a significant amount of performance that you are not measuring properly in automated benchmarks when you quantize to that level.
You can see it in real time when you compare most LLMs at different quantization levels. I can see the degradation quite badly even at 8 bits, even in the largest LLaMA.
Quantization is not free, but VRAM is even less free.
If you have X amount of VRAM and can fit a 16bit model of size 2X in 8bit or a model of size 4X in 4bit then the 4X model in 4bit is ALWAYS superior with lower perplexity and better performance.
You LOSE performance by using a smaller model in 8bit vs a larger model in 4bit.
Can somebody please explain how quantization below 8 bit works? Since a byte is the smallest addressable unit I think, is the dimensionality of the weights somehow reduced?
[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.
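A minimal sketch of that idea in C (symmetric quantization with a single shared scale is assumed here purely for illustration; real implementations use per-block scales and packed storage, as discussed below):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float w[4] = {0.8f, -0.3f, 0.1f, -1.2f}; /* original float weights */
    float x[4] = {1.0f,  2.0f, 0.5f, -1.0f}; /* activations            */

    /* One shared scale so the largest |w| maps to 7 (signed 4-bit range). */
    float scale = 1.2f / 7.0f;

    /* Quantize: store only small integers plus the one float scale. */
    int q[4];
    for (int i = 0; i < 4; i++)
        q[i] = (int)roundf(w[i] / scale);

    /* Multiply: the scale factors out of the dot product,
       so it is applied once after the accumulation. */
    float acc = 0.0f;
    for (int i = 0; i < 4; i++)
        acc += (float)q[i] * x[i];
    float y = scale * acc;

    /* Reference result with the original float weights. */
    float y_ref = 0.0f;
    for (int i = 0; i < 4; i++)
        y_ref += w[i] * x[i];

    printf("quantized: %f   reference: %f\n", y, y_ref);
    return 0;
}
```

The point is that only small integers plus a scale need to be stored, and the scale is applied once at the end rather than per weight.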
Software can address units of any size, by packing and unpacking bits from bytes (or more likely words) in the underlying implementation. I don’t know about any specific NN implementation here, just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.
Generally, since memory is byte-addressable, you load data that is packed into bytes. It's the compute instructions that operate on the specific bit widths needed.
So in this case you would load a byte holding two 4-bit values, and then a 4-bit ADD or MAC would operate on them.
If you don't have those instructions, you need to sign/zero-extend or convert the smaller bit widths to 8/16/32 bits, whichever is available.
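For the storage side, here's a minimal sketch of packing two signed 4-bit values into one byte and recovering them with sign extension (the general idea only; actual kernel layouts differ):

```c
#include <stdint.h>
#include <stdio.h>

/* Pack two signed 4-bit values (each in [-8, 7]) into one byte. */
static uint8_t pack4(int8_t lo, int8_t hi) {
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Sign-extend a 4-bit two's-complement nibble (0..15) back to -8..7. */
static int8_t sext4(uint8_t nibble) {
    return (int8_t)((nibble ^ 0x08) - 0x08);
}

int main(void) {
    uint8_t b = pack4(-3, 7);
    int8_t lo = sext4(b & 0x0F);
    int8_t hi = sext4(b >> 4);
    printf("packed=0x%02X  lo=%d  hi=%d\n", b, lo, hi); /* lo=-3, hi=7 */
    return 0;
}
```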
> llama.cpp which runs a 13 billion parameter model on a 6GB GPU
I think that's a typo there too: the 13B model needs something like 10G of memory at 4 bits; it's the 7B one that fits into 6G. Well, unless you do the split thing with some layers on the CPU, I guess.