> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.
No, 4bit quantization is the typical case.
At 4bit you can fit twice the parameters of 8bit in the same space for far better performance/perplexity/quality.
Running LLMs at higher than 4bit is atypical and almost always sub-optimal: for the same memory, a model twice the size in 4bit beats a model in 8bit.
Even pretraining and finetuning in 4bit is likely to become the norm soon as fp4 becomes better understood.
You're just wrong. You're looking at the wrong numbers. The perplexity score of a model with twice the parameters in half the bits (4bit) is FAR LOWER (ie better).
If you are limited to X RAM and have two 16bit models of size 4X and 2X then the 4X model in 4bit will always be far superior to the 2X model in 8bit, with far lower perplexity.
Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit perplexity of 5.9069. That is over 0.54 lower perplexity for the same RAM amount by using 4bit! That is MASSIVE!
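(Rough back-of-the-envelope check on the "same RAM amount" part, counting weights only and ignoring per-block scale overhead and activation/KV-cache memory; a minimal sketch, not llama.cpp's actual accounting:)

```c
#include <stdio.h>

/* Approximate weight-only memory footprint: params * bits / 8 bytes.
   Ignores per-block scales/zero-points and runtime buffers. */
static double weight_gb(double params, double bits) {
    return params * bits / 8.0 / 1e9;
}

int main(void) {
    printf("13B @ 4-bit: %.1f GB\n", weight_gb(13e9, 4)); /* ~6.5 GB */
    printf(" 7B @ 8-bit: %.1f GB\n", weight_gb(7e9, 8));  /* ~7.0 GB */
    return 0;
}
```

So the two configurations land within about half a gigabyte of each other, which is why the perplexity comparison above is roughly apples-to-apples.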
Another factor is that larger models degrade less when quantized.
You have to wonder if running a huge model, say, 300B parameters at 2-bit quantization might be "optimal" in that it would fit into a single A100 or H100 GPU and likely outperform an 80B parameter 8-bit model...
Not sure here. The LLaMA models, yes - all their weights fit in the small range between -2.0 and 2.0.
But some other models have wilder distributions with even crazier outliers, like a weight of 12.00 sitting in a long array of typical small numbers around 0.00.
I've read about an attempt to quantize an RWKV model to 4/5 bits that fell short due to the presence of outlier weights.
The author mentioned somewhere that the bigger models had worse perplexity because of this.
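To see why a single outlier is so damaging, here's a minimal sketch using naive per-tensor absmax quantization to signed 4 bits (real quantizers typically use per-block scales, which softens but doesn't eliminate the effect):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Mostly small weights plus one large outlier. */
    float w[6] = {0.01f, -0.02f, 0.015f, -0.01f, 0.02f, 12.0f};

    /* Naive per-tensor absmax scaling into the signed 4-bit range [-7, 7]. */
    float absmax = 0.0f;
    for (int i = 0; i < 6; i++)
        if (fabsf(w[i]) > absmax) absmax = fabsf(w[i]);
    float scale = absmax / 7.0f; /* ~1.71 here */

    for (int i = 0; i < 6; i++) {
        int q = (int)roundf(w[i] / scale); /* quantized value    */
        float back = q * scale;            /* dequantized weight */
        printf("w=% .3f  q=%2d  back=% .3f\n", w[i], q, back);
    }
    /* Every small weight quantizes to 0; only the outlier survives. */
    return 0;
}
```

Because the 12.0 outlier sets the scale, all the typical ~0.01 weights collapse to zero, which is exactly the failure mode described above.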
There's also research showing that the perplexity hit from quantization shrinks at higher parameter counts. E.g. a 65B parameter model barely shows any impact at all when going from 16bit to 4bit.
Well, if you have a fixed RAM size, you're better off with the largest model you can fit at 4 bits (13B at 4bit is way better than 7B at 16bit despite taking less than half the memory).
No it isn't, quantization is not free. You lose a significant amount of performance that you are not measuring properly in automated benchmarks when you quantize to that level.
You can see it in real time when you compare most LLMs at different quantization levels. I can see the degradation quite badly even at 8 bits, even in the largest LLaMA.
Quantization is not free, but VRAM is even less free.
If you have X amount of VRAM and can fit a 16bit model of size 2X in 8bit or a model of size 4X in 4bit then the 4X model in 4bit is ALWAYS superior with lower perplexity and better performance.
You LOSE performance by using a smaller model in 8bit vs a larger model in 4bit.
Can somebody please explain how quantization below 8 bit works? Since a byte is the smallest addressable unit I think, is the dimensionality of the weights somehow reduced?
[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.
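A minimal sketch of that idea in C (symmetric quantization with a single shared scale is assumed here purely for illustration; real implementations use per-block scales and packed storage, as discussed below):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float w[4] = {0.8f, -0.3f, 0.1f, -1.2f}; /* original float weights */
    float x[4] = {1.0f,  2.0f, 0.5f, -1.0f}; /* activations            */

    /* One shared scale so the largest |w| maps to 7 (signed 4-bit range). */
    float scale = 1.2f / 7.0f;

    /* Quantize: store only small integers plus the one float scale. */
    int q[4];
    for (int i = 0; i < 4; i++)
        q[i] = (int)roundf(w[i] / scale);

    /* Multiply: the scale factors out of the dot product,
       so it is applied once after the accumulation. */
    float acc = 0.0f;
    for (int i = 0; i < 4; i++)
        acc += (float)q[i] * x[i];
    float y = scale * acc;

    /* Reference result with the original float weights. */
    float y_ref = 0.0f;
    for (int i = 0; i < 4; i++)
        y_ref += w[i] * x[i];

    printf("quantized: %f   reference: %f\n", y, y_ref);
    return 0;
}
```

The point is that only small integers plus a scale need to be stored, and the scale is applied once at the end rather than per weight.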
Software can address units of any size, by packing and unpacking bits from bytes (or more likely words) in the underlying implementation. I don’t know about any specific NN implementation here, just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.
Generally, since memory is byte-addressable, you load data that is packed into bytes. It's the compute instructions that operate on the specific bit widths needed.
So in this case you would load a byte holding two 4-bit values, and then a 4-bit ADD or MAC would operate on them.
If you don't have those instructions, you need to sign/zero-extend or convert the smaller bit widths to 8/16/32 bits, whichever is available.
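For the storage side, here's a minimal sketch of packing two signed 4-bit values into one byte and recovering them with sign extension (the general idea only; actual kernel layouts differ):

```c
#include <stdint.h>
#include <stdio.h>

/* Pack two signed 4-bit values (each in [-8, 7]) into one byte. */
static uint8_t pack4(int8_t lo, int8_t hi) {
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Sign-extend a 4-bit two's-complement nibble (0..15) back to -8..7. */
static int8_t sext4(uint8_t nibble) {
    return (int8_t)((nibble ^ 0x08) - 0x08);
}

int main(void) {
    uint8_t b = pack4(-3, 7);
    int8_t lo = sext4(b & 0x0F);
    int8_t hi = sext4(b >> 4);
    printf("packed=0x%02X  lo=%d  hi=%d\n", b, lo, hi); /* lo=-3, hi=7 */
    return 0;
}
```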
> llama.cpp which runs a 13 billion parameter model on a 6GB GPU
I think that's a typo there too: the 13B model needs something like 10G of memory at 4 bits; it's the 7B one that fits into 6G. Well, unless you do the split thing with some layers on the CPU, I guess.