> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.
I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16-bit and 8-bit:
https://textsynth.com/technology.html
8-bit uses half as much memory and doubles the throughput for limited quality loss.
It's true if you're doing training, but for inference aggressive quantization is mostly okay. Even then, there are some internal parts of a transformer running inference with a quantized model where you want the low-bit inputs promoted to 16 bits for the actual arithmetic, e.g. the dot-product similarity between vectors.
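To make that concrete, here's a minimal NumPy sketch of the "low-bit storage, higher-precision arithmetic" idea (weight-only int8 with a per-row scale; purely an illustration, not how any particular GPU kernel is implemented):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization: int8 weights plus fp32 scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def matvec_int8_weights(q, scale, x):
    """Weight-only int8 matvec: storage is 8-bit, the accumulation is fp32."""
    return (q.astype(np.float32) * scale) @ x.astype(np.float32)

w = np.random.randn(16, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float16)
q, s = quantize_int8(w)
# Error vs. the full-precision matvec stays small for well-behaved weights.
print(np.abs(matvec_int8_weights(q, s, x) - w @ x.astype(np.float32)).max())
```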
Cool stuff. I looked at https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29 and I noticed that the FP8 support is only for the tensor cores and not the CUDA cores. Does that mean training with an H100 GPU in FP8 mode would use some software ecosystem that's not the vast existing CUDA one? Or am I just misunderstanding CUDA cores vs. tensor cores?
PS: as a joke, they should implement GPU fluint8 and get baked-in non-linearity for the activation function without even using a non-linear function: https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt half decent: The hidden power of imprecise lines" by suckerpinch)
The problem with 8-bit at the moment is the massive performance hit with bitsandbytes. Recent improvements in 4-bit inference mean that 8-bit is now the big laggard (although there's no reason not to expect this to be resolved).
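For context, this is roughly how the 8-bit (LLM.int8()) and 4-bit (NF4) paths get selected through transformers + bitsandbytes. A sketch only; the checkpoint name is a placeholder, and loading both copies at once may not fit on one card:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder checkpoint

# 8-bit (LLM.int8()) path
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit (NF4) path, with the matmuls computed in fp16
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
```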
AFAIK for over-parameterized models, performing quantization or any other form of compression won't reduce accuracy by much (don't quote me on this though).
Nonetheless people do tend to use 16-bit Hugging Face models, and if you do go to 8-bit and something comes out wrong, you're never quite sure whether it's the quantization or the model.
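One cheap way to disambiguate is to run the full-precision and quantized copies of the same checkpoint over the same text and compare perplexity. A rough sketch (the usage lines are hypothetical and assume models/tokenizer set up as in the snippet above, plus some `sample_text`):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text):
    """Perplexity of a causal LM on one piece of text."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

# Hypothetical usage: if the two numbers are close, the quant isn't your problem.
# ppl_fp16 = perplexity(model_fp16, tokenizer, sample_text)
# ppl_int8 = perplexity(model_8bit, tokenizer, sample_text)
```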
The article is right: 8-bit (and especially 4-bit) is atypical for deep learning models. How well it works depends heavily on the number of parameters (larger models can tolerate more quantization) and can even depend on specific training hyperparameters (mainly dropout and weight decay, which can induce sparsity).
Thing is, even when the impact from 4-bit is substantial, the larger parameter count it allows on the same hardware more than makes up for it. E.g. llama-30b is better at 4-bit than any derivative of llama-13b, no matter how fine-tuned or quantized.
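The back-of-the-envelope weight memory behind that tradeoff (weights only; ignores activations and KV cache, so treat it as a lower bound):

```python
def weight_gb(params_billion, bits):
    """Approximate GB needed just to hold the weights."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(13, 16))  # ~26 GB  llama-13b in fp16
print(weight_gb(30, 16))  # ~60 GB  llama-30b in fp16
print(weight_gb(30, 4))   # ~15 GB  llama-30b in 4-bit -> fits a 24 GB card
```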