> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.
I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16-bit and 8-bit:
https://textsynth.com/technology.html
8-bit uses half as much memory and doubles the throughput for limited quality loss.
It's true if you're doing training, but for inference aggressive quantization is mostly okay. Even then, there are some internal parts of a transformer running inference with a quantized model where you want the low-bit inputs promoted to 16 bits for the actual arithmetic, e.g. the dot-product similarity between vectors.
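To make that concrete, here's a minimal NumPy sketch of the "low-bit storage, higher-precision arithmetic" idea (weight-only int8 with a per-row scale; purely an illustration, not how any particular GPU kernel is implemented):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization: int8 weights plus fp32 scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def matvec_int8_weights(q, scale, x):
    """Weight-only int8 matvec: storage is 8-bit, the accumulation is fp32."""
    return (q.astype(np.float32) * scale) @ x.astype(np.float32)

w = np.random.randn(16, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float16)
q, s = quantize_int8(w)
# Error vs. the full-precision matvec stays small for well-behaved weights.
print(np.abs(matvec_int8_weights(q, s, x) - w @ x.astype(np.float32)).max())
```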
Cool stuff. I looked at https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29 and I noticed that the FP8 support is only for the tensor cores and not the CUDA cores. Does that mean training with an H100 GPU in FP8 mode would use some software ecosystem that's not the vast existing CUDA one? Or am I just misunderstanding CUDA cores vs. tensor cores?
PS: as a joke, they should implement GPU fluint8 and get baked-in non-linearity for the activation function without even using a non-linear function: https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt half decent: The hidden power of imprecise lines" by suckerpinch)
The problem with 8-bit at the moment is the massive performance hit with bitsandbytes. Recent improvements in 4-bit inference mean that 8-bit is now the big laggard (although there's no reason not to expect this to be resolved).
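For context, this is roughly how the 8-bit (LLM.int8()) and 4-bit (NF4) paths get selected through transformers + bitsandbytes. A sketch only; the checkpoint name is a placeholder, and loading both copies at once may not fit on one card:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder checkpoint

# 8-bit (LLM.int8()) path
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit (NF4) path, with the matmuls computed in fp16
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
```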
AFAIK for over-parameterized models, performing quantization or any other form of compression won't reduce accuracy by much (don't quote me on this though).
Nonetheless people do tend to use 16-bit Hugging Face models, and if you do go to 8-bit and something comes out wrong, you're never quite sure whether it's the quantization or the model.
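One cheap way to disambiguate is to run the full-precision and quantized copies of the same checkpoint over the same text and compare perplexity. A rough sketch (the usage lines are hypothetical and assume models/tokenizer set up as in the snippet above, plus some `sample_text`):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text):
    """Perplexity of a causal LM on one piece of text."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

# Hypothetical usage: if the two numbers are close, the quant isn't your problem.
# ppl_fp16 = perplexity(model_fp16, tokenizer, sample_text)
# ppl_int8 = perplexity(model_8bit, tokenizer, sample_text)
```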
The article is right: 8-bit (and especially 4-bit) is atypical for deep learning models. How well it works depends heavily on the number of parameters (larger models can tolerate more quantization) and can even depend on specific training hyperparameters (mainly dropout and weight decay, which can induce sparsity).
Thing is, even when the impact from 4-bit is substantial, the larger parameter count it allows on the same hardware more than makes up for it. E.g. llama-30b is better at 4-bit than any derivative of llama-13b, no matter how fine-tuned or quantized.
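The back-of-the-envelope weight memory behind that tradeoff (weights only; ignores activations and KV cache, so treat it as a lower bound):

```python
def weight_gb(params_billion, bits):
    """Approximate GB needed just to hold the weights."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(13, 16))  # ~26 GB  llama-13b in fp16
print(weight_gb(30, 16))  # ~60 GB  llama-30b in fp16
print(weight_gb(30, 4))   # ~15 GB  llama-30b in 4-bit -> fits a 24 GB card
```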