I'm not sure here. For the LLaMA models, yes: all weights fit within the small range of roughly -2.0 to 2.0.
Some other models have wilder distributions, with even wilder outliers inside them: you might find a weight of 12.00 sitting in a long array of typical small values around 0.00.
I've read a story about an attempt to quantize an RWKV model to 4/5 bits that fell short because of those outlier weights.
The author mentioned somewhere that the bigger models had worse perplexity because of this.
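Here's a minimal sketch of why one outlier hurts so much, assuming the common block-wise absmax scheme (not the actual RWKV or llama.cpp code; the weight values and the `quantize_absmax_4bit` helper are made up for illustration):

```python
import numpy as np

# Hypothetical illustration: 4-bit absmax quantization of one weight block.
# A single outlier (12.0) stretches the scale, so the typical small weights
# all collapse into the few quantization levels nearest zero.

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=64)  # typical small weights around 0.00
weights[17] = 12.0                        # one outlier in the block

def quantize_absmax_4bit(w):
    scale = np.abs(w).max() / 7.0          # map block to signed 4-bit range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7)
    return q * scale                       # dequantized values

dq = quantize_absmax_4bit(weights)
err = np.abs(weights - dq)
print("max error on the small weights:", err[np.arange(64) != 17].max())
# With the outlier present, scale = 12/7 ~ 1.71, so everything below
# ~0.86 in magnitude rounds straight to zero: the small weights, which
# carry most of the information, are wiped out.
```

Drop the outlier and the scale shrinks by two orders of magnitude, so the same 15 levels cover the small weights just fine. That's why a handful of outliers can sink a whole quantization attempt.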