Beautiful idea: a number representation where the distribution of accuracy aligns with the distribution of typical use.
>Real numbers can’t be perfectly represented in hardware simply because there are infinitely many of them. To fit into a designated number of bits, many real numbers have to be rounded. The advantage of posits comes from the way the numbers they represent exactly are distributed along the number line. In the middle of the number line, around 1 and -1, there are more posit representations than floating point. And at the wings, going out to large negative and positive numbers, posit accuracy falls off more gracefully than floating point.
>“It’s a better match for the natural distribution of numbers in a calculation,” says Gustafson. “It’s the right dynamic range, and it’s the right accuracy where you need more accuracy. There’s an awful lot of bit patterns in floating-point arithmetic no one ever uses. And that’s waste.”
Ah. So it's because the machine learning people mostly work in the -1 .. 1 range, but it's possible for their numbers to go outside that range. So they need a compact representation which uses most of the possible values for the range of interest. If you're down to 8-bit numbers, something like this makes sense for custom machine learning hardware.
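The tapered-accuracy claim is easy to see with a toy decoder. This is a sketch for an 8-bit posit with es = 0 (a simplified teaching configuration; the 2022 posit standard fixes es = 2), not a production implementation:

```python
def decode_posit8(bits, es=0):
    """Decode an 8-bit posit pattern (with es exponent bits) to a float."""
    n = 8
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")               # NaR ("not a real")
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & ((1 << n) - 1)   # two's complement for negatives
    # Regime: run of identical bits after the sign bit.
    first = (bits >> (n - 2)) & 1
    m, i = 0, n - 2
    while i >= 0 and ((bits >> i) & 1) == first:
        m, i = m + 1, i - 1
    k = m - 1 if first else -m
    i -= 1                                # skip the regime terminator bit
    # Then es exponent bits; whatever is left is the fraction.
    exp = 0
    for _ in range(es):
        exp = (exp << 1) | (((bits >> i) & 1) if i >= 0 else 0)
        i -= 1
    frac_bits = max(i + 1, 0)
    frac = bits & ((1 << frac_bits) - 1)
    value = (1 + frac / (1 << frac_bits)) * 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value

# A quarter of all 256 bit patterns land in [0.5, 2):
# dense near 1, sparse out at the extremes.
near_one = sum(1 for b in range(256) if 0.5 <= decode_posit8(b) < 2.0)
```

Short regimes leave more bits for the fraction, which is why accuracy peaks around ±1 and tapers off toward the extremes.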
I wonder if it has graphics applications for high dynamic range images. Probably not.
The machine learning space probably has the easiest time adopting a new number format. The players are big enough to define their own silicon, and training and inference don't even need the same number format. As long as it works on their new specialized training hardware, it's good to go.
The real challenge is probably power efficiency. As far as I know, power consumption is currently the biggest cost factor when training new models. Everything else seems like a minor trade-off, but once you account for power consumption there is real money involved.
Hardware is not cheap either at this scale. Really rough math follows; please, anyone, correct me if I err. A100 cards are >$10,000 each. A node is typically 8 GPUs, and a cluster might be 64 A100s. So about a million in hardware for the medium cluster of 64 A100s. (Cross-checking: a machine from Lambda Labs is apparently ~$175k for 8 GPUs, so ~$1.4M for the cluster above.)
Those A100s use 300 watts each, plus let's say 500 watts for the rest of each server: (8 GPUs × 300 W + 500 W) × 8 servers ≈ 25 kW. 25 kW × 8760 hours ≈ 219k kWh/year. So if your costs are $0.15/kWh, that's only on the order of $32k/year.
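The back-of-the-envelope math above, as a sketch (the 25 kW figure rounds up from the raw 23.2 kW; all prices are the rough assumptions from the comment, not quotes):

```python
gpu_cost = 10_000              # USD per A100, lower bound from the comment
gpus_per_node, nodes = 8, 8
gpu_w, overhead_w = 300, 500   # watts; overhead per node for CPU, RAM, fans

hardware = gpu_cost * gpus_per_node * nodes                     # $640k at the low end
draw_kw = nodes * (gpus_per_node * gpu_w + overhead_w) / 1000   # 23.2 kW raw
kwh_per_year = draw_kw * 8760                                   # ~203k kWh
power_cost = kwh_per_year * 0.15                                # ~$30k/yr at $0.15/kWh
```

The hardware bill dwarfs the electricity bill at these rates, which is the point being argued.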
I see your numbers for the A100. However, consider the 3090, which is down to $1,000 and uses 350 W. Double that if you count cooling and overhead in a data center. In Europe electricity prices are through the roof and easily exceed $0.50/kWh. With those numbers, a single card would cost you 0.35 kW × 2 × $0.50/kWh × 24 h × 365 ≈ $3,066 per year. So I guess it highly depends on your specific location and circumstances.
It’s still more complicated, because you have to amortize costs over several years, and keep in mind that big players will design their own hardware to avoid the A100’s cost and energy draw. Disks and other devices also consume power. And your datacenter cooling is now more expensive, etc.
Yes, it’s complicated, for sure. But if we amortize over 3 years and triple the power costs, it’s ~$500k/yr for hardware and ~$100k/yr for power.
As for TPUs and other custom accelerators: sure, they exist. However, most players definitely aren’t building their own hardware.
ETA: I’m not saying power is irrelevant, it clearly matters. But saying it’s the dominant financial constraint is clearly wrong, at least below Google/Amazon/Apple scale. Never mind the cost of the people running these trainings!
It might. The whole reason for gamma correction is that human brightness perception is roughly logarithmic. By having more resolution at lower values, posits could correct for the reason we need gamma in the first place, though maybe not as well as gamma itself.
But if it mitigates error accumulation in perceptually-friendly ways, it might still have value in calculations (once there’s native support).
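For comparison, here is the standard sRGB transfer function, sketched to show how gamma already concentrates code values in the dark range (the 0.18 "mid-gray" constant is the usual photographic convention, not something from the thread):

```python
def srgb_encode(linear):
    # sRGB OETF: spends disproportionately many code values on dark tones,
    # roughly matching logarithmic brightness perception
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055

# Of the 256 8-bit codes, how many sit below photographic mid-gray (18% linear)?
gamma_codes = round(srgb_encode(0.18) * 255)   # nearly half of the codes
linear_codes = round(0.18 * 255)               # far fewer if codes were linear
```

A posit encoding of linear light would have to match this kind of allocation to replace gamma outright; it more plausibly helps with error accumulation in intermediate math.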
I've been studying color theory (its harder than you would think) on my spare time; and I did my undergraduate thesis was on posit few years ago. I think using posits may be a good idea to compress the luminosity, but I'm not sure how good it would be for the ab on a Lab-like format.
A format close to the cone fundamentals, like XYZ, could have benefits being encoded in some kind of non-standart posit.
I'll add this into the stack of things I eventually I'll look into.
1. IEEE 754 floats are already nonlinear. The precision is highest around zero.
2. Bitmap images are not using floating-point values, except in some super niche use cases like GIS data, so “posits” are irrelevant for the use case you’ve posited. (ba-dum-ts)
3. Non-linear gamma is completely unnecessary for bit depths >= 16.
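Point 1 is easy to check with `math.ulp`, which returns the gap between a float and the next representable one:

```python
import math

# Absolute spacing between adjacent doubles grows with magnitude:
tiny = math.ulp(1e-10)   # far finer near zero
one = math.ulp(1.0)      # 2**-52
huge = math.ulp(1e10)    # far coarser out in the wings
```

The *relative* spacing of floats stays roughly constant across the range, whereas posits trade that away: best relative accuracy near ±1, tapering off toward the extremes.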
Floating-point bitmaps are a lot more common than you think. Most consumers never see them, but their software uses floats internally. Floating-point bitmaps are standard practice in visual effects and animation; the OpenEXR format exists for exchanging images with floating-point bit depths up to 32.
That result happens for the same reason that 1 / 3 * 3 != 1 in decimal: 1 / 3 = 0.333, and 0.333 * 3 = 0.999, which is different from 1.000.
0.1 is the same as 1 / 10, which does not have a finite representation in binary notation, just as 1 / 3 does not have a finite representation in binary or decimal notation.
This is a problem for all number systems. The real issue here is not the precision of the underlying IEEE-754 bit representation; it's that 0.1 and 0.2 aren't actually valid IEEE-754 numbers, so they get rounded to their best approximations.
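You can see that rounding directly with the standard library: `Decimal(float)` prints the exact binary64 value the literal became.

```python
from decimal import Decimal

# The literal 0.1 is stored as the nearest binary64 value:
stored = Decimal(0.1)
print(stored)            # 0.1000000000000000055511151231257827021181583404541015625

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False: both operands were already rounded
```

So the "error" happens at parse time, before any arithmetic runs.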
>Real numbers can’t be perfectly represented in hardware simply because there are infinitely many of them.
There are also infinitely many integers, but we can represent them just fine inside a finite bound. The problem with reals (and rationals to a lesser extent) is that, within any range we are interested in, a single real number can be infinitely 'long'.
I like this idea of choosing a representation for numbers based on the use case. Like, "ints" and "floats" but even more specialized.
For example, what would be the most efficient binary representation of probability values between 0 and 1? In the future, I can imagine hardware + software specialized for cases like this.
I feel like probabilities are best stored via the inverse logistic function (the logit), because the closer a probability is to 0 or 1, the less you care about significant digits; the event is going (not) to happen anyway. (That's why I'm a proponent of LNS.)
I feel it's actually the opposite. There is a world of difference between 0.0...01 and 0.0...001, where ... represents the same number of 0s in both cases. The same applies at the other end.
edit: Do I really want more resolution between 0.41 and 0.42 than between 0.01 and 0.02?
You're dead on. Metals are sold by how fine they are: .999, .9995 (which should end in a 7, not a 5; the whole industry is fucking that up, but at any rate), .9999. More nines, more purity, better experimental results. When they talk about 4 nines or 5 nines, the actual measurement is f(x) = -log(1 - x). That is the purity: the negative log of the complement. And that's why 4.5 nines should be .99997, not .99995 (10^-4.5 ≈ 3.16e-5, so it's almost exactly .99997), so stupid. Mining engineers, come on! Such impressive degrees! PhDs all over the place! Professors even! I'm sorry, don't mind me with my math.
Some early experiments, like the first liquid-crystal work around the year 1900, required purities so absurd that the chemical company (Merck, I think) complained to the inventor of liquid crystals as if they felt insulted by the request. But look at them go! Right in front of your very eyes!
The logit becomes denser close to 0 and 1. Which makes sense: you will want to tell apart 1 defect per million from 0.01 DPM, and a 5-sigma process (99.977%) from 6-sigma (99.99966%), much more than you want to tell apart 30% from 30.001%.
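That density shows up in the logit's slope, 1/(p(1-p)): equal-sized logit steps correspond to ever-smaller probability steps near the ends. A quick sketch:

```python
import math

def logit(p):
    # log-odds; stretches the tails so nearby extreme
    # probabilities stay distinguishable
    return math.log(p / (1.0 - p))

# Near 0.5, a big chunk of probability maps to a tiny logit step...
mid_gap = logit(0.5001) - logit(0.5)        # ~0.0004
# ...while each extra "nine" of reliability costs a full ~2.3 logits
nines_gap = logit(0.99999) - logit(0.9999)  # ~2.3 (= ln 10)
```

So storing probabilities in logit (or log-odds) space spends representation budget exactly where the "count the nines" crowd wants it.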
Real numbers can't be represented in hardware because most of them are uncomputable. It's not just that there are too many of them; almost every real number cannot be produced as the output of any program.