> We used Trillium TPUs to train the new Gemini 2.0,
Wow. I knew custom Google silicon was used for inference, but I didn't realize it was used for training too. Does this mean Google is free of dependence on Nvidia GPUs? That would be a huge advantage over AI competitors.
Google's TPUs have been used for training for at least 5 years, probably more (I think it's 10). They do not depend on Nvidia GPUs for the majority of their projects. It took TPUs a while to catch up on some details, like sparsity.
This aligns with my knowledge. I don't know much about LLMs, but TPUs have been used for training deep prediction models in ads since at least 2018, though there were some gaps filled by CPUs/GPUs for a while. Nowadays, TPU capacity is probably greater than the CPU and GPU capacity combined.
At some point some researchers were begging for GPUs, mainly for sparse work. I think that's why SparseCore was added to TPUs (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...) in v4. I think at this point, with their tech turnaround time, they can catch up as competitors add new features and researchers want to use them.
Mostly embeddings, but IIRC DeepMind RL made use of sparsity: basically, huge matrices with only a few non-zero elements.
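For the curious: "huge matrices with only a few non-zero elements" are typically stored in a coordinate format so you only pay (in memory and FLOPs) for the non-zeros. A minimal plain-Python sketch of the idea (illustrative only, nothing TPU-specific):

```python
# Minimal coordinate-format (COO) sparse matrix-vector product.
# Only the non-zero entries of a (possibly huge) matrix are stored.

def sparse_matvec(shape, entries, v):
    """entries: list of (row, col, value) for the non-zero elements."""
    rows, _ = shape
    out = [0.0] * rows
    for r, c, val in entries:
        out[r] += val * v[c]
    return out

# A 4x4 matrix with only 3 non-zero elements:
entries = [(0, 1, 2.0), (2, 3, -1.0), (3, 0, 5.0)]
v = [1.0, 1.0, 1.0, 1.0]
print(sparse_matvec((4, 4), entries, v))  # dense storage would need 16 slots
```

The hardware-accelerated versions (SparseCore, GPU sparse kernels) implement the same access pattern, which is why irregular gather/scatter support matters so much for this workload.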
BarnaCore existed and was used, but was tailored mostly for embeddings. BTW, IIRC they were called that because they were added "like a barnacle hanging off the side".
The evolution of the TPU has been interesting to watch. I came from the HPC and supercomputing space, and watching Google stay mostly-CPU for the longest time, and then finally learn how to build "supercomputers" over a decade+ (gradually adding many features that classical supercomputers have long had), was a very interesting process. Some very expensive mistakes along the way. But now they've paid down almost all the expensive up-front costs and can ride on the margins, adding new bits and pieces while increasing clocks and capacities on a cadence.
Not exactly, although CUDA is a huge topic. See https://gist.github.com/shawwn/0e524d4a7a5d8fb152a86616559cc... for some description of the process. Basically, a JAX (or other) program is converted to XLA (https://opensource.googleblog.com/2023/05/pjrt-simplifying-m...), then lowered to the specific architecture (which could be CPU, GPU, or TPU). Last time I looked, it was a horribly complicated stack with many parts changing rapidly, although with the switch to JAX things got cleaned up a bit. My personal favorite bits are the lower levels of JAX, XLA, and PJRT.
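You can actually peek at the first hop of that stack from user code. A minimal sketch (assuming a reasonably recent JAX install) that jits a function, dumps the lowered StableHLO, and then compiles it for whatever backend PJRT finds:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * 2.0

# jit + lower produces the XLA/StableHLO representation of f;
# PJRT then compiles it for whichever backend is present (CPU/GPU/TPU).
lowered = jax.jit(f).lower(jnp.ones((4,)))
print(lowered.as_text())          # the lowered module as text
compiled = lowered.compile()      # backend-specific executable via PJRT
print(compiled(jnp.ones((4,))))
```

Same user program, three possible targets; the architecture-specific part only shows up at the `compile()` step.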
Is sparsity used for training though? I thought it was an inference-only thing for both GPUs and TPUs. Furthermore, I think TPUs have not been able to support training models like GANs, etc. I'm not sure if that has changed now.
My understanding is that the Trillium TPU was primarily targeted at inference (so it’s surprising to see it was used to train Gemini 2.0) but other generations of TPUs have targeted training. For example the chip prior to this one is called TPU v5p and was targeted toward training.
That is one factor, but another is total cost of ownership. At large scales something that's 1/2 the overall speed but 1/3rd the total cost is still a net win by a large margin. This is one of the reasons why every major hyperscaler is, to some extent, developing their own hardware e.g. Meta, who famously have an insane amount of Nvidia GPUs.
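The back-of-the-envelope math here: cost per unit of work is total cost divided by throughput, so at 1/2 the speed and 1/3 the cost you land at (1/3)/(1/2) = 2/3 of the GPU baseline. A quick sketch (the numbers are the hypothetical ones from the comment, not real TPU/GPU figures):

```python
# Cost per unit of work = total cost / throughput.
# Hypothetical numbers, normalized to a GPU baseline of 1.0.
gpu_speed, gpu_cost = 1.0, 1.0
tpu_speed, tpu_cost = 0.5, 1.0 / 3.0   # half the speed, a third of the cost

gpu_cost_per_work = gpu_cost / gpu_speed   # 1.0
tpu_cost_per_work = tpu_cost / tpu_speed   # ~0.667

print(f"same work at {tpu_cost_per_work / gpu_cost_per_work:.0%} of GPU cost")
```

So even a slower chip wins on TCO as long as the cost advantage outpaces the speed deficit, which is the whole hyperscaler-silicon thesis in one line.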
Of course this does not mean their models will necessarily be proportionally better, nor does it mean Google won't buy GPUs for other reasons (like providing them to customers on Google Cloud).
+1 on this. The tooling to use TPUs still needs more work. But we are betting on building this tooling and unlocking these ASIC chips (https://github.com/felafax/felafax).
The A100 (from 2020) had 300 TFLOPs (1.5x) and 80GB of HBM. It's only now, with Trillium, that they are starting to actually beat Nvidia in a chip-vs-chip battle, but being faster per chip was never the point. TPUs were supposed to be mass-produced and connected into pods so that they become cheaper, and they indeed are.
[Google employee] Yes, you can use TPUs in Compute Engine and GKE, among other places, for whatever you'd like. I just checked and the v6 are available.
It's in the article: "When training the Llama-2-70B model, our tests demonstrate that Trillium achieves near-linear scaling from a 4-slice Trillium-256 chip pod to a 36-slice Trillium-256 chip pod at a 99% scaling efficiency."
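For reference, "99% scaling efficiency" here means throughput grew to almost exactly 9x when going from 4 slices to 36. A sketch of the usual scaling-efficiency calculation (the throughput numbers below are made up to match the 99% figure):

```python
# Scaling efficiency: observed speedup divided by ideal (linear) speedup.
def scaling_efficiency(base_slices, base_tput, big_slices, big_tput):
    ideal_speedup = big_slices / base_slices
    observed_speedup = big_tput / base_tput
    return observed_speedup / ideal_speedup

# Hypothetical throughputs for a 4-slice vs 36-slice run:
eff = scaling_efficiency(4, 1000.0, 36, 8910.0)
print(f"{eff:.0%}")  # 8.91x observed vs 9x ideal -> 99%
```

Near-linear scaling at this size is the notable claim, since interconnect overhead usually eats more than 1% at 9x the chip count.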
I'm pretty sure they're doing fine-tune training, using Llama because it is a widely known and available sample. They used SDXL elsewhere for the same reason.
Llama 2 was released well over a year ago and was a training collaboration between Meta and Microsoft.
Llama 2 end weights are public. The data used to train it, or even the process used to train it, are not. Google can't just train another Llama 2 from scratch.
They could train something similar, but it'd be super weird if they called it Llama 2. They could call it something like "Gemini", or if it's open weights, "Gemma".