> We used Trillium TPUs to train the new Gemini 2.0,
Wow. I knew custom Google silicon was used for inference, but I didn't realize it was used for training too. Does this mean Google is free of dependence on Nvidia GPUs? That would be a huge advantage over AI competitors.
Google's TPUs have been used for training for at least 5 years, probably more (I think it's 10). They do not depend on Nvidia GPUs for the majority of their projects. It took TPUs a while to catch up on some details, like sparsity.
This aligns with my knowledge. I don't know much about LLMs, but TPUs have been used for training deep prediction models in ads since at least 2018, though there were some gaps filled by CPUs/GPUs for a while. Nowadays, TPU capacity is probably greater than the CPU and GPU capacity combined.
At some point some researchers were begging for GPUs, mainly for sparse work. I think that's why SparseCore was added to TPUs (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...) in v4. I think at this point, with their tech turnaround time, they can catch up as competitors add new features and researchers want to use them.
Mostly embeddings, but IIRC DeepMind RL made use of sparsity: basically, huge matrices with only a few non-zero elements.
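For the curious: "huge matrices with only a few non-zero elements" are typically stored in a coordinate format so you only pay (in memory and FLOPs) for the non-zeros. A minimal plain-Python sketch of the idea (illustrative only, nothing TPU-specific):

```python
# Minimal coordinate-format (COO) sparse matrix-vector product.
# Only the non-zero entries of a (possibly huge) matrix are stored.

def sparse_matvec(shape, entries, v):
    """entries: list of (row, col, value) for the non-zero elements."""
    rows, _ = shape
    out = [0.0] * rows
    for r, c, val in entries:
        out[r] += val * v[c]
    return out

# A 4x4 matrix with only 3 non-zero elements:
entries = [(0, 1, 2.0), (2, 3, -1.0), (3, 0, 5.0)]
v = [1.0, 1.0, 1.0, 1.0]
print(sparse_matvec((4, 4), entries, v))  # dense storage would need 16 slots
```

The hardware-accelerated versions (SparseCore, GPU sparse kernels) implement the same access pattern, which is why irregular gather/scatter support matters so much for this workload.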
BarnaCore existed and was used, but was tailored mostly for embeddings. BTW, IIRC they were called that because they were added "like a barnacle hanging off the side".
The evolution of the TPU has been interesting to watch. I came from the HPC and supercomputing space, and watching Google stay mostly-CPU for the longest time, and then finally learn how to build "supercomputers" over a decade+ (gradually adding many features that classical supercomputers have long had), was a very interesting process. Some very expensive mistakes along the way. But now they've paid down almost all the expensive up-front costs and can ride on the margins, adding new bits and pieces while increasing clocks and capacities on a cadence.
Not exactly, although CUDA is a huge topic. See https://gist.github.com/shawwn/0e524d4a7a5d8fb152a86616559cc... for some description of the process. Basically, a JAX (or other) program is converted to XLA (https://opensource.googleblog.com/2023/05/pjrt-simplifying-m...), then lowered to the specific architecture (which could be CPU, GPU, or TPU). Last time I looked, it was a horribly complicated stack with many parts changing rapidly, although with the switch to JAX things got cleaned up a bit. My personal favorite bits are the lower levels of JAX, XLA, and PJRT.
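You can actually peek at the first hop of that stack from user code. A minimal sketch (assuming a reasonably recent JAX install) that jits a function, dumps the lowered StableHLO, and then compiles it for whatever backend PJRT finds:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * 2.0

# jit + lower produces the XLA/StableHLO representation of f;
# PJRT then compiles it for whichever backend is present (CPU/GPU/TPU).
lowered = jax.jit(f).lower(jnp.ones((4,)))
print(lowered.as_text())          # the lowered module as text
compiled = lowered.compile()      # backend-specific executable via PJRT
print(compiled(jnp.ones((4,))))
```

Same user program, three possible targets; the architecture-specific part only shows up at the `compile()` step.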
Is sparsity used for training though? I thought it was an inference-only thing for both GPUs and TPUs. Furthermore, I think TPUs have not been able to support training models like GANs, etc. I'm not sure if that has changed now.
My understanding is that the Trillium TPU was primarily targeted at inference (so it’s surprising to see it was used to train Gemini 2.0) but other generations of TPUs have targeted training. For example the chip prior to this one is called TPU v5p and was targeted toward training.
That is one factor, but another is total cost of ownership. At large scales something that's 1/2 the overall speed but 1/3rd the total cost is still a net win by a large margin. This is one of the reasons why every major hyperscaler is, to some extent, developing their own hardware e.g. Meta, who famously have an insane amount of Nvidia GPUs.
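The back-of-the-envelope math here: cost per unit of work is total cost divided by throughput, so at 1/2 the speed and 1/3 the cost you land at (1/3)/(1/2) = 2/3 of the GPU baseline. A quick sketch (the numbers are the hypothetical ones from the comment, not real TPU/GPU figures):

```python
# Cost per unit of work = total cost / throughput.
# Hypothetical numbers, normalized to a GPU baseline of 1.0.
gpu_speed, gpu_cost = 1.0, 1.0
tpu_speed, tpu_cost = 0.5, 1.0 / 3.0   # half the speed, a third of the cost

gpu_cost_per_work = gpu_cost / gpu_speed   # 1.0
tpu_cost_per_work = tpu_cost / tpu_speed   # ~0.667

print(f"same work at {tpu_cost_per_work / gpu_cost_per_work:.0%} of GPU cost")
```

So even a slower chip wins on TCO as long as the cost advantage outpaces the speed deficit, which is the whole hyperscaler-silicon thesis in one line.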
Of course this does not mean their models will necessarily be proportionally better, nor does it mean Google won't buy GPUs for other reasons (like providing them to customers on Google Cloud).
+1 on this. The tooling to use TPUs still needs more work. But we are betting on building this tooling and unlocking these ASIC chips (https://github.com/felafax/felafax).
The A100 (from 2020) had 300 TFLOPs (1.5x) and 80GB of HBM. It's only now, with Trillium, that they are starting to actually beat Nvidia in a chip-vs-chip battle, but being faster per chip was never the point. TPUs were supposed to be mass-produced and connected into pods so that they become cheaper, and they indeed are.
[Google employee] Yes, you can use TPUs in Compute Engine and GKE, among other places, for whatever you'd like. I just checked and the v6 are available.
It's in the article: "When training the Llama-2-70B model, our tests demonstrate that Trillium achieves near-linear scaling from a 4-slice Trillium-256 chip pod to a 36-slice Trillium-256 chip pod at a 99% scaling efficiency."
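For reference, "99% scaling efficiency" here means throughput grew to almost exactly 9x when going from 4 slices to 36. A sketch of the usual scaling-efficiency calculation (the throughput numbers below are made up to match the 99% figure):

```python
# Scaling efficiency: observed speedup divided by ideal (linear) speedup.
def scaling_efficiency(base_slices, base_tput, big_slices, big_tput):
    ideal_speedup = big_slices / base_slices
    observed_speedup = big_tput / base_tput
    return observed_speedup / ideal_speedup

# Hypothetical throughputs for a 4-slice vs 36-slice run:
eff = scaling_efficiency(4, 1000.0, 36, 8910.0)
print(f"{eff:.0%}")  # 8.91x observed vs 9x ideal -> 99%
```

Near-linear scaling at this size is the notable claim, since interconnect overhead usually eats more than 1% at 9x the chip count.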
I'm pretty sure they're doing fine-tune training, using Llama because it is a widely known and available sample. They used SDXL elsewhere for the same reason.
Llama 2 was released well over a year ago and was a training collaboration between Meta and Microsoft.
Llama 2 end weights are public. The data used to train it, or even the process used to train it, are not. Google can't just train another Llama 2 from scratch.
They could train something similar, but it'd be super weird if they called it Llama 2. They could call it something like "Gemini", or if it's open weights, "Gemma".