The basic MLP block in this model uses a ReLU^2 activation function (x <- ReLU(x...
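
For anyone unfamiliar with the activation being discussed, here is a minimal sketch in plain Python. It assumes the truncated formula above is the usual definition, ReLU(x)^2 = max(x, 0)^2 applied elementwise. The `mlp_block` helper, its bias-free layout, and the toy weights are hypothetical illustrations, not this model's actual code.

```python
def relu_squared(x):
    """ReLU^2 for one scalar: clamp negatives to zero, then square."""
    r = max(x, 0.0)
    return r * r


def mlp_block(x, w_in, w_out):
    """Hypothetical bias-free MLP block: w_out @ relu_squared(w_in @ x).

    x is a list of floats; w_in and w_out are lists of rows (lists of floats).
    """
    # Up-projection followed by the elementwise ReLU^2 activation.
    hidden = [relu_squared(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    # Down-projection back to the output width.
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


if __name__ == "__main__":
    x = [1.0, -2.0]
    w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy weights, assumed for illustration
    w_out = [[1.0, 1.0, 1.0]]
    print(mlp_block(x, w_in, w_out))
```

Compared with plain ReLU, the only extra work is one multiply per hidden unit, which is why the cost difference is negligible.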

throwaway2027 · 2025-11-15T02:06:14 1763172374

Isn't it because ReLU is cheap and ^2 is squared loss?

kouteiheika · 2025-11-15T09:35:17 1763199317

When it comes to compute cost, the choice of activation function makes little difference nowadays (it can often be fused with whatever operation comes before it, which makes it effectively free).

The real reason is simple: it was inherited.

ReLU^2 was used in the nanogpt speedrun[1] because it produced the best empirical results. Andrej then based his nanochat on the nanogpt speedrun without changing the activation function, and this project was in turn based on nanochat.

[1] -- https://github.com/KellerJordan/modded-nanogpt

macleginn · 2025-11-15T11:01:35 1763204495

There has been some experimentation with the use of ReLU^2 in language models in recent years, e.g., here: https://proceedings.neurips.cc/paper_files/paper/2021/file/2...