A layer is a transformer block / layer (basically the building block of the mode...

krackers · 2025-05-20T19:40:04 1747770004

I am perfectly aware of that. I don't believe other LLMs have such per-layer embeddings, only the usual weights, so these per-layer embeddings seem to be distinguished from weights in some way. AFAIK, playing the same "cache in fast storage and load on demand" trick wouldn't work with layer weights, since you'd end up with too much back-and-forth (you'd touch every cached byte on each token, assuming no MoE), so I'm guessing these embeddings are structured in a way that's broken up by concept.
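
To put rough numbers on the "touch every cached byte" point, here is a back-of-the-envelope sketch. It assumes the per-layer embeddings behave like an ordinary lookup table indexed by token id, which is a guess and not something confirmed here. All sizes are made up for illustration and are not taken from any real model.

```python
# Rough comparison of how much of a cached tensor one token actually reads.
# All sizes below are hypothetical, chosen only to illustrate the ratio.

BYTES_PER_PARAM = 2  # assuming 16-bit parameters


def dense_bytes_touched(rows: int, cols: int) -> int:
    """A dense (non-MoE) matmul reads every weight for every token."""
    return rows * cols * BYTES_PER_PARAM


def embedding_bytes_touched(dim: int, tokens: int = 1) -> int:
    """A table lookup reads one row per token, independent of table size."""
    return tokens * dim * BYTES_PER_PARAM


# Hypothetical dense layer weight matrix.
ROWS, COLS = 2048, 8192
dense_total = ROWS * COLS * BYTES_PER_PARAM
dense_fraction = dense_bytes_touched(ROWS, COLS) / dense_total

# Hypothetical embedding table (assumed lookup-by-token-id structure).
VOCAB, DIM = 262144, 256
emb_total = VOCAB * DIM * BYTES_PER_PARAM
emb_fraction = embedding_bytes_touched(DIM) / emb_total

print(f"dense weights read per token:   {dense_fraction:.0%} of the matrix")
print(f"embedding rows read per token:  {emb_fraction:.6%} of the table")
```

Under those assumptions the dense matrix has to be streamed in full for each token, while a lookup only needs one row, which is why the second case tolerates slow storage and the first does not.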

QuadmasterXLII · 2025-05-21T22:00:36 1747864836