> what did i get wrong here? You don't know how an LLM works and you are operati...

Palmik · 2025-11-19T19:11:49 1763579509

It's a fair question, even if it might be coming from a place of misunderstanding.

For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster on long contexts than the regular 3.1, but also seems to be better (perhaps thanks to reducing the noise?).

[1] It still uses a quadratic router, but the router is small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929
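To put rough numbers on "quadratic but small", here is a back-of-the-envelope cost model. It is only a sketch: the function names and the sizes `D`, `D_ROUTER` and `K` are made up for illustration and are not DeepSeek's actual configuration. It assumes the router scores every query/key pair in a small dimension, and that full attention then runs only over the top `K` keys per query.

```python
def dense_cost(n: int, d: int) -> int:
    """Multiply-adds for full attention scores: every query scores every key."""
    return n * n * d


def sparse_cost(n: int, d: int, d_router: int, k: int) -> int:
    """Cheap router over all pairs, then full attention over the top-k keys only."""
    k = min(k, n)               # cannot select more keys than exist
    router = n * n * d_router   # still quadratic, but with a small constant
    attend = n * k * d          # linear in n once k is fixed
    return router + attend


# Hypothetical sizes chosen for illustration, not taken from any real model.
D, D_ROUTER, K = 1024, 32, 2048

for n in (4_096, 32_768, 131_072):
    ratio = dense_cost(n, D) / sparse_cost(n, D, D_ROUTER, K)
    print(f"n={n:>7}: dense/sparse cost ratio = {ratio:.1f}")
```

Under these assumptions the ratio keeps growing with context length and approaches `D / D_ROUTER`, which is the sense in which a small quadratic term can still scale well in practice.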

ed · 2025-11-19T20:03:11 1763582591

Parent is likely thinking of sparse attention, which allows a significantly longer context to fit in memory.

qsort · 2025-11-19T20:26:03 1763583963

My comment was harsher than it needed to be, and I'm sorry. I should have gotten my point across in a better way.

With that out of the way, parent was wondering why compaction is necessary, arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3, and you people are sitting in the back going "well, actually, not all groups are abelian".