As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even ...

ein0p on April 1, 2024 | on: LLaMA now goes faster on CPUs

As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement. But as someone who routinely estimates picojoules per flop at $DAY_JOB, there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
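For anyone unfamiliar with the metric, here is a rough sketch of the arithmetic behind "picojoules per flop": sustained power divided by sustained throughput. The function name and the numbers in the usage line are made up for illustration, not measurements of any real CPU or of the code under discussion.

```python
def picojoules_per_flop(power_watts: float, flops_per_second: float) -> float:
    """Energy per floating-point operation, in picojoules.

    1 W = 1 J/s, so J/flop = W / (flop/s); multiply by 1e12 for pJ.
    """
    if flops_per_second <= 0:
        raise ValueError("flops_per_second must be positive")
    return power_watts * 1e12 / flops_per_second


# Hypothetical inputs: a part drawing 100 W while sustaining 1 TFLOP/s.
example = picojoules_per_flop(100.0, 1e12)
```

Both inputs should be sustained figures measured over the same run; using peak throughput together with average power would understate the energy cost.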

I think the previous code was using dot products, and f32 instead of bf16.
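To make the f32 vs bf16 point concrete: bf16 is the top 16 bits of an f32 (same sign and exponent, 7 mantissa bits instead of 23), so each value takes half the memory at the cost of precision. Below is a minimal sketch of the conversion using only the standard library. The helper names are mine, not from the code being discussed, and NaN handling is left out.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern for x (round to nearest even; NaN not handled)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Add a rounding bias so that dropping the low 16 bits rounds to nearest even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen a bf16 bit pattern back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


# Round-trip a value to see how much precision bf16 keeps.
approx = bf16_bits_to_f32(f32_to_bf16_bits(3.14159))
```

Halving the bytes per element matters because large matrix multiplies on a CPU are often limited by memory bandwidth rather than arithmetic.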