
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.

But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
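For anyone who wants to sanity-check an efficiency claim like this, the arithmetic is just sustained power divided by sustained throughput. A rough sketch in Python; the function name and the input numbers are made up for illustration and are not measurements from any real chip:

```python
def picojoules_per_flop(power_watts: float, flops_per_second: float) -> float:
    """Energy per floating-point operation, in picojoules.

    power_watts: sustained power draw while running the workload.
    flops_per_second: sustained (not peak) throughput on the same workload.
    """
    if flops_per_second <= 0:
        raise ValueError("flops_per_second must be positive")
    joules_per_flop = power_watts / flops_per_second  # W / (FLOP/s) = J/FLOP
    return joules_per_flop * 1e12  # 1 J = 1e12 pJ


# Hypothetical inputs, chosen only to show the units working out:
# 200 W at a sustained 2 TFLOP/s comes to about 100 pJ/FLOP.
example = picojoules_per_flop(power_watts=200.0, flops_per_second=2e12)
print(f"{example:.1f} pJ/FLOP")
```

Both inputs have to come from the same run. Pairing a datasheet peak throughput with an idle or TDP power number gives a figure that means very little.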



I think the previous code was using dot products, f32 instead of bf16.



