As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.