I see this stated frequently, but I rarely see evidence to back it up.
I grabbed his code, removed the drawing function and tested it with perf [1]. (Using glibc's qsort).
First test was with 1 million elements in the array. That's 4 MB of data, so it fits in L3 cache with room to spare.
Perf reports an average of 1.02 instructions/cycle with a cache miss rate of 2.829%.
Second test: 10 million elements (40 MB). Doesn't fit in cache.
Perf says: 0.95 instructions/cycle, cache miss rate of 37.352%.
Third test has 500 million elements (2 GB). Fits in RAM but not cache.
1.01 instructions/cycle and a cache miss rate of 66.555%.
If the CPU was stalling frequently, waiting for data, I would expect the number of instructions/cycle to drop dramatically as the cache miss rate increases. It drops slightly from 4MB to 40MB, but inexplicable increases from 40MB to 2GB.
So I see two possibilities:
1. I'm dumb and misinterpreting the perf data.
2. The out-of-order core can work around data stalls so effectively for this algorithm, the cache miss rate doesn't really matter.
I wonder if the speculatively executed instructions that are not used are still counted as a part of the instructions/cycle number? Definitely interesting results though.
I see this stated frequently, but I rarely see evidence to back it up.
I grabbed his code, removed the drawing function and tested it with perf [1]. (Using glibc's qsort).
First test was with 1 million elements in the array. That's 4 MB of data, so it fits in L3 cache with room to spare.
Perf reports an average of 1.02 instructions/cycle with a cache miss rate of 2.829%.
Second test: 10 million elements (40 MB). Doesn't fit in cache.
Perf says: 0.95 instructions/cycle, cache miss rate of 37.352%.
Third test has 500 million elements (2 GB). Fits in RAM but not cache.
1.01 instructions/cycle and a cache miss rate of 66.555%.
If the CPU was stalling frequently, waiting for data, I would expect the number of instructions/cycle to drop dramatically as the cache miss rate increases. It drops slightly from 4MB to 40MB, but inexplicable increases from 40MB to 2GB.
So I see two possibilities:
1. I'm dumb and misinterpreting the perf data.
2. The out-of-order core can work around data stalls so effectively for this algorithm, the cache miss rate doesn't really matter.
[1] https://perf.wiki.kernel.org/index.php/Main_Page