> Cache effects are going to dominate. I see this stated frequently, but I rarel...

> Cache effects are going to dominate.

I see this stated frequently, but I rarely see evidence to back it up.

I grabbed his code, removed the drawing function and tested it with perf [1]. (Using glibc's qsort).

First test was with 1 million elements in the array. That's 4 MB of data, so it fits in L3 cache with room to spare.

Perf reports an average of 1.02 instructions/cycle with a cache miss rate of 2.829%.

Second test: 10 million elements (40 MB). Doesn't fit in cache.

Perf says: 0.95 instructions/cycle, cache miss rate of 37.352%.

Third test has 500 million elements (2 GB). Fits in RAM but not cache.

1.01 instructions/cycle and a cache miss rate of 66.555%.

If the CPU was stalling frequently, waiting for data, I would expect the number of instructions/cycle to drop dramatically as the cache miss rate increases. It drops slightly from 4MB to 40MB, but inexplicable increases from 40MB to 2GB.

So I see two possibilities:

1. I'm dumb and misinterpreting the perf data.

2. The out-of-order core can work around data stalls so effectively for this algorithm, the cache miss rate doesn't really matter.

[1] https://perf.wiki.kernel.org/index.php/Main_Page