
There are instructions that would almost never be useful. See Linus's rant on cmov http://yarchive.net/comp/linux/cmov.html

The tl;dr is that it would only be useful if you are trying to optimize the size of a binary.



Linus complained because CMOV replaces a dependency-breaking conditional jump with a high-latency instruction that preserves data dependencies. On the Pentium 4s of that era it was often a net loss, as CMOV was particularly high latency (~6 clocks, low throughput), but today it is actually quite good (1 clock, high throughput on Skylake), and Linus is OK with it.


I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

Then the compiler works hard at determining whether branches are highly predictable. Is the branch closing a loop? Predict that you will stay in the loop. Is the branch checking for an exception condition? Predict that the exception is rare. OTOH, there are some branches that are, in fact, "flakey" on a dynamic execution basis. Deciding where to push a particular piece of data based on its value inside an innermost processing loop, for instance.

So... the compiler identifies "flakey" branches, emits code to compute both arms of the if, and CMOVs the desired result at the end. That allows the instruction issue pipeline to avoid seeing a branch at all, thus avoiding polluting the branch cache with a flakey branch and avoiding a whole bunch of pipeline flushes in the back end, at the cost of spending extra back-end resources on throw-away work.
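As a rough sketch (hypothetical example; whether the compiler actually does this depends on its heuristics, flags, and the target):

    /* A "flakey" branch: which value wins is essentially random,
       so a predictor gets it wrong a lot. */
    int pick(int a, int b)
    {
        return (a > b) ? a : b;
    }

    /* Branchy translation (mispredicts often when "a > b" is ~50/50):
     *     cmp   %esi,%edi
     *     jle   .use_b        ; pipeline flush on every mispredict
     *
     * If-converted translation: keep both candidates and select one
     * with CMOV, so the front end never sees a branch at all:
     *     cmp   %esi,%edi
     *     mov   %esi,%eax     ; assume b
     *     cmovg %edi,%eax     ; overwrite with a when a > b
     */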

CMOV is in X86 for a reason. On Pentium Pro and later, it is a win if your compiler has good branch analysis.


That's an interesting perspective. I always thought that cmov was more important if the CPU had really bad branch prediction - the MIPS CPUs that were in the PS2 and PSP had famously bad branch prediction; there were all sorts of interesting cases where we could manually recode using cmov to avoid mispredictions in hot-path code.

But I guess if the compiler knew enough about the branch prediction strategy employed by the CPU, it could do this optimization itself when prediction wouldn't work right, even on a CPU with good prediction.


> I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

gcc and clang really have no idea how x86 branch predictors work, and the predictors are secret, so nobody is going to contribute a model for them. I haven't read the if-conversion pass, but it's just some general heuristics.

There also isn't anything guessing whether a specific branch is mispredictable or not; it's more that it converts anything that looks "math-like" rather than "control-flow-like" into a cmov.
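Roughly the sort of thing that heuristic catches vs. leaves alone (a sketch, not what the actual pass does):

    /* "Math-like": both arms are cheap, side-effect-free expressions,
       so this is a natural candidate for a cmov. */
    int clamp_low(int x, int lo)
    {
        return (x < lo) ? lo : x;
    }

    /* "Control-flow-like": the taken arm has a side effect (a store
       that may not even be safe to execute speculatively), so this
       normally stays a real branch. */
    void store_if(int cond, int *p, int v)
    {
        if (cond)
            *p = v;
    }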


You don't even need perfect, "insider" branch analysis if you can do profile-guided optimisation using actual branch performance counters in the profile.


Profile-guided optimization will certainly do better if you profile with a decent data set and take the time. From what I've experienced, the main users of profile-guided optimization are compiler validation engineers.


Linus rants about most everything; CMOV helped avoid branch prediction issues, etc.


At least he has arguments and data to back it up.


There are also very useful instructions that don't have semantics for most programming languages and have little or no use outside of booting a system. Like LGDT and some of the tricks to switch from real mode to protected mode.

What's maybe more curious is that some instructions in certain modes aren't even implemented by some assemblers. I don't think it's entirely unusual to see hand-crafted opcodes in an OS boot sequence. When the assembler doesn't do it, the compiler certainly isn't going to.


It is _very_ useful when hand-optimizing loops in assembly. But for the compiler, the newer processors have such incredible branch predictors that it is usually a bad idea for the compiler to assume a branch is poorly predicted.

Now if you are using profile-guided optimizations (--prof-gen/prof-use) AND you use the processor's performance counters as part of the feedback to the compiler, _then_ I could see the compiler correctly using this instruction.


I have mixed feelings about that incredible progress in modern CPUs' branch prediction. Don't they just use idle execution units and run both branches in parallel behind the scenes? That looks great in microbenchmarks, where there are plenty of idle execution units to use and the memory bus and caches are underused. It may not perform so well under a real load, when there are no idle execution units available for free and the memory bus and cache lines are choking.

Is it somewhat similar to these two examples?

> If it owns that line then CAS is extremely fast - core doesn't need to notify other cores to do that operation. If core doesn't own it, the situation is very different - core has to send request to fetch cache line in exclusive mode and such request requires communication with all other cores. Such negotiation is not fast, but on Ivy Bridge it is much faster than on Nehalem. And because it is faster on Ivy Bridge, core has less time to perform a set of fast local CAS operations while it owns cacheline, therefore total throughput is less. I suppose, a very good lesson learned here - microbenchmarking can be very tricky and not easy to do properly. Also results can be easily interpreted in a wrong way.

http://stas-blogspot.blogspot.com/2013/02/evil-of-microbench...

> On current processors, POPCNT runs on a single execution port and counts the bits in 4B per cycle. The AVX2 implementation requires more instructions, but spreads the work out across more ports. My fastest unrolled version takes 2.5 cycles per 32B vector, or .078 cycles/byte (2.5/32). This is 1.6x faster (4 cycles per 32B /2.5 cycles per 32B) than scalar popcnt(). Whether this is worth doing depends on the rest of the workload. If the rest of the work is being done on scalar 64-bit registers, those other scalar operations can often fit between the popcnts() "for free", and the cost of moving data back and forth between vector and scalar registers usually overwhelms the advantage.

https://news.ycombinator.com/item?id=11279047
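For the CAS case, here is a minimal C11 sketch of the kind of loop the first quote is talking about (illustrative only; names are made up). The interesting variable is whether the cache line holding `counter` stays owned by one core or bounces between cores.

    #include <stdatomic.h>

    static _Atomic long counter;

    /* One CAS increment.  If this core already holds the cache line in
       exclusive state, the CAS is cheap; if another core owns it, the
       line has to be fetched in exclusive mode first, which is the
       slow cross-core negotiation described in the quote. */
    static void cas_increment(void)
    {
        long old = atomic_load_explicit(&counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   &counter, &old, old + 1,
                   memory_order_relaxed, memory_order_relaxed)) {
            /* 'old' was refreshed by the failed CAS; just retry. */
        }
    }

Run that loop from one thread and you mostly measure line-owned CAS latency; run it from several threads and you measure cache-line ping-pong, which is why the two setups can rank CPUs differently.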


> I have mixed feelings about that incredible progress in modern CPUs' branch prediction. Don't they just use idle execution units and run both branches in parallel behind the scenes?

No, I do not know of any production CPU that does that; branch predictor accuracy today is ~98-99%, so it would be a waste of power and execution capability to speculate down both paths of a branch.

Also, I think code on average executes one jump every 5 instructions, and an OoO core can speculate hundreds of instructions ahead, so the number of possible code paths would explode very quickly.
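Back-of-the-envelope, assuming a window of roughly 200 instructions in flight: that's about 200/5 = 40 unresolved branches, so eagerly executing both sides of each would mean up to 2^40 (around a trillion) potential paths, whereas a 98-99% accurate predictor only throws work away on one or two branches per hundred.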




