
It's interesting that he points out that testing for zero is cheaper than comparing two numbers. Interesting, because this might not always be the case.

For a quick test I used the conditions (i = 0; i < NUMBER; i++) and (i = NUMBER; i != 0; i--). Without optimizations turned on, gcc translates both of them to a cmpl (compare long) followed by a jne (jump if not equal). So at the assembly level they're identical, except that in one case you're testing against zero whilst in the other you're testing against NUMBER. Fine, so we'll have to dig deeper.

And this is where it gets complicated. This optimization depends entirely on the inner workings of the ALU. Theoretically, testing against zero is simpler: the hardware only has to check that every bit of the value is clear, whereas comparing a and b requires an actual subtraction a - b whose result is then checked for zero (and sign). On a hardware level such operations might be parallelized inside the ALU, though, so a comparison of two numbers might take exactly as many clock cycles as a comparison against zero.

It's an interesting excursion into how such basic things work, however. And concerning my test: they're both about equally fast on my machine. I've done multiple runs with no clear winner.



The second version (test against zero) benefits from the fact that the subtraction instruction sets the Z flag automatically. So the end of the second loop is something like "sub i,1" followed by "jnz top".

The most straightforward implementation of the first loop would be "add i,1" followed by "cmp i,NUMBER" followed by "jb top" (or "jl top" if i is signed). That's an extra instruction, which may be even slower, depending on the specific CPU and surrounding code, if NUMBER is a literal or a memory access (as opposed to a register value).

My guess is that your compiler will produce code that takes advantage of the savings if you turn on optimizations, but you might have to actually do something in the body of the loop to keep the compiler from optimizing the loop completely out (or tweak flags to enable/disable specific optimization techniques). Adding a loop body will make the timing difference less noticeable, as the single-instruction savings of the zero-test version becomes a smaller percentage of the total time per iteration.

Disclaimer: This is how I'd write assembly by hand. A compiler could conceivably do something quite different, depending on the specific application code, compiler, version and flags. I also haven't yet taught myself 64-bit x86 assembly code, although I understand it's rather similar to 32-bit x86 with more registers available.


The compiler optimizing out the loop completely is one of the reasons why I didn't turn on optimizations here, although there was a loop body. gcc is smart enough to optimize loops with a fixed outcome away in some cases. I probably should have written a more complex loop body.

That said, it's very likely that the compiler will make sense of such an optimization and produce the quicker assembly, much like you just wrote it.

What I really wanted to say with my post is that things like these are heavily dependent on the architecture, and that comparisons can be optimized in hardware.


> The compiler optimizing out the loop completely is one of the reasons why I didn't turn on optimizations here

Use a non-constant value for the loop bound and put a dummy load inside to stop the compiler from doing loop unrolling and/or dead code elimination.

I often use time(), rand() or even argc to get dummy values when looking at assembly from compiler optimizations.


One can just write the loop as

  for (..; ..; ..) asm("");
GCC doesn't introspect the asm statements, and so it leaves the loop there. (At least, it did for me, even with -O3.)


(Just as a note, the count-down condition is normally written (i = NUMBER; i--; ); the condition you have gives different results: inside the body, i is 1 larger than one might expect.)

> Without optimizations turned on, gcc translates both of them to cmpl operations followed by a jne

Why would you expect anything different? It isn't doing any optimisations, so it is doing the naive method of transliterating the C to ASM.

The actual test is when one turns optimisations on. The code generated by GCC -O3 for your two conditions is (respectively):

  .L21:
        addl    $1, %eax
        cmpl    %edx, %eax
        jne     .L21
and

  .L13:
        subl    $1, %eax
        jne     .L13
And for the condition I gave above

  .L7:
        subl    $1, %eax
        cmpl    $-1, %eax
        jne     .L7
You can draw your own conclusions from that, but clearly your count-down method uses fewer instructions.

(Also, one can test for 0 extremely quickly: just check that all the bits are 0.)


AT&T syntax makes me want to gouge my eyes out.

You have to add these suffixes "l" or "w" to all your instructions to tell the compiler how big they are, but of course if you're using EAX you're talking about a 32-bit register; if you wanted to talk about a 16-bit register, you'd say AX, or AL (or even AH) for 8-bit. An assembler that isn't smart enough to figure out something that simple and obvious is a sorry tool indeed.

And then there's the fact that operands are backwards. "subl $1, %eax" translates to "eax -= 1" in C-like syntax. Why switch the order of operands? It's completely unnecessary and confusing. It's like someone inventing a programming language where integer literals were negative if specified with no sign, and only positive if you put a + on them -- technically possible, but the sort of thing that serves no useful purpose and defies all prior conventions, common sense, and sanity.

And putting dollar signs on constants and percent signs on registers? I can't even imagine any possible explanation for that, other than a lazy parser author who didn't care about making extra work for their users. It's the type of corner-cutting that you'd expect to find in some prototype thrown together by a single person in a single evening.

Please, for the sake of your own sanity and those around you, use Intel syntax instead!


It depends on the architecture. I know that a register load on the Motorola 68k CPU will set the flags (Z for zero, N for negative) while a register load on the Intel x86 line doesn't change the flags and thus, a comparison needs to be made.

Both architectures, however, include special instructions for numeric loops (for instance for (i = S; i != E; i += step)) that count downwards. They have slightly different ending conditions (Intel's LOOP ends on 0, the Motorola's DBRA ends on -1), and the Intel version is quite specialized (it requires the use of the CX (ECX, RCX) register; on the Motorola you can use any of the data registers), so it's kind of hard to exploit from portable C code. Your best bet: don't use the for index in the body of the loop, and hope the compiler can transform the loop to use the specialized instruction.

If speed is really critical, you need to check the output from the compiler, and make sure you always measure twice (and keep the original code around as a reference when the hardware changes).


There is no need to compare - ORing the register with itself will suffice to set the Z flag. This is more concise on x86 - I think it's 2 bytes to OR a 32-bit register with itself, vs. 3 for a compare with a (sign-extended) 8-bit immediate zero.



