
Your optimizing compiler today will actually optimize. LLVM was recently ported to the 6502 (yes, really) [1]. An example:

    void outchar (char c) { 
        c = c | 0x80;                                    // Apple II text output wants the high bit set
        asm volatile ("jsr $fbfd\n" : : "a" (c): "a");   // ROM output routine; c is passed in the accumulator
    }

    void outstr (char* str) {
        while (*str != 0) 
            outchar(*str++);
    }

    void main () {
        outstr("Hello, world!\n");
    }
That is compiled to this:

    lda #$c8       ; ASCII H | 0x80
    jsr $fbfd
    lda #$e5       ; ASCII e | 0x80
    jsr $fbfd
    ...
An unrolled loop, over a function applied to a constant string at compile time. An assembler programmer couldn't do better: it's the fastest way to output that string, so long as you rely on the ROM routine at $fbfd. (Apple II, for the curious.) Such an optimizing transform is unremarkable today, but stuff like that was cutting edge in the 90s.

[1] https://llvm-mos.org/wiki/Welcome



I understand your point, but LLVM-MOS is a bad example. You gain LLVM’s language optimizations, as you point out. But LLVM’s assumed architecture is so different from the 6502 that lowering the code to assembly introduces many superfluous instructions. (As an example, the 6502 has one general purpose register, but LLVM works best with many registers. So LLVM-MOS creates 16 virtual registers in the first page of memory and then generates instructions to move them into the main register as they are used.) It’s of course possible to further optimize this, but the LLVM-MOS project isn’t that mature yet. So assembly programmers can still very much do better.
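To make that concrete, a move between two of those imaginary registers has to pass through the one real accumulator, something like this (the zero-page locations here are made up for illustration; the actual layout is up to llvm-mos):

    lda $02        ; load one imaginary register from the zero page
    sta $04        ; store it into another; A is clobbered along the way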


> So LLVM-MOS creates 16 virtual registers in the first page of memory and then generates instructions to move them into the main register as they are used.

Isn’t this actually good practice on the 6502? The processor treats the first page of memory (called the zero page) differently. Instructions that address the zero page are shorter because they leave out the most significant byte. Addressing any other page requires that extra byte for the MSB.

Furthermore, instructions which accept a zero page address typically complete one cycle faster than absolute addressed instructions, and typically only one cycle slower than immediate addressed instructions.

So if you can keep as many of your memory accesses in the zero page as possible, your code will run a lot faster. It would seem to me that treating the zero page as a table of virtual registers is a great way to do that, because you can bring all your register colouring machinery to bear on the problem.
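To put numbers on that, the standard LDA timings for the three addressing modes are:

    lda #$42       ; immediate:  2 bytes, 2 cycles
    lda $42        ; zero page:  2 bytes, 3 cycles
    lda $1234      ; absolute:   3 bytes, 4 cycles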


I understand your point, but the beginning of the zero page is almost always used as virtual registers by regular hand-rolled 6502 applications, so it's pretty normal for LLVM to do the same; it's not an example of LLVM doing something weird.


Not really that much of a wonder, other than that the 6502 sucks for C.

"An overview of the PL.8 compiler", circa 1976

https://dl.acm.org/doi/abs/10.1145/989393.989400


Does it end that code with a

  jmp $fbfd

?


Doubtful, given the JSR comes from an inline asm. You’d need to code the call in C (with an appropriate calling convention, which I don’t know if this port defines) for Clang to be able to optimize a tail-position call into a jump—which it is capable of doing, generally speaking.
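The transformation itself is tiny on the 6502: trade the trailing JSR/RTS pair for a single JMP, so the callee's RTS returns straight to our caller (saving a byte and nine cycles per call):

    ; call in tail position, then return:
    jsr $fbfd
    rts
    ; tail-call form: the ROM routine's RTS returns for us
    jmp $fbfd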


No. The compiler knows that trick for its own code :) Not sure about introspecting into the inline assembly (I think LLVM doesn't do that). But either way, standard C returns an int of 0 from main on success. So: ldx #0 : txa : rts


Nitpick: this isn’t standard C (it uses void main, not int main)

Nitpick 2: why ldx #0 txa rts? I would think lda #0 rts is shorter and faster

Back to my question: if it can’t, the claim “an assembler programmer couldn't do better” isn’t correct.

I think an assembler programmer for the 6502 would consider doing a jmp at the end, even if it makes the function return an incorrect, possibly even unpredictable value. If that value isn’t used, why spend time setting it?

An assembly programmer also would:

- check whether the routine at 0xFBFD accidentally guarantees to set A to zero, or returns with the X or Y register set to zero, and shamelessly exploit that.

- check whether the code at 0xFBFD preserves the value of the accumulator (unlikely, I would guess, but if it does, the two consecutive ‘l’s need only one LDA#)

- consider replacing the code to output the space inside “hello world” by a call to FBF4 (move cursor right). That has the same effect if there already is a space there when the code is called.

- call 0xFBF0 to output a printable character, not 0xFBFD (reading https://6502disassembly.com/a2-rom/APPLE2.ROM.html, I notice that is faster for letters and punctuation)

On a 6502, that’s how you get your code to fit into memory and make it faster. To write good code for a 6502, you can’t be concerned about calling conventions or having individually testable functions.
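Putting a few of those ideas together, the hand-tuned version might come out roughly like this (assuming, per the above, that $FBF4 moves the cursor right and that A does not survive the call to $FBFD):

    lda #$c8       ; 'H' | $80
    jsr $fbfd
    lda #$e5       ; 'e' | $80
    jsr $fbfd
    ...            ; 'l', 'l', 'o', ',' the same way
    jsr $fbf4      ; move the cursor right instead of printing the space
    ...            ; 'w' through '!' the same way
    lda #$8d       ; the newline, as a carriage return with the high bit set
    jmp $fbfd      ; tail call: skip the final RTS entirely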


I bet that sizeof(int)==2 - which immediately tells you everything you need to know - and the return value from a function has 8 bits in X and 8 bits in A. So ldx#0:txa is how you load a return value of (int)0.

Regarding this specific unrolled loop, I would expect a 6502 programmer would just write the obvious loop, because they're clearly optimizing for space rather than speed when calling the ROM routine. They'll be content with the string printing taking about as long as it takes, which clearly isn't too long, as they wouldn't have done it that way otherwise. And the loop "overhead" won't be meaningful. (Looks like it'll be something like 7 cycles per character? I'm not familiar with the Apple II. Looks like $fbfd preserves X though.)
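That obvious loop would look something like this (directive syntax varies by assembler):

    ldx #$00
    loop:
    lda message,x
    beq done
    ora #$80       ; set the high bit, as the C version does
    jsr $fbfd      ; the ROM routine preserves X, per the above
    inx
    bne loop
    done:
    rts
    message:
    .byte "Hello, world!", $0d, $00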


We did a lot of loop unrolling and self-modifying code back in the day, when making demos for the C64. The branch is really expensive. For example, when clearing the screen you might use 16 STA adr,x instructions and then add 16 to X before you branch back to the top of the loop.
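That pattern, cut down to four stores per pass to keep the example short, looks roughly like this for one 256-byte page of the default C64 screen at $0400 (the fill value is reloaded each pass because stepping X goes through A):

    ldx #$00
    clear:
    lda #$20       ; screen code for a blank
    sta $0400,x
    sta $0401,x
    sta $0402,x
    sta $0403,x
    txa            ; no add-to-X instruction, so step X via A
    clc
    adc #$04
    tax
    bne clear      ; wraps to zero after one page; repeat for the rest of the screen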


Indeed, in some cases you want the unrolls. The 6502 is good in the twisties, but if you're trying to do any kind of copy or fill then the percentage of meaningful cycles is disappointingly low, and the unroll may be necessary. Also, if you're trying to keep in sync with some other piece of hardware, then just doing it step at a time can be much easier.

I have done a lot of all of this sort of code and I am quite familiar with the 6502 tradeoffs. But for printing 15 chars by calling a ROM routine, I stand by my comments.


Yes, I compiled with -O3 for maximum speed. That would be an unusual flag choice in most cases.

I just wanted to use 6502 code (so many seem to be able to read it!) side by side with C. x86 would have worked as well; there, the fastest answer would also be the same construct, assuming the dependency on an external routine.


> Nitpick: this isn’t standard C (it uses void main, not int main)

You know what, I'm gonna nitpick that nitpick: void main() is fully allowed on a freestanding target, which is still standard C.

Given the C standard's historically generous interpretation of undefined behaviour and other miscellany, I think it's a reasonable reading of the standard to treat a target that allows something other than int main(...) as freestanding rather than hosted, and therefore fully conforming.


Yep, llvm-mos-sdk is explicitly freestanding; the libc functions in the SDK follow the hosted C standard, but they don't add up to a hosted implementation. The only known C99 non-compliance is the lack of floating point support, which is actively being worked on.


Well... I mean... if you want inverse video.



