WebAssembly Is Not A Stack Machine

azakai · on Oct 8, 2022

> However, locals are a problem. They’re mutable, so you can’t trivially convert them to SSA

As mentioned in previous discussions, that isn't accurate. Wasm locals can be easily converted to SSA, and that is exactly what all major wasm VMs have been doing since wasm launched. Wasm was designed to make that efficient.

Yes, you usually want to do liveness analysis on that SSA form. In theory that information could have been bundled into wasm, at the cost of code size (and a verification scheme). I'm not aware of data showing that that would have been a better tradeoff.

SSA conversion and locals in general are not a significant challenge for wasm VMs, neither in complexity nor generated code quality. As the author admits,

> optimising compilers will probably not generate better code from [possible changes]

There are other areas that actually are challenges for wasm atm, such as memory management (no mmap or even shrinking) and no irreducible control flow (may well be limiting performance on certain benchmarks, but data is lacking).

hinkley · on Oct 8, 2022

By far the biggest change to the Java file format I can remember was when they added this kind of metadata to the files. It was meant to reduce startup time. I think they even made it mandatory at some point, which killed off a minifier I’d been enjoying.

That developed during the era of papers about Proof Carrying Code, which I haven’t heard about in quite some time. So maybe the cachet has gone.

CalChris · on Oct 8, 2022

WebAssembly code consists of sequences of instructions. Its computational model is based on a stack machine in that instructions manipulate values on an implicit operand stack, consuming (popping) argument values and producing or returning (pushing) result values.

https://webassembly.github.io/spec/core/syntax/instructions....

  (module
    (import "console" "log" (func $log (param i32)))
    (func $main
      ;; load `10` and `3` onto the stack
      i32.const 10
      i32.const 3

      i32.add ;; add up both numbers
      call $log ;; log the result
    )
    (start $main)
  )

https://developer.mozilla.org/en-US/docs/WebAssembly/Referen...

Looks pretty stacky to me.

Someone · on Oct 8, 2022

The article’s argument is that it’s not a stack machine because entry of a block is a barrier for accessing what’s on the stack.

(AFAIK, exiting a block also is limited in that it can only return a single item)

You can’t do what’s typical in forths: push intermediate values, enter a loop, use and/or change intermediate values, check for end of loop, loop or exit, read intermediate values.

An example from http://www.murphywong.net/hello/simple.htm#L15):

  : INTEGERS  ( +n -- )
     1            \ Initialize i to 1     ( +n i=1 )         
     BEGIN        \ Start loop: i is TOS  ( +n i )
        2DUP      \ Duplicate 2 items     ( +n i +n i )
        <         \ Is +n less than i ?   ( +n i flag )
        IF        \ Act on flag           ( +n i )
           2DROP  \ True: drop 2 items    (  )
           EXIT   \ True: leave word      (  )
        THEN      \ End IF ... THEN       (  )
        DUP       \ DUPlicate TOS         ( +n i i )
        .         \ Display TOS           ( +n i )
        1+        \ Increment TOS         ( +n i=i+1 )
     AGAIN        \ Loop back             ( +n i )
     ;            \ End definition

Here, both the value of n and the loop counter are on the stack when entering the loop, and the loop removes them before exiting (using 2DROP)

Edit: you also can’t take multiple values from the stack in a loop, as in

  111 108 108 101 72 5 0 DO EMIT LOOP

which, if I didn’t mess things up, prints “Hello”.

garaetjjte · on Oct 8, 2022

>AFAIK, exiting a block also is limited in that it can only return a single item

Multi-value returns were added after MVP but they are supported now.

amelius · on Oct 8, 2022

Looks also very lispy.

Jasper_ · on Oct 8, 2022

WebAssembly started as an AST dump, and Lisp-appearing syntax is a simple way of dumping any tree, ASTs included, so the WebAssembly Textual syntax (WAT) adopted it just because it was easy. Then function bodies went from post-encoded tree form to a linear set of instructions managing a stack, which means there's no reason for a tree encoding anymore. So the Lisp-y looking bits are mostly in the metadata structures around the function bodies.

You could output a lot of formats in a Lisp-y looking way. The language itself has very little to do with actual Lisp.

throwaway81523 · on Oct 8, 2022

From 2019, looks interesting, but discusses some wasm changes that were under consideration at that time. It would be helpful to know if those got implemented. And I thought stack vm's had gone out of style: the current Lua VM is a register VM and they found that code density is about as good as the older stack VM.

sourthyme · on Oct 8, 2022

https://news.ycombinator.com/item?id=19069587 says that at least the first part has been fixed. Maybe this article should be marked 2018 (or 2017?).

cesarb · on Oct 8, 2022

On the third part, talking about the stack pointer:

> “So”, you ask, “why not just reserve a register for this global variable?".

I find it interesting that the 64-bit ARM architecture did go that way. The 32-bit ARM architecture had the stack pointer (and the instruction pointer) in a normal general-purpose register; when developing the 64-bit ARM architecture, which fixed many of the warts of the 32-bit ARM architecture which made it harder to have a high-performance implementation, they moved both the stack pointer and the instruction pointer into dedicated registers. So even in hardware, it made sense to reserve a special register for the stack pointer.

masklinn · on Oct 8, 2022

> So even in hardware, it made sense to reserve a special register for the stack pointer.

I think it was mostly a question of cleanup / clarity of the interface: on ARM32, using R13/SP as a GPR could lead to odd behaviours and was a bit of a portability concern, to say nothing of R15 (PC/IP, which was only a GPR in name IIRC as some instructions excluded it and it had divergent side-effects).

Also unlike every other GPR, R13 (and R14) was banked on all privileged modes (R0-R7 were shared across mode, and R7-R12 were banked in FIQ).

Moving SP and PC to dedicated non-GPR makes the instruction set easier to work with (and allowed register 31 to pull double duty as a virtual reference to SP and ZR depending on instruction, and thus ARM to finally have a zero register).

samsquire · on Oct 8, 2022

I find stack based assembly easier to understand over register based assembly.

I read this article recently: https://www.mattkeeter.com/blog/2022-10-04-ssra/

It's regarding register allocation. According to comments on the HN post for it, it uses the same algorithm that Luajit uses. It uses a technique by searching backwards through the code to work out whether it needs to spill a register from memory or to memory.

I haven't written a bytecode interpreter or a compiler but I wrote a parallel interpreter that is multithreaded. Its code is plaintext so it's not particularly fast but is built with an actor implementation and can send and receive integers between threads. If it supported objects and network requests, it could be useful.

I am yet to implement any stack based operations but I have an idea for an object oriented design similar to Java JVM specification.

Here's the assembly. Colons indicate labels. addv adds values to variables. add adds two variables. send sends messages to a thread number. receive receives a message. The send and receive commands jump to the label indicated if there is no messages.

threads 25

<start>

set running 1

set current_thread 0

set received_value 0

set current 1

set increment 1

:while1

while running :end

receive received_value :send

:send

add received_value current

addv current_thread 1

modulo current_thread 25

send current_thread increment :while1

endwhile :while1

:end

This program gets between 1.5-65 million messages per second. If the data was larger, you could send arbitrary sized objects between threads in constant time so this rate shouldn't change but it doesn't support objects yet.

https://github.com/samsquire/multiversion-concurrency-contro...

Joker_vD · on Oct 8, 2022

What a weird article. The author essentially compiles a huge expression made up entirely of function calls (with opaque definitions, so no inlining opportunities) and zero primitive operations like addition or indexing — compiler writers don't, as a rule, push their effort into optimizing such kinds of workloads. They are much more interested in spending effort on optimizing expressions that involve arithmetic/string operations and indexing of arrays, in inner loops, because that's where most common programs spend their time in.

Also, counting the number of load/stores is useful, but why not run the resulting binary programs and measure actual run times? Compilation times are important, of course, but run times are more important: most computer users are not programmers, they run precompiled binaries, so for them compilation times don't matter.

And a note about the algorithm itself. If you're not in a JIT scenario and can afford to spend a couple of millseconds, you can determine optimal variables to spillusing ILP and then allocate registers using "optimistic coalescing", see [0]. Of course, on a RISC machine with lots of registers you can, most of the time, on usual workloads, pretty much ignore the problem of spilling.

And if you are in a JIT scenario and need very fast register allocator but don't care that much about the efficiency of the produced code... you can use whatever algorithm Wirth have been using in his compilers since the 70-ies: [1], "10. Expressions and Assignments".

[0] Andrew W. Appel, Lal George. Optimal Spilling for CISC Machines with Few Registers. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92...

[1] Niclaus Wirth. Compiler Construction. https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...

DannyBee · on Oct 8, 2022

Stack machines must be converted to register machines anyway before you can perform most useful optimizations to it, which is linear time or worse, and the process of doing SSA on webassembly in the current form is linear time anyway.

You will not speed anything up by using a stack machine here. The usual reason to use stack machines is to make it smaller and easier to run an interpreter on, not to make it compile or optimize faster

saagarjha · on Oct 8, 2022

Past discussion: https://news.ycombinator.com/item?id=24303182

masklinn · on Oct 8, 2022

Probably better to link directly to the original thread, as it had WASM contributors chiming in: https://news.ycombinator.com/item?id=19069587

tekknolagi · on Oct 8, 2022

The comments by Pizlo and Titzer are pretty illuminating

bluedays · on Oct 8, 2022

This is besides the point in regards to the article but I think it’s fascinating that we’ve abstracted away everything from assembly to the point where we start making new virtual machines line the JVM and webassembly. It’s like the snake is eating itself.

masklinn · on Oct 8, 2022

I'm not sure what year you currently live in, but for what it's worth this approach dates back to the mid-70s, with Pascal's p-Systems, or chip8.

kragen · on Oct 10, 2022

01949: https://en.wikipedia.org/wiki/Short_Code_%28computer_languag...

brundolf · on Oct 8, 2022

I think this needs a (2017)