I'm working on something similar, but instead of optimizing for low call latency I'm prioritizing security, execution performance and compilation speed. I'm currently at roughly the same execution speed as wasmtime, but with over 200x faster compilation, and with significantly better security sandboxing.
RISC-V (well, a slightly modified variant) actually makes for a really good VM bytecode!
RV32E is a big part of it since it effectively uses only 13 registers, which makes it possible to write a recompiler that doesn't have to spill any of them to memory on AMD64.
But there are a few other aspects of RISC-V which are suboptimal for a VM like mine, which I handle by postprocessing the standard ELF binaries with a custom linker. Some of the major things are:
- The instruction encoding format. While it's great for hardware execution, for a VM it can be both simplified and made more compact. For example, I encode branches as: a byte with an opcode + a byte with two 4-bit registers packed + a varint with the absolute address to jump to. This is more space efficient, naturally supports fully sized immediates and doesn't require the crazy bit shuffling to extract them.
- The limited range of immediates. Due to the use of varints I support full length immediates, and I macro-fuse LUI + ADDI etc. in my linker into a single ADDI.
- The zero register. This is only a problem because I must support AMD64 as a target, and there just isn't a free register there in which I could park a zero. This can be special-cased in the VM, but it is kinda tricky and annoying (and a lot of these cases are never emitted by the compiler, so you end up with a bunch of complexity and effectively useless dead code). So what I do is disallow the zero register in the bytecode, and when postprocessing the code in my linker I either replace those instructions (e.g. if there's a MUL which uses the zero register as one of the source registers, I compute the result and put an immediate load there instead), get rid of them (e.g. if the instruction uses it as a destination, then it's a NOP), or convert them into new instructions which take any immediate, not only a zero.
- The alignment requirements of code addresses. I don't load any of the code into guest-visible memory (it's effectively a Harvard architecture VM), and I virtualize code addresses so they don't address instructions individually but basic blocks (so e.g. an address of "4" points to the 1st basic block, "8" points to the 2nd, etc.). This has a couple of benefits: it reduces the attack surface (you can't ROP your way around anywhere, only to the start of basic blocks), it is more compact to encode (since there are fewer basic blocks than instructions, the addresses are smaller and you need fewer bits to encode them as varints), and it makes it possible to efficiently emit code for basic-block-based gas metering in a single pass without having to analyze the control flow. Unfortunately I still need to handle the case of "what if someone dynamically jumps to address 3", and there's no way to work around this issue in my linker, so at run time I have to create a jump table (for translating guest addresses into host addresses) that's 4 times bigger than necessary. This wastes memory and introduces extra cache misses. (I can later drop this down to 2x if I add support for the C extension, but that will still make it twice as big as necessary.)
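To make the encoding idea above concrete, here is a minimal sketch of a branch encoded as one opcode byte, one byte packing two 4-bit registers, and a LEB128-style varint for the absolute target address. The opcode value and function names are made up for illustration; they are not the actual format of the project described.

```python
def encode_varint(value):
    """Encode an unsigned integer as a LEB128-style varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_branch(opcode, rs1, rs2, target):
    """opcode: u8; rs1/rs2: register indices 0..15; target: absolute address."""
    assert 0 <= rs1 < 16 and 0 <= rs2 < 16
    return bytes([opcode, (rs1 << 4) | rs2]) + encode_varint(target)

# A branch comparing registers 5 and 6, jumping to basic block address 12,
# fits in just three bytes, with no bit shuffling needed to decode it.
insn = encode_branch(0x20, 5, 6, 12)
print(insn.hex())  # -> "20560c"
```

Decoding is correspondingly simple: read a byte, read a byte, read a varint, with fully sized immediates falling out of the varint for free.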
---------
So my number one wish for RISC-V that'd make it nicer for VMs (and wouldn't be VM specific) would probably be an extension which supports full length immediates and forces the minimum alignment of code addresses to be 1.
Recently I wrote a simplistic RISC-V interpreter, mostly for educational purposes, and a friend benchmarked it against several scripting systems in the context of game-character control routines.
Well, it's quite fast for what it is, though Lua is still somewhat faster, heh.
Very cool! One tip: take the execute segment and produce fast bytecodes instead of interpreting each instruction as a bit pattern. Producing faster bytecodes is something that even WASM emulators do, despite WASM already being a bytecode format.
Thanks! Yeah, this one is supposed to be a very simple implementation for a kind of "how to write a machine code interpreter" tutorial, but I was looking to experiment with some optimizations if time permits.
Patching with pre-baked alt-bytecode was one of the ideas indeed.
The fast bytecode is reduced to a simple operation that excludes certain knowns. For example if you have an instruction that stores A0 = A1 + 0, then knowing the immediate is zero, this can be reduced from reading a complex bit pattern to a bytecode that moves from one register to another, basically MV dst=A0, src=A1.
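That decode-time specialization can be sketched roughly like this; the handler names and register layout here are illustrative, not any particular emulator's internals:

```python
# Decode once, then pick a cheaper handler when operands allow it:
# ADDI rd, rs1, 0 is really just a register move (MV).

def op_mv(regs, dst, src, _imm):
    regs[dst] = regs[src]

def op_addi(regs, dst, src, imm):
    regs[dst] = (regs[src] + imm) & 0xFFFFFFFF  # 32-bit wraparound

def specialize_addi(dst, src, imm):
    """Return (handler, dst, src, imm), chosen once at decode time."""
    if imm == 0:
        return (op_mv, dst, src, 0)   # strength-reduced: no add, no immediate
    return (op_addi, dst, src, imm)

regs = [0] * 32
regs[11] = 1234                                       # A1
handler, dst, src, imm = specialize_addi(10, 11, 0)   # ADDI A0, A1, 0
handler(regs, dst, src, imm)
print(regs[10])  # -> 1234
```

The bit-pattern extraction happens exactly once, at translation time, and the hot loop just dispatches pre-decoded operations.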
This was posted before and I still have no idea what the rationale is. No desktop PC or game console is RISC-V, so if I'm going to all the trouble to use a scripting solution that requires me to compile my scripts to machine language, why would I target RISC-V? Why wouldn't I just compile to x86-64 directly?
Like, what is the vision here, a LuaJIT that targets RISC-V, running inside my x86-64 game, where the emulator translates the RISC-V back into x86-64... for reasons? Just isolation?
Yes. When something's intellectually stimulating you work on it more, that's it.
Most game development is a grind. These huge distractions, like a scripting backend: if you spend 100h working on the scripting engine and only 1h authoring actual scripts, you still spent 1h authoring scripts in those two weeks instead of 0h. The game gets delivered sooner even if you spend 10x as long working on it.
It's a quintessential misunderstanding about indie game development. HN readers think Jonathan Blow is wasting his time writing a whole new programming language and engine, and that's why his games take 6 years to make. No: his games would take 20 years to make if they weren't intellectually engaging to make. They wouldn't be made at all!
It's the same energy as rewriting everything in another programming language.
I think this happens at giant companies too, all the time. So called Not Invented Here syndrome: it's as much about laundering open source code as it is about keeping things interesting enough to make the extreme boredom and grind worth it for otherwise smart and healthy people.
The sandbox is interpreting RISC-V, just very quickly. It's platform independent. I am actually making a game with a derivative of this repo, and in the server I am building the programs at the same time as the server. Once the server starts, it loads all the programs, and then sends compressed programs to each client, so that everyone who connects has the same scripts. Easy and convenient, but most likely just that because I did it from the start.
Man, it's so frustrating as an author to explain in the linked article why you didn't just use existing technology X, only to be met by a flood of HN comments asking "but why didn't you just use existing technology X?"
This is an interesting concept, getting the gameplay logic sent to the client this way. I guess it only works for simpler games so far, do you have any code examples?
The game is quite complex actually, but the script is not doing overmuch right now. The script is doing things that make sense for a growing modding API, while the engine still does the brunt of the work. That said, there are functions that end up being called billions of times simply because you want that flexibility, and that's where the low-latency script pays off.
It is indeed true that most VMs, including JavaScript, WebAssembly, the JVM, etc., have significant latency on the boundary. For example, many wasm VMs will install a signal handler, and that needs to be enabled/disabled on each entry/exit from the VM.
But the latency can be fixed. You can sandbox WebAssembly without a signal handler, in particular (at the cost of a few % throughput overhead for bounds checks). I'm not sure if the author benchmarked that, but it should be very fast.
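The "no signal handler" approach boils down to an explicit software bounds check on every guest memory access. A minimal sketch of the idea (class and method names are made up for illustration):

```python
# Every guest load/store is checked against the linear-memory size before
# touching the backing buffer, instead of relying on page faults + signals.

class LinearMemory:
    def __init__(self, size):
        self.size = size
        self.buf = bytearray(size)

    def load_u32(self, addr):
        if addr + 4 > self.size:          # the explicit check costing a few %
            raise RuntimeError("trap: out-of-bounds memory access")
        return int.from_bytes(self.buf[addr:addr + 4], "little")

mem = LinearMemory(64 * 1024)
mem.buf[0:4] = (42).to_bytes(4, "little")
print(mem.load_u32(0))  # -> 42
```

Entering and leaving the sandbox then involves no kernel interaction at all, which is where the call-latency win comes from.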
(As for "why RISC-V" in the project, it looks like that's because it's easy to write an interpreter for, but it could have been any compiler target, it seems.)
I think there are a couple of reasons for picking RISC-V over others.
It is the youngest of the lot, meaning it is carrying around the least legacy compatibility stuff while still incorporating a lot of modern design lessons.
Additionally, modern ARM and MIPS are not free, and (potentially?) patent-encumbered as well.
You could go with some old, free MIPS or ARM, but then how good is performance, and how good is compiler support for these nowadays?
There is no reason I can think of why a Wasm interpreter would have higher overhead for arguments in and out. Maybe I'm missing something?
Wasm was designed to have very fast calls, both internally and externally. That's why the basic types correspond directly to machine types. The goal is to have 0 overhead, except for sandboxing.
Regarding sandboxing, there is the stack overflow check that happens in Wasm on each call. But you can disable that check if you don't want it.
Btw, you can see that Wasm has no extra overhead on calls in a simple way: compile Wasm using wasm2c and inspect the C output. That C is what a Wasm VM would run. Some details:
Note the section there about LTO: The Wasm can even be inlined into the code calling into it, which means there is zero call overhead. (But even without LTO, it's just another C function to call.)
That's definitely a cool project, but I'm only interested in interpreting right now. I am sharing programs with clients, as I wrote in another thread. Not interested in producing or linking native code. I already have a binary translation mode, and I could probably expand it to avoid the interpreter loop to save 2ns, but I just don't have a need. Right now, everything is perfect for my use case.
It's cool that you can inline sandboxed functions like that. I'm curious about how you arbitrarily select and call functions though? A string-lookup kind of thing?
The answer lies in the reason why you would use an embedded scripting language/environment at all, over just loading native code plugins – sandboxing, fault isolation and such.
RISC-V seems to be used here more as a bytecode format of sorts, and at least compared to x86_64 it should be much easier to implement (first of all because all of the opcodes have the same size.)
Using a sandboxed bytecode VM has merit, but I would naively expect wasm to be far better at it since RISC-V was designed for hardware and wasm was designed for exactly this usecase (sandboxed code running in another application with good performance and isolation).
(Although... I do see value in having a second option. I guess it might be worth seeing if it can actually be better, even though it's not the thing I would have written)
I mean sure, but virtualization is a very robust (and much lower overhead) technology that is readily available (and supported by the processor). If you just want isolation, why wouldn't you use that? You still have to ship a compiled binary after all.
No. As the author of TinyKVM, the overheads of entering and leaving KVM are not of the same order of magnitude as SFI. TinyKVM requires 2-3 microseconds to enter and leave, which is the lowest I ever managed to get while still being a robust and dependable sandbox. For some, that might seem really low, but keep in mind that libriscv, which is the backbone of RVScript, has 3 nanoseconds of overhead.
RISC-V is quite fast as far as whole-platform emulation goes, but yes, using it for plugin code is a bit weird; WASM would seem to be preferable for that use case. The interfacing story of WASM is not quite perfect (you're limited to a C-like baseline API/ABI, since the WASM component model standard is not finished yet), but it's not like RISC-V is any better.
> Lua, Luau and even LuaJIT have fairly substantial overheads when making function calls into the script, especially when many arguments are involved. The same is true for WebAssembly emulators that I have measured, eg. wasmtime.
At large scale, game engines have relatively unsatisfying answers to "what's the best way to iterate quickly on the game while supporting all the features we need".
If the scope is small, the answer is easy: write some naive code, incrementally profile. The project isn't big so full rebuild times can stay light, and your team isn't large so you really can "do whatever" and get somewhere, especially if you take the route of writing in a native language that compiles fast - a Pascal or something more "new and hip" like Beef.
But when you are asked to do it on an AAA project you end up with every imaginable kind of feature: it needs to support a team of hundreds, it needs to be fast on console hardware, it needs to be straightforward to debug, it needs to support modding, it needs to be fast to iterate on, it needs to be flexible about the memory layouts of save data or assets.
So you do end up in this kind of space where you're like, "we'll target a VM that is really low level, and that'll reduce the friction and granularity of switching between debug and release profiles, and that lets us stay in control of when we want to sandbox and when we want to go fast and we'll still be able to control every byte". That it happens to be RISC-V is not super relevant - it could be WASM or a custom bytecode like the Hashlink target in Haxe. It just needs to be a thing to compile to, that you can reasonably expect to implement and maintain.
It's not the only approach that could be taken. Downscoping the ambition by 50% and hardcoding a little more of your spec is a good way to get through the technical stuff 10x faster.
As someone who's working on a game engine that's designed to be highly data-driven, with only the core game engine itself in a systems programming language and a lot of calls out to scripts, this is very intriguing! C++20 doesn't seem like a good candidate for an actual scripting language, so I'd have to find a suitable scripting language that can compile to RISC-V, though, and of course requiring precompilation is an issue.
Yes, precompilation is not optional. I've used many languages in sandboxes over the years developing both this and other emulators.
I really like Nim the most. It's Python spiritually, but with types. Types are just really, really necessary, and it does have FFI support, so you can forward arguments either from a wrapper or use {.cdecl.} directly. It's still not a barneskirenn (a "children's ski race", as we say in Norway, i.e. a walk in the park), but definitely the most fun I've had using another language in my various sandboxes.
Nelua and Zig are close seconds. Both are just so easy to use with a C-based FFI. Nelua is probably not safe to use, as it's still a work in progress, at least last I checked. Zig is very much ready to use. Maybe not your cup of tea, though.
Golang has a complex run-time and I don't recommend using it. It's definitely possible, though, as my sandbox does have a full MMU. Just expect integration to be long and arduous. And the ABI potentially changes every version.
Rust is one of the easier ones. I didn't find it fun to work with though. It has by far the best inline assembly, but too much fighting with the compiler. I know that people love Rust, and it is fully supported in my emulator.
C/C++ has the benefit of having the ability to have its underlying functions overridden by native helper system calls, e.g. replacing memcpy() with a system call that has native performance. It's too long a topic to talk about here, but Nim and Nelua also fall under this umbrella, along with other languages that can compile to C/C++.
Kotlin is definitely possible to use. It's a bit hard to understand how the native stuff actually works, but I did manage to run a hello world program in several sandboxes with some effort. I'm not 100% sure, but I think I managed to convert a C API header directly to something Kotlin understands using a one-liner in the terminal. I would say it scores high just on that. Again, it's just a bit hard to understand how to talk to Kotlin from an external FFI looking in.
JavaScript is of course possible with both jitless v8 and QuickJS. v8 requires writing some C++ scaffolding + host API functions, and QuickJS requires the same in C. Either works.
And I think that's all the languages I have tried. If you think there's a missing language here, then I would love to try it!
This is a really interesting list, thank you for replying!
(the following are entirely undirected musings on using your technology for my game engine, mostly in case anyone finds them interesting or has something to suggest that I haven't thought of since I'm new at this)
At least for my use case C, Zig, Rust, and even probably Nim are out of the question because I'm not just using scripting for sandboxing capabilities and such, I'm also using it because I want people to be able to program games and write mods in a high-level language, so using a systems programming language as my scripting language kind of feels like it defeats part of the purpose and I might as well just use dynamic linking with a C ABI or something crazy like that (I'm new to this so excuse me if that's nonsense lol). JavaScript and Kotlin are intriguing though, because they aren't systems programming languages, so I'll have to think long and hard about those;
I've been considering C# for scripting lately (because I like it well enough, it's widespread, and it's known in the gaming world), and I wonder: if you can get natively AOT-compiled Kotlin to work, could you get natively AOT-compiled C# to work too? There would probably be similar complications due to the large and complex runtime, but I know you can strip it down, so I wonder how that might play in. I also wonder what the performance would be like compared to just hosting the dotnet runtime, which is what I was intending to do before I saw your post. Maybe at some point I should set up a benchmark to compare! Although I have to say the pre-compilation thing is a bit of a deal-breaker for me, since I really want people to be able to put plain-text things directly in the game folder and see those changes in the engine without having to do any kind of build process. That is achievable with the regular dotnet runtime, where a precompiled assembly bootstraps the rest of the scripts using dynamic assembly compilation and loading to collect everything else in the script directory.
Hm, if you don't actually need a sandbox then I think just using the C# run-time makes a lot of sense. I also like C#, but I've never been in a situation to try the fairly new AOT support. Sounds like a good idea, though. C# is a very good language.
Do you mean compiling the .NET runtime (CoreCLR) itself or AOT compiling C# code? Because I recently tried AOT compiled C# and that builds quickly. And you don't generally need to compile the .NET runtime when embedding it because you can just load the builds they ship.
Simply running a DotNET dll. I just tried a basic "Hello World" program compiled to a CIL .dll and to RV64 machine code, on my x86 Linux machine (original ThreadRipper 2990WX) and then running it through different JIT/interpreters. Wall time:
DotNET: 0.061s
qemu-riscv64: 0.006s
Spike: 0.031s
Qemu is JIT with Linux syscall layer built in (in native code).
Spike is a RISC-V interpreter with Linux syscall layer provided by interpreted RISC-V "pk"
So the DotNET JIT has quite a high overhead for startup, or one-time code.
DotNET beats emulated RISC-V here, but the RISC-V emulator (with RISC-V code compiled with -O1) is pretty much as fast as a lazy person compiling C to native x86 gets.
Good to know. That solidifies my plan to embed the dotnet runtime in my engine and load JIT code, not compile C# to AOT and then run that through an emulation layer
Well, RISC-V is gaining momentum, and a transition apparatus from x86_64 is taking shape from many individuals wishing RISC-V to be a success.
BTW, if anybody knows about milk-v duo (the one with the SOC free from arm cores) resellers in Europe I can pay without a credit card and I can contact with a self-hosted email... thank you.
I did remove ELF, period. I use a 'json/wayland'-like format, namely excruciatingly simple for modern CPUs. Additionally, there is no Rust or any other compiler-based language here; it is assembly, x86_64 and RISC-V assembly (I even use an x86_64 Intel syntax which assembles on 4 assemblers).
When I had the time to start thinking about how to do that, I was hit hard by the LR/SC emulation/interpretation issue. Those LR/SC pairs are extremely dangerous and can livelock easily, and understanding the theoretical puzzle of the spec which seems to try to fix that is meh (it ought to be extremely rigorous about the theoretical prerequisites, as this is so critical, and it does not seem to be): "the eventual success of store-conditional instructions". The theoretical puzzle I am trying to figure out seems to miss a rigorous definition of the "failing of store-conditional instructions".
In the end, it seems proper emulation on x86_64 will be very slow, as each store/load will have to go through some expensive code in order to deal with LR/SC. But that's fine, as I am more interested in running RISC-V code correctly than running it extremely quickly (Intel and AMD have nastily failed at providing LR/SC-semantics hardware instructions on x86_64; see the Intel TSX debacle). I may implement a full virtual memory controller; a significant performance penalty will happen for the sake of correctness.
When writing a WebAssembly interpreter I was surprised at how it didn't map cleanly to the underlying hardware. The interpreter had to keep track of the types of everything. I can definitely see how a RISC-V-based solution would be faster.
I’m curious what you’re talking about—WASM has different instructions for each of its different types. There are separate instructions for i32, i64, f32, and f64 in the base spec.
This seems like a pretty clean map to me: most architectures these days have native support for 32-bit and 64-bit types. You only get 16-bit and 8-bit support with SIMD or load/store, much like how WASM does it.
He might be talking about having to implement a register allocator/file in order to interpret stack machines faster. Something you get from the compilers on register architectures like RISC-V, ARM and x86.
wasmtime uses cranelift, which has a complex register allocator. wasm3 uses a simple register file.
> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach.
Only for binary translation or just-in-time compilation (basically producing native code). If you're interpreting RISC-V you can just pretend the registers are an array of register-sized integers. Which is what they are.
1. Check the type at the top of the stack to make sure it's i32 and trap otherwise
2. Pop the top of the stack
3. Check the top of the stack and make sure it's i32 and trap otherwise
4. Pop the top of the stack.
5. Add the two values together
6. Push the result to the top of the stack along with its type.
Compare this to ADD from RISC-V where you can just add the contents of 2 source registers and store it in a destination register. There is not a bunch of type book keeping or alignment that you have to worry about.
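The six steps above, written out as a dynamically checked stack machine next to the single register-machine operation, look something like this (an illustrative sketch, not any particular VM's implementation):

```python
def stack_add(stack):
    """Naive i32.add with per-step type checks (steps 1-6 above)."""
    ty, b = stack.pop()                          # steps 1-2
    if ty != "i32":
        raise RuntimeError("trap: expected i32")
    ty, a = stack.pop()                          # steps 3-4
    if ty != "i32":
        raise RuntimeError("trap: expected i32")
    stack.append(("i32", (a + b) & 0xFFFFFFFF))  # steps 5-6

def register_add(regs, rd, rs1, rs2):
    """RISC-V ADD: no runtime type bookkeeping at all."""
    regs[rd] = (regs[rs1] + regs[rs2]) & 0xFFFFFFFF

stack = [("i32", 2), ("i32", 3)]
stack_add(stack)
print(stack)  # -> [('i32', 5)]
```

The register version is a single array read-read-write; the naive stack version pays for two pops, two type checks, and a tagged push on every add.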
Most of this can be done statically. You do the bookkeeping once. There are various approaches—you can try to allocate registers, or you can emit code that loads/stores on the stack, or you can just run an interpreter and remove the safety checks from it.
The stack isn’t dynamically typed, so it doesn’t make sense to trap.
Looking into it further, the type checking is supposed to be done in a dedicated validation step, and outside of validation, using instructions with the wrong type of data seems to work.
WASM is a stack-machine, so the input code needs to be validated to make sure that the same types gets pop'ed as has previously been push'ed. But that can be done ahead of execution.
You'd need to validate function types during runtime though, which is something that a CPU emulator does not need to do.
WASM is kind of a stack machine, yes, but there are a lot of constraints on the stack machine that are designed to make it easy to translate to registers.
If you squint and look sideways, you can think of it as a flattened tree, rather than a stack machine.
I'm not familiar with the details, and don't have the references to hand, but WASM has been designed in a way that reflects the design choices of Chrome's Javascript engine rather than optimizing standalone compilation.
Part of optimizing for the design choices of browser engines is to make it easy to emit machine code. The browsers have multiple compilers in them, and one of the compilers has the task of emitting simple machine code quickly, so the code can run immediately (reducing start-up time). In V8, this initial compiler is called Ignition.
Long time ago I've used paxCompiler scripting engine. It was made for Delphi / FreePascal and supported Delphi input language. Rather than being byte code interpreter / JIT it actually generated native code in RAM hence the overhead of calling it was the same as calling regular function. It had lots of other super nice features but that is all in the past anyways.
I am curious why such approach is not used for other scripting projects.
> Rather than being byte code interpreter / JIT it actually generated native code in RAM hence the overhead of calling it was the same as calling regular function.
I'm not sure what you mean. What you've just described is a JIT compiler.
Not in a way V8 does it for example. I would call it AOT (ahead of time). Normally you first compile and initialize scripts and then you can call into their functions and the other way around. But sure you can also compile and execute function in one step so be my guest and call it whatever you want.
Is there a benchmark available where you compared Lua and wasmtime against your interpreter? It would be interesting for Lua and wasmtime developers to chime in on why the world switch overhead is higher and whether there exist configuration settings or APIs that eliminate the overhead for your use case.
I'm curious if there is any fundamental reason why they have to be slower and I suspect the answer is "no".
You're probably right that things can be improved. But my implementation has the aforementioned latencies right now.
For the benchmarks I did my best writing good Lua, but I am not a pro. Call overhead is measured first and then subtracted out of every benchmark. So that extra overhead is a permanent feature of these implementations. I also noticed that most of my script functions have at least ~3-4 arguments, which seems like it would be quite costly. Just look at the 8x-arg benchmark and divide by 2, I guess. Another 100ns right there. A baseline of 150ns on a micro-benchmark is going to become at least a microsecond in prod when everything is random and semi-cold.
I’ve thought of doing something similar myself. Some of the advantages as I see them are language agnosticism, simple and easy to read script bitcode making a debug interface easier to write and maintain. By having the script use the same language as the rest of the game, moving functions between static and dynamic compilation can help with iteration and performance.
No, it says that that latency of calling into the VM can be lower in a simple interpreter compared to an optimizing VM (which might need to do things on the call boundary).