I'm working on something similar, but instead of optimizing for low call latency I'm prioritizing security, execution performance and compilation speed. I'm currently at roughly the same execution speed as wasmtime, but with over 200x faster compilation, and with significantly better security sandboxing.
RISC-V (well, a slightly modified variant) actually makes for a really good VM bytecode!
RV32E is a big part of it since it effectively uses only 13 registers, which makes it possible to write a recompiler that doesn't have to spill any of them to memory on AMD64.
But there are a few other aspects of RISC-V which are suboptimal for a VM like mine, which I handle by postprocessing the standard ELF binaries with a custom linker. Some of the major things are:
- The instruction encoding format. While it's great for hardware execution, for a VM it can be both simplified and made more compact. For example, I encode branches as: a byte with an opcode + a byte with two 4-bit registers packed + a varint with the absolute address to jump to. This is more space efficient, naturally supports fully sized immediates and doesn't require the crazy bit shuffling to extract them.
- The limited range of immediates. Due to the use of varints I support full length immediates, and I macro-fuse LUI + ADDI etc. in my linker into a single ADDI.
- The zero register. This is only a problem because I must support AMD64 as a target, and there just isn't a free register there in which I could park a zero. This can be special-cased in the VM, but it is kinda tricky and annoying (and a lot of these cases are never emitted by the compiler, so you end up with a bunch of complexity and effectively useless dead code). So what I do is disallow the zero register in the bytecode, and when postprocessing the code in my linker I either replace those instructions (e.g. if there's a MUL which uses the zero register as one of the source registers, I compute the result and put an immediate load there instead), get rid of them (e.g. if the instruction uses it as a destination, then it's a NOP), or convert them into new instructions which take any immediate, not only a zero.
- The alignment requirements of code addresses. I don't load any of the code into guest-visible memory (it's effectively a Harvard architecture VM), and I virtualize code addresses so they don't address instructions individually but basic blocks (so e.g. an address of "4" points to the 1st basic block, "8" points to the 2nd, etc.). This has a couple of benefits: it reduces the attack surface (you can't ROP your way around anywhere, only to the start of basic blocks), it is more compact to encode (since there are fewer basic blocks than instructions, the addresses are smaller and you need fewer bits to encode them as varints), and it makes it possible to efficiently emit code for basic-block-based gas metering in a single pass without having to analyze the control flow. Unfortunately I still need to handle the case of "what if someone dynamically jumps to address 3", and there's no way to work around this issue in my linker, so at run time I have to create a jump table (for translating guest addresses into host addresses) that's 4 times bigger than necessary. This wastes memory and introduces extra cache misses. (I can later drop this down to 2x if I add support for the C extension, but that will still make it twice as big as necessary.)
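To make the encoding idea above concrete, here is a minimal sketch of a branch encoded as one opcode byte, one byte packing two 4-bit registers, and a LEB128-style varint for the absolute target address. The opcode value and function names are made up for illustration; they are not the actual format of the project described.

```python
def encode_varint(value):
    """Encode an unsigned integer as a LEB128-style varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_branch(opcode, rs1, rs2, target):
    """opcode: u8; rs1/rs2: register indices 0..15; target: absolute address."""
    assert 0 <= rs1 < 16 and 0 <= rs2 < 16
    return bytes([opcode, (rs1 << 4) | rs2]) + encode_varint(target)

# A branch comparing registers 5 and 6, jumping to basic block address 12,
# fits in just three bytes, with no bit shuffling needed to decode it.
insn = encode_branch(0x20, 5, 6, 12)
print(insn.hex())  # -> "20560c"
```

Decoding is correspondingly simple: read a byte, read a byte, read a varint, with fully sized immediates falling out of the varint for free.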
---------
So my number one wish for RISC-V that'd make it nicer for VMs (and wouldn't be VM specific) would probably be an extension which supports full length immediates and forces the minimum alignment of code addresses to be 1.
Recently I wrote a simplistic RISC-V interpreter, mostly for educational purposes, and a friend benchmarked it against several scripting systems in the context of game-character control routines.
Well, it's quite fast for what it is, though Lua is still somewhat faster, heh.
Very cool! One tip: take the execute segment and produce fast bytecodes instead of interpreting each instruction as a bit pattern. Producing faster bytecodes is something that even WASM emulators do, despite WASM already being a bytecode format.
Thanks! Yeah, this one is supposed to be a very simple implementation for a kind of "how to write a machine code interpreter" tutorial, but I was looking to experiment with some optimizations if time permits.
Patching with pre-baked alt-bytecode was one of the ideas indeed.
The fast bytecode is reduced to a simple operation that excludes certain knowns. For example if you have an instruction that stores A0 = A1 + 0, then knowing the immediate is zero, this can be reduced from reading a complex bit pattern to a bytecode that moves from one register to another, basically MV dst=A0, src=A1.
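That decode-time specialization can be sketched roughly like this; the handler names and register layout here are illustrative, not any particular emulator's internals:

```python
# Decode once, then pick a cheaper handler when operands allow it:
# ADDI rd, rs1, 0 is really just a register move (MV).

def op_mv(regs, dst, src, _imm):
    regs[dst] = regs[src]

def op_addi(regs, dst, src, imm):
    regs[dst] = (regs[src] + imm) & 0xFFFFFFFF  # 32-bit wraparound

def specialize_addi(dst, src, imm):
    """Return (handler, dst, src, imm), chosen once at decode time."""
    if imm == 0:
        return (op_mv, dst, src, 0)   # strength-reduced: no add, no immediate
    return (op_addi, dst, src, imm)

regs = [0] * 32
regs[11] = 1234                                       # A1
handler, dst, src, imm = specialize_addi(10, 11, 0)   # ADDI A0, A1, 0
handler(regs, dst, src, imm)
print(regs[10])  # -> 1234
```

The bit-pattern extraction happens exactly once, at translation time, and the hot loop just dispatches pre-decoded operations.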
This was posted before and I still have no idea what the rationale is. No desktop PC or game console is RISC-V, so if I'm going to all the trouble to use a scripting solution that requires me to compile my scripts to machine language, why would I target RISC-V? Why wouldn't I just compile to x86-64 directly?
Like, what is the vision here, a LuaJIT that targets RISC-V, running inside my x86-64 game, where the emulator translates the RISC-V back into x86-64... for reasons? Just isolation?
Yes. When something's intellectually stimulating you work on it more, that's it.
Most game development is a grind. These huge distractions, like a scripting backend: if you spend 100h working on the scripting engine and only 1h authoring actual scripts, you still spent 1h authoring scripts in those two weeks instead of 0h. The game gets delivered sooner even if you spend 10x as long working on it.
It's a quintessential misunderstanding about indie game development. HN readers think Jonathan Blow is wasting his time writing a whole new programming language and engine, and that's why his games take 6 years to make. No: his games would take 20 years to make if they weren't intellectually engaging to make. They wouldn't be made at all!
It's the same energy as rewriting everything in another programming language.
I think this happens at giant companies too, all the time. So called Not Invented Here syndrome: it's as much about laundering open source code as it is about keeping things interesting enough to make the extreme boredom and grind worth it for otherwise smart and healthy people.
The sandbox is interpreting RISC-V, just very quickly. It's platform independent. I am actually making a game with a derivative of this repo, and in the server I am building the programs at the same time as the server. Once the server starts, it loads all the programs, and then sends compressed programs to each client, so that everyone who connects has the same scripts. Easy and convenient, but most likely just that because I did it from the start.
Man, it's so frustrating as an author to explain in the linked article why you didn't just use existing technology X, only to be met by a flood of HN comments asking "but why didn't you just use existing technology X?"
This is an interesting concept, getting the gameplay logic sent to the client this way. I guess it only works for simpler games so far, do you have any code examples?
The game is quite complex actually, but the script is not doing overmuch right now. The script is doing things that make sense for a growing modding API, while the engine still does the brunt of the work. That said, there are functions that end up being called billions of times simply because you want that flexibility, and that's where the low-latency script pays off.
It is indeed true that most VMs, including JavaScript, WebAssembly, the JVM, etc., have significant latency on the boundary. For example, many wasm VMs will install a signal handler, and that needs to be enabled/disabled on each entry/exit from the VM.
But the latency can be fixed. You can sandbox WebAssembly without a signal handler, in particular (at the cost of a few % throughput overhead for bounds checks). I'm not sure if the author benchmarked that, but it should be very fast.
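The "no signal handler" approach boils down to an explicit software bounds check on every guest memory access. A minimal sketch of the idea (class and method names are made up for illustration):

```python
# Every guest load/store is checked against the linear-memory size before
# touching the backing buffer, instead of relying on page faults + signals.

class LinearMemory:
    def __init__(self, size):
        self.size = size
        self.buf = bytearray(size)

    def load_u32(self, addr):
        if addr + 4 > self.size:          # the explicit check costing a few %
            raise RuntimeError("trap: out-of-bounds memory access")
        return int.from_bytes(self.buf[addr:addr + 4], "little")

mem = LinearMemory(64 * 1024)
mem.buf[0:4] = (42).to_bytes(4, "little")
print(mem.load_u32(0))  # -> 42
```

Entering and leaving the sandbox then involves no kernel interaction at all, which is where the call-latency win comes from.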
(As for "why RISC-V" in the project, it looks like that's because it's easy to write an interpreter for, but it could have been any compiler target, it seems.)
I think there are a couple of reasons for picking RISC-V over others.
It is the youngest of the lot, meaning it is carrying around the least legacy compatibility stuff while still incorporating a lot of modern design lessons.
Additionally, modern ARM and MIPS are not free, and (potentially?) patent-encumbered as well.
You could go with some old, free MIPS or ARM, but then how good is performance, and how good is compiler support for these nowadays?
There is no reason I can think of why a Wasm interpreter would have higher overhead for arguments in and out. Maybe I'm missing something?
Wasm was designed to have very fast calls, both internally and externally. That's why the basic types correspond directly to machine types. The goal is to have 0 overhead, except for sandboxing.
Regarding sandboxing, there is the stack overflow check that happens in Wasm on each call. But you can disable that check if you don't want it.
Btw, you can see that Wasm has no extra overhead on calls in a simple way: compile Wasm using wasm2c and inspect the C output. That C is what a Wasm VM would run. Some details:
Note the section there about LTO: The Wasm can even be inlined into the code calling into it, which means there is zero call overhead. (But even without LTO, it's just another C function to call.)
That's definitely a cool project, but I'm only interested in interpreting right now. I am sharing programs with clients, as I wrote in another thread. Not interested in producing or linking native code. I already have a binary translation mode, and I could probably expand it to avoid the interpreter loop to save 2ns, but I just don't have a need. Right now, everything is perfect for my use case.
It's cool that you can inline sandboxed functions like that. I'm curious about how you arbitrarily select and call functions though? A string-lookup kind of thing?
The answer lies in the reason why you would use an embedded scripting language/environment at all, over just loading native code plugins – sandboxing, fault isolation and such.
RISC-V seems to be used here more as a bytecode format of sorts, and at least compared to x86_64 it should be much easier to implement (first of all because all of the opcodes have the same size.)
Using a sandboxed bytecode VM has merit, but I would naively expect wasm to be far better at it since RISC-V was designed for hardware and wasm was designed for exactly this usecase (sandboxed code running in another application with good performance and isolation).
(Although... I do see value in having a second option. I guess it might be worth seeing if it can actually be better, even though it's not the thing I would have written)
I mean sure, but virtualization is a very robust (and much lower overhead) technology that is readily available (and supported by the processor). If you just want isolation, why wouldn't you use that? You still have to ship a compiled binary after all.
No. As the author of TinyKVM, the overheads of entering and leaving KVM are not of the same order of magnitude as SFI. TinyKVM requires 2-3 microseconds to enter and leave, which is the lowest I ever managed to get while still being a robust and dependable sandbox. For some, that might seem really low, but keep in mind that libriscv, which is the backbone of RVScript, has 3 nanoseconds of overhead.
RISC-V is quite fast as far as whole-platform emulation goes, but yes, using it for plugin code is a bit weird; WASM would seem to be preferable for that use case. The interfacing story of WASM is not quite perfect (you're limited to a C-like baseline API/ABI, since the WASM component model standard is not finished yet), but it's not like RISC-V is any better.
> Lua, Luau and even LuaJIT have fairly substantial overheads when making function calls into the script, especially when many arguments are involved. The same is true for WebAssembly emulators that I have measured, eg. wasmtime.
At large scale, game engines have relatively unsatisfying answers to "what's the best way to iterate quickly on the game while supporting all the features we need".
If the scope is small, the answer is easy: write some naive code, incrementally profile. The project isn't big so full rebuild times can stay light, and your team isn't large so you really can "do whatever" and get somewhere, especially if you take the route of writing in a native language that compiles fast - a Pascal or something more "new and hip" like Beef.
But when you are asked to do it on an AAA project you end up with every imaginable kind of feature: it needs to support a team of hundreds, it needs to be fast on console hardware, it needs to be straightforward to debug, it needs to support modding, it needs to be fast to iterate on, it needs to be flexible about the memory layouts of save data or assets.
So you do end up in this kind of space where you're like, "we'll target a VM that is really low level, and that'll reduce the friction and granularity of switching between debug and release profiles, and that lets us stay in control of when we want to sandbox and when we want to go fast and we'll still be able to control every byte". That it happens to be RISC-V is not super relevant - it could be WASM or a custom bytecode like the Hashlink target in Haxe. It just needs to be a thing to compile to, that you can reasonably expect to implement and maintain.
It's not the only approach that could be taken. Downscoping the ambition by 50% and hardcoding a little more of your spec is a good way to get through the technical stuff 10x faster.
As someone who's working on a game engine that's designed to be highly data-driven, with only the core game engine itself in a systems programming language and a lot of calls out to scripts, this is very intriguing! C++20 doesn't seem like a good candidate for an actual scripting language, so I'd have to find a suitable scripting language that can compile to RISC-V, though, and of course requiring precompilation is an issue.
Yes, precompilation is not optional. I've used many languages in sandboxes over the years developing both this and other emulators.
I really like Nim the most. It's Python spiritually, but with types. Types are just really, really necessary, and it does have FFI support, so you can forward arguments either from a wrapper or use {.cdecl.} directly. It's still not a barneskirenn (a "children's ski race", as we say in Norway, i.e. a walk in the park), but definitely the most fun I've had using another language in my various sandboxes.
Nelua and Zig are close seconds. Both are just so easy to use with a C-based FFI. Nelua is probably not safe to use, as it's still a work in progress, at least last I checked. Zig is very much ready to use. Maybe not your cup of tea, though.
Golang has a complex run-time and I don't recommend using it. It's definitely possible, though, as my sandbox does have a full MMU. Just expect integration to be long and arduous. And the ABI potentially changes every version.
Rust is one of the easier ones. I didn't find it fun to work with though. It has by far the best inline assembly, but too much fighting with the compiler. I know that people love Rust, and it is fully supported in my emulator.
C/C++ has the benefit of having the ability to have its underlying functions overridden by native helper system calls, e.g. replacing memcpy() with a system call that has native performance. It's too long a topic to talk about here, but Nim and Nelua also fall under this umbrella, along with other languages that can compile to C/C++.
Kotlin is definitely possible to use. It's a bit hard to understand how the native stuff actually works, but I did manage to run a hello world program in several sandboxes with some effort. I'm not 100% sure, but I think I managed to convert a C API header directly to something Kotlin understands using a one-liner in the terminal. I would say it scores high just on that. Again, it's just a bit hard to understand how to talk to Kotlin from an external FFI looking in.
JavaScript is of course possible with both jitless v8 and QuickJS. v8 requires writing some C++ scaffolding + host API functions, and QuickJS requires the same in C. Either works.
And I think that's all the languages I have tried. If you think there's a missing language here, then I would love to try it!
This is a really interesting list, thank you for replying!
(the following are entirely undirected musings on using your technology for my game engine, mostly in case anyone finds them interesting or has something to suggest that I haven't thought of since I'm new at this)
At least for my use case C, Zig, Rust, and even probably Nim are out of the question because I'm not just using scripting for sandboxing capabilities and such, I'm also using it because I want people to be able to program games and write mods in a high-level language, so using a systems programming language as my scripting language kind of feels like it defeats part of the purpose and I might as well just use dynamic linking with a C ABI or something crazy like that (I'm new to this so excuse me if that's nonsense lol). JavaScript and Kotlin are intriguing though, because they aren't systems programming languages, so I'll have to think long and hard about those;
I've been considering C# for scripting lately (because I like it well enough, it's widespread, and it's known in the gaming world), and I wonder: if you can get natively AOT-compiled Kotlin to work, could you get natively AOT-compiled C# to work too? There would probably be similar complications due to the large and complex runtime, but I know you can strip it down, so I wonder how that might play in. I also wonder what the performance would be like compared to just hosting the dotnet runtime, which is what I was intending to do before I saw your post. Maybe at some point I should set up a benchmark to compare! Although I have to say the pre-compilation thing is a bit of a deal-breaker for me, since I really want people to be able to put plain-text things directly in the game folder and see those changes in the engine without having to do any kind of build process. That is achievable with the regular dotnet runtime, where a precompiled assembly bootstraps the rest of the scripts using dynamic assembly compilation and loading to collect everything else in the script directory.
Hm, if you don't actually need a sandbox then I think just using the C# run-time makes a lot of sense. I also like C#, but I've never been in a situation to try the fairly new AOT support. Sounds like a good idea, though. C# is a very good language.
Do you mean compiling the .NET runtime (CoreCLR) itself or AOT compiling C# code? Because I recently tried AOT compiled C# and that builds quickly. And you don't generally need to compile the .NET runtime when embedding it because you can just load the builds they ship.
Simply running a DotNET dll. I just tried a basic "Hello World" program compiled to a CIL .dll and to RV64 machine code, on my x86 Linux machine (original ThreadRipper 2990WX) and then running it through different JIT/interpreters. Wall time:
DotNET: 0.061s
qemu-riscv64: 0.006s
Spike: 0.031s
Qemu is JIT with Linux syscall layer built in (in native code).
Spike is a RISC-V interpreter with Linux syscall layer provided by interpreted RISC-V "pk"
So the DotNET JIT has quite a high overhead for startup, or one-time code.
DotNET beats emulated RISC-V here, but the RISC-V emulator (with RISC-V code compiled with -O1) is pretty much as fast as a lazy person compiling C to native x86 gets.
Good to know. That solidifies my plan to embed the dotnet runtime in my engine and load JIT code, not compile C# to AOT and then run that through an emulation layer
Well, RISC-V is gaining momentum, and a transition apparatus from x86_64 is taking shape from many individuals wishing RISC-V to be a success.
BTW, if anybody knows about milk-v duo (the one with the SOC free from arm cores) resellers in Europe I can pay without a credit card and I can contact with a self-hosted email... thank you.
I did remove ELF, period. I use a 'json/wayland'-like format, namely excruciatingly simple for modern CPUs. Additionally, there is no Rust or any other compiler-based language here; it is assembly, x86_64 and RISC-V assembly (I even use an x86_64 Intel syntax which assembles on 4 assemblers).
When I had the time to start thinking about how to do that, I was hit hard by the LR/SC emulation/interpretation issue. Those LR/SC pairs are extremely dangerous and can livelock easily, and understanding the theoretical puzzle of the spec which seems to try to fix that is meh (it ought to be extremely rigorous about the theoretical prerequisites, as this is so critical, and it does not seem to be): "the eventual success of store-conditional instructions". The theoretical puzzle I am trying to figure out seems to miss a rigorous definition of the "failing of store-conditional instructions".
In the end, it seems proper emulation on x86_64 will be very slow, as each store/load will have to go through some expensive code in order to deal with LR/SC. But that's fine, as I am more interested in running RISC-V code correctly than running it extremely quickly (Intel and AMD have nastily failed at providing LR/SC-semantics hardware instructions on x86_64; see the Intel TSX debacle). I may implement a full virtual memory controller; a significant performance penalty will happen for the sake of correctness.
When writing a WebAssembly interpreter I was surprised at how it didn't map cleanly to the underlying hardware. The interpreter had to keep track of the types of everything. I can definitely see how a RISC-V-based solution would be faster.
I’m curious what you’re talking about—WASM has different instructions for each of its different types. There are separate instructions for i32, i64, f32, and f64 in the base spec.
This seems like a pretty clean map to me: most architectures these days have native support for 32-bit and 64-bit types. You only get 16-bit and 8-bit support with SIMD or load/store, much like how WASM does it.
He might be talking about having to implement a register allocator/file in order to interpret stack machines faster. Something you get from the compilers on register architectures like RISC-V, ARM and x86.
wasmtime uses cranelift, which has a complex register allocator. wasm3 uses a simple register file.
> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach.
Only for binary translation or just-in-time compilation (basically producing native code). If you're interpreting RISC-V you can just pretend the registers are an array of register-sized integers. Which is what they are.
1. Check the type at the top of the stack to make sure it's i32 and trap otherwise
2. Pop the top of the stack
3. Check the top of the stack and make sure it's i32 and trap otherwise
4. Pop the top of the stack.
5. Add the two values together
6. Push the result to the top of the stack along with its type.
Compare this to ADD from RISC-V where you can just add the contents of 2 source registers and store it in a destination register. There is not a bunch of type book keeping or alignment that you have to worry about.
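The six steps above, written out as a dynamically checked stack machine next to the single register-machine operation, look something like this (an illustrative sketch, not any particular VM's implementation):

```python
def stack_add(stack):
    """Naive i32.add with per-step type checks (steps 1-6 above)."""
    ty, b = stack.pop()                          # steps 1-2
    if ty != "i32":
        raise RuntimeError("trap: expected i32")
    ty, a = stack.pop()                          # steps 3-4
    if ty != "i32":
        raise RuntimeError("trap: expected i32")
    stack.append(("i32", (a + b) & 0xFFFFFFFF))  # steps 5-6

def register_add(regs, rd, rs1, rs2):
    """RISC-V ADD: no runtime type bookkeeping at all."""
    regs[rd] = (regs[rs1] + regs[rs2]) & 0xFFFFFFFF

stack = [("i32", 2), ("i32", 3)]
stack_add(stack)
print(stack)  # -> [('i32', 5)]
```

The register version is a single array read-read-write; the naive stack version pays for two pops, two type checks, and a tagged push on every add.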
Most of this can be done statically. You do the bookkeeping once. There are various approaches—you can try to allocate registers, or you can emit code that loads/stores on the stack, or you can just run an interpreter and remove the safety checks from it.
The stack isn’t dynamically typed, so it doesn’t make sense to trap.
Looking into it further, the type checking is supposed to be done in a dedicated validation step, and outside of validation, using instructions with the wrong type of data seems to work.
WASM is a stack-machine, so the input code needs to be validated to make sure that the same types gets pop'ed as has previously been push'ed. But that can be done ahead of execution.
You'd need to validate function types during runtime though, which is something that a CPU emulator does not need to do.
WASM is kind of a stack machine, yes, but there are a lot of constraints on the stack machine that are designed to make it easy to translate to registers.
If you squint and look sideways, you can think of it as a flattened tree, rather than a stack machine.
I'm not familiar with the details, and don't have the references to hand, but WASM has been designed in a way that reflects the design choices of Chrome's Javascript engine rather than optimizing standalone compilation.
Part of optimizing for the design choices of browser engines is to make it easy to emit machine code. The browsers have multiple compilers in them, and one of the compilers has the task of emitting simple machine code quickly, so the code can run immediately (reducing start-up time). In V8, this initial compiler is called Ignition.
Long time ago I've used paxCompiler scripting engine. It was made for Delphi / FreePascal and supported Delphi input language. Rather than being byte code interpreter / JIT it actually generated native code in RAM hence the overhead of calling it was the same as calling regular function. It had lots of other super nice features but that is all in the past anyways.
I am curious why such approach is not used for other scripting projects.
> Rather than being byte code interpreter / JIT it actually generated native code in RAM hence the overhead of calling it was the same as calling regular function.
I'm not sure what you mean. What you've just described is a JIT compiler.
Not in a way V8 does it for example. I would call it AOT (ahead of time). Normally you first compile and initialize scripts and then you can call into their functions and the other way around. But sure you can also compile and execute function in one step so be my guest and call it whatever you want.
Is there a benchmark available where you compared Lua and wasmtime against your interpreter? It would be interesting for Lua and wasmtime developers to chime in on why the world switch overhead is higher and whether there exist configuration settings or APIs that eliminate the overhead for your use case.
I'm curious if there is any fundamental reason why they have to be slower and I suspect the answer is "no".
You're probably right that things can be improved. But my implementation has the aforementioned latencies right now.
For the benchmarks I did my best writing good Lua, but I am not a pro. Call overhead is measured first and then subtracted out of every benchmark. So that extra overhead is a permanent feature of these implementations. I also noticed that most of my script functions have at least ~3-4 arguments, which seems like it would be quite costly. Just look at the 8x-arg benchmark and divide by 2, I guess. Another 100ns right there. A baseline of 150ns on a micro-benchmark is going to become at least a microsecond in prod when everything is random and semi-cold.
I’ve thought of doing something similar myself. Some of the advantages as I see them are language agnosticism, simple and easy to read script bitcode making a debug interface easier to write and maintain. By having the script use the same language as the rest of the game, moving functions between static and dynamic compilation can help with iteration and performance.
No, it says that that latency of calling into the VM can be lower in a simple interpreter compared to an optimizing VM (which might need to do things on the call boundary).