At Trail of Bits, we are creating a new compiler front/middle end for Clang called VAST [1]. It consumes Clang ASTs and creates a high-level, information-rich MLIR dialect. Then, we progressively lower it through various other dialects, eventually down to the LLVM dialect in MLIR, which can be translated directly to LLVM IR.
Our goals with this pipeline are to enable static analyses that can choose the right abstraction level(s) for the task at hand, and, using provenance, to cross abstraction levels and relate results back to source code.
Neither Clang ASTs nor LLVM IR alone meet our needs for static analysis. Clang ASTs are too verbose and lack explicit representations for implicit behaviours in C++. LLVM IR isn't really "one IR," it's two IRs (LLVM proper, and metadata), where LLVM proper is an unspecified family of dialects (-O0, -O1, -O2, -O3, then all the arch-specific stuff). LLVM IR also isn't easy to relate to source, even in the presence of maximal debug information. The Clang codegen process performs ABI-specific lowering: it takes high-level types/values and transforms them to be more amenable to storing in target-CPU locations (e.g. registers). This actively works against relating information across levels, something that we want to solve with intermediate MLIR dialects.
Beyond our static analysis goals, I think an MLIR-based setup will be a key enabler of library-aware compiler optimizations. Right now, library-aware optimizations are challenging because Clang ASTs are hard to mutate, and by the time things are in LLVM IR, the abstraction boundaries provided by libraries are broken down by optimizations (e.g. inlining, specialization, folding), forcing optimization passes to reckon with the mechanics of how libraries are implemented.
We're very excited about MLIR, and we're pushing full steam ahead with VAST. MLIR is a technology that we can use to fix a lot of issues in Clang/LLVM that hinder really good static analysis.
These are great questions! We think our approaches are complementary. We think that CIR, or ClangIR, can be a codegen target from a combination of our high-level IR and our medium-level IR dialects.
Our understanding of ClangIR is that it has side-stepped the problem of trying to relate/map high-level values/types to low-level values/types -- a process which Clang's codegen performs when generating LLVM IR. We care about explicitly representing this mapping so that there is data flow from a low-level (e.g. LLVM) representation all the way back up to a high-level one. There's a lot of value and implied semantics in high-level representations that is lost to Clang's codegen, and thus to the ClangIR codegen. The distinction between `time_t` and `int` is an example of this. We would like to be able to see an `i32` in LLVM and follow it back to a `time_t` in our high-level dialect. This is not a problem that ClangIR sets out to solve. Thus, ClangIR is too low-level to achieve some of our goals, but it is also at the right level to achieve some of our other goals.
> LLVM dialect in MLIR, which can be translated directly to MLIR
Should be:
LLVM dialect in MLIR, which can be translated directly to LLVM IR
Otherwise, great project! We're also using MLIR internally and it's been awesome, game-changing even when considering how much can be accomplished with a reasonable amount of effort.
I think the next big problems for MLIR to address are things like metadata/location maintenance when integrating with third-party dialects and transformations. With LLVM optimizations, getting the optimization right has always seemed like the top priority, and getting metadata propagation working has come in a distant second.
I think the opportunity with MLIR is that metadata/location info can be the old nodes or other dialects. In our work, we want a tower/progression of IRs, and we want them simultaneously in memory, all living together. You could think of the debug metadata for a lower level dialect being the higher level dialect. This is why I sometimes think about LLVM IR as really being two IRs: LLVM "code" and metadata nodes. Metadata nodes in LLVM IR can represent arbitrary structures, but lack concrete checks/balances. MLIR fixes this by unifying the representations, bringing in structure while retaining flexibility.
Funny that there was no mention of GCC, since its IR was probably one of the first that many people encountered IRL. If I remember correctly, one motivation for Clang/LLVM was that GCC's IR was so bad.
I knew people who wrote backends for GCC, and they pretty much all agreed it was a nightmare.
I think advanced Fortran compilers already used IRs by the late '60s / early '70s.
Most compilers in the '80s already used some form of non-AST IR: generalised portable instruction sets, CFG representations, etc. I'd say that back then people were experimenting with various forms of IRs for optimization purposes.
By the '90s it was typical for compilers to have several IRs. GCC belongs to this compiler generation.
The 2000s are when IRs standardised around 2-3 SSA-like forms, which made it possible to do most optimisation work at that level alone.
TBH, I just speculated. I remember '60s IBM research publications on CFGs and optimizations against them, and was pretty sure these things are very hard to do on an AST or in those simpler line-by-line compilers popular back then.
In fact, even the original Fortran compiler was quite involved, complete with numerous optimisations and sophisticated register allocation.
I have an irrational dislike of SPIR-V. On the flip side of the coin I think MLIR is a work of genius — especially as a springboard for ideas in developing custom IR.
The open source vs proprietary decision is usually taken based on exactly this difference.
The sole reason for many companies to upstream some of their stuff is that they do not have to keep up with the firehose of upstream changes on their own.
Some companies might be in the position to make that choice, but I'd guess most aren't really equipped for it. First of all, there's the issue of getting your changes upstream: you have to convince the community it's worth their time to help maintain your code, plus there's a quality bar you have to meet. And then once it's upstream you're committed to drinking from the firehose, instead of updating at your own pace, and there's extra scrutiny on every change you make.
I really think most of these proprietary users are just figuring things out as they go, and doing that upstream 1. wouldn't be accepted, and 2. wouldn't make sense for them in the first place. But I'll admit that's speculation, I don't have intimate knowledge of what everyone's doing :)
I … don’t know? Originally, it just felt like some janky subset of LLVM-IR: close enough to almost work with the standard tooling, right out of the box, but not quite. It’s been nearly 10 years since I’ve looked at it, though, so what do I know? I don’t even know if they kept the design I was shown.
SPIR-V is a lot more like minified JavaScript or JVM bytecode than it is like x86 code, at least. In some cases the games' shaders haven't had any optimisation done and you can more or less fully reconstruct the original GLSL, albeit without comments or identifiers.
Also, a lot of textual shaders were already unreadable generated code. But for the ones that weren't, it's definitely a loss for those of us trying to peek at the internals of games, indeed :)
MLIR sounds interesting, but it looks like it's totally dependent on C++?
One of the problems the article points out is how much compiler infrastructure gets re-made. But a big reason is that if you are doing Python, for example, you don't wanna touch C++.
In fact, nobody wants to touch C++, except (maybe!) C++ devs :)
[1] https://github.com/trailofbits/vast