At Trail of Bits, we are creating a new compiler front/middle end for Clang called VAST [1]. It consumes Clang ASTs and creates a high-level, information-rich MLIR dialect. Then, we progressively lower it through various other dialects, eventually down to the LLVM dialect in MLIR, which can be translated directly to LLVM IR.
Our goals with this pipeline are to enable static analyses that can choose the right abstraction level(s) for the task at hand, and, using provenance, to cross abstraction levels and relate results back to source code.
Neither Clang ASTs nor LLVM IR alone meet our needs for static analysis. Clang ASTs are too verbose and lack explicit representations for implicit behaviours in C++. LLVM IR isn't really "one IR," it's two IRs (LLVM proper, and metadata), where LLVM proper is an unspecified family of dialects (-O0, -O1, -O2, -O3, then all the arch-specific stuff). LLVM IR also isn't easy to relate to source, even in the presence of maximal debug information. The Clang codegen process performs ABI-specific lowering: it takes high-level types/values and transforms them to be more amenable to storing in target-CPU locations (e.g. registers). This actively works against relating information across levels, something that we want to solve with intermediate MLIR dialects.
Beyond our static analysis goals, I think an MLIR-based setup will be a key enabler of library-aware compiler optimizations. Right now, library-aware optimizations are challenging because Clang ASTs are hard to mutate, and by the time things are in LLVM IR, the abstraction boundaries provided by libraries are broken down by optimizations (e.g. inlining, specialization, folding), forcing optimization passes to reckon with the mechanics of how libraries are implemented.
We're very excited about MLIR, and we're pushing full steam ahead with VAST. MLIR is a technology that we can use to fix a lot of issues in Clang/LLVM that hinder really good static analysis.
These are great questions! We think our approaches are complementary. We think that CIR, or ClangIR, can be a codegen target from a combination of our high-level IR and our medium-level IR dialects.
Our understanding of ClangIR is that it has side-stepped the problem of trying to relate/map high-level values/types to low-level values/types -- a process which Clang's codegen performs when generating LLVM IR. We care about explicitly representing this mapping so that there is data flow from a low-level (e.g. LLVM) representation all the way back up to a high-level one. There's a lot of value and implied semantics in high-level representations that is lost to Clang's codegen, and thus to the ClangIR codegen. The distinction between `time_t` and `int` is an example of this. We would like to be able to see an `i32` in LLVM and follow it back to a `time_t` in our high-level dialect. This is not a problem that ClangIR sets out to solve. Thus, ClangIR is too low-level to achieve some of our goals, but it is also at the right level to achieve some of our other goals.
> LLVM dialect in MLIR, which can be translated directly to MLIR
Should be:
LLVM dialect in MLIR, which can be translated directly to LLVM IR
Otherwise, great project! We're also using MLIR internally and it's been awesome, game-changing even when considering how much can be accomplished with a reasonable amount of effort.
I think the next big problems for MLIR to address are things like metadata/location maintenance when integrating with third-party dialects and transformations. With LLVM optimizations, getting the optimization right has always seemed like the top priority, and getting metadata propagation working has come in a distant second.
I think the opportunity with MLIR is that metadata/location info can be the old nodes or other dialects. In our work, we want a tower/progression of IRs, and we want them simultaneously in memory, all living together. You could think of the debug metadata for a lower level dialect being the higher level dialect. This is why I sometimes think about LLVM IR as really being two IRs: LLVM "code" and metadata nodes. Metadata nodes in LLVM IR can represent arbitrary structures, but lack concrete checks/balances. MLIR fixes this by unifying the representations, bringing in structure while retaining flexibility.
Funny that there was no mention of GCC, since its IR was probably one of the first that many people encountered IRL. If I remember correctly, one motivation for Clang/LLVM was that GCC's IR was so bad.
I knew people who wrote backends for GCC, and they pretty much all agreed it was a nightmare.
I think advanced Fortran compilers already used IRs by the late '60s / early '70s.
Most compilers in the '80s already used some form of non-AST IR: generalised portable instruction sets, CFG representations, etc. I'd say that back then people were experimenting with various forms of IRs for optimization purposes.
By the '90s it was typical for compilers to have several IRs. GCC belongs to this compiler generation.
The 2000s are when IRs standardised around 2-3 SSA-like forms, which made it possible to do most optimisation work at that level alone.
TBH, I just speculated. I remember '60s IBM research publications on CFGs and optimizations against them, and was pretty sure these things are very hard to do on an AST or in those simpler line-by-line compilers popular back then.
In fact, even the original Fortran compiler was quite involved, complete with numerous optimisations and sophisticated register allocation.
I have an irrational dislike of SPIR-V. On the flip side of the coin I think MLIR is a work of genius — especially as a springboard for ideas in developing custom IR.
The open source vs proprietary decision is usually taken based on exactly this difference.
The sole reason for many companies to upstream some of their stuff is that they do not have to keep up with the firehose of upstream changes on their own.
Some companies might be in the position to make that choice, but I'd guess most aren't really equipped for it. First of all, there's the issue of getting your changes upstream: you have to convince the community it's worth their time to help maintain your code, plus there's a quality bar you have to meet. And then once it's upstream you're committed to drinking from the firehose, instead of updating at your own pace, and there's extra scrutiny on every change you make.
I really think most of these proprietary users are just figuring things out as they go, and doing that upstream 1. wouldn't be accepted, and 2. wouldn't make sense for them in the first place. But I'll admit that's speculation, I don't have intimate knowledge of what everyone's doing :)
I … don’t know? Originally, it just felt like some janky subset of LLVM-IR: close enough to almost work with the standard tooling, right out of the box, but not quite. It’s been nearly 10 years since I’ve looked at it, though, so what do I know? I don’t even know if they kept the design I was shown.
SPIR-V is a lot more like minified JavaScript or JVM bytecode than it is like x86 code, at least. In some cases the games' shaders haven't had any optimisation done and you can more or less fully reconstruct the original GLSL, albeit without comments or identifiers.
Also, a lot of textual shaders were already unreadable generated code. But for the ones that weren't, it's definitely a loss for those of us trying to peek at the internals of games, indeed :)
MLIR sounds interesting, but it looks like it's totally dependent on C++?
One of the problems the article points out is how much compiler infrastructure gets re-made. But a big reason is that if you are doing Python, for example, you don't wanna touch C++.
In fact, nobody wants to touch C++, except (maybe!) C++ devs :)
[1] https://github.com/trailofbits/vast