
LLVM internals, part 1: the bitcode format

Jul 19, 2021     Tags: llvm, rust     Series: llvm-internals



I’ve done a couple of posts on LLVM itself, mostly on things you can do with LLVM or how LLVM represents particular program features.

I’ve received some good feedback on these, but I’d like to focus a sub-series of posts on LLVM’s implementation itself: the file formats, parsing strategies, and algorithmic choices underlying LLVM’s public interfaces (APIs, CLIs, and consumable output files). I’ll be writing these posts as I work on a series of pure-Rust libraries for ingesting LLVM’s intermediate representation, with the end goal of being able to perform read-only analyses of programs compiled to LLVM IR in pure Rust.

For those who don’t know what LLVM is, this post has a broader background on LLVM’s components and intermediate representation.

Quis repraesentābit ipsos repraesentātiōnēs?1

LLVM’s intermediate representation is not just an in-memory series of changes to a program as it passes through compilation phases: it’s a stable model of the program that can be saved to disk and subsequently revivified for analyses that run in entirely new invocations of clang or other LLVM tools2.

So: how does LLVM faithfully preserve its intermediate representation on disk?

The answer is twofold: LLVM provides both a textual form that looks roughly like a C-family programming language, and a binary form (bitcode, in LLVM’s parlance) that’s denser and (ostensibly) faster to parse. The two forms are semantically identical, and trivial to produce and convert between:

# create foo.bc for foo.c
$ clang -emit-llvm -c foo.c
$ file foo.bc
foo.bc: LLVM IR bitcode

# convert foo.bc into foo.ll (human readable IR)
$ llvm-dis foo.bc
$ file foo.ll
foo.ll: ASCII text, with very long lines

# create foo.ll directly from foo.c
$ clang -S -emit-llvm foo.c

The textual form of LLVM’s IR is fun to read, but it’s “just” text: slow and (perhaps) annoying to parse with a machine, but easy for in-the-know humans to comprehend.

By contrast the bitcode form is remarkably inscrutable, even by binary format standards: apart from a few printable strings, there are no obvious fixed-size or length-prefixed structures. In exchange for this inscrutability, it performs admirably at storing LLVM’s intermediate representation in a space-efficient manner. Compare the representations of SQLite’s 3.36.0 amalgamated release:

sqlite3.c    sqlite3.ll    sqlite3.bc
8.0MB        14.0MB        1.8MB

An order of magnitude smaller isn’t bad34. Let’s find out what we need to do to parse the bitcode format for ourselves.

Parsing LLVM’s bitcode

LLVM is a well-documented project, and the bitcode format is no exception. To summarize:

  1. The bitcode format is really two formats in one: a generic, self-describing container format (the bitstream format), and the LLVM IR-specific records stored within that container (bitcode proper).
  2. Bitstreams are composed of two basic abstractions: blocks, which establish scopes, and records, which hold the actual data.

This all sounds very nicely structured, which raises the question: why is there no visible structure in a bitcode file (when viewed with a hex editor)?

The answer is in the bit of bitcode and bitstream: blocks and records are not aligned on bytes (or larger boundaries like words); they’re encoded as variable-length bitfields. These bitfields can share a byte or span across multiple bytes, giving bitcode files remarkably high entropy when compared to binary objects (visualized with binvis.io):

[Entropy visualizations: sqlite3.bc (1.8MB) vs. sqlite3.o (1.3MB)]

Both blocks and data records are identified by abbreviation IDs, usually shortened to “abbrev IDs” in the LLVM documentation. In true self-specifying fashion, the majority of abbreviation IDs used by a bitcode file are not specified by the container format — they’re defined within the bitstream itself by the appearance of DEFINE_ABBREV records5. Indeed, the container format only specifies 4 extremely basic abbreviation IDs: two for managing block scope (ENTER_SUBBLOCK and END_BLOCK), DEFINE_ABBREV itself, and UNABBREV_RECORD as a general-purpose inefficient “escape hatch” encoding for records6.

These abbreviation IDs (and the abbreviation formats they reference, more on that soon) are tightly scoped to the block that introduces them7, with one exception: bitstream blocks that describe an LLVM module may themselves contain a BLOCKINFO block, which in turn can define multiple abbreviations for any subsequent blocks that match the block IDs specified by the BLOCKINFO.

If all of this indirection and abbreviation wasn't complicated enough, the bitstream container format uses three distinct techniques for determining the length of records and their constituent bitfields:

  1. Fixed-width fields, whose widths are known from context — for example, each block specifies the width of the abbreviation IDs used within it at the point where it's entered.
  2. Variable-width fields (VBRn), which are split into n-bit chunks whose high bit indicates whether another chunk follows.
  3. Explicit lengths: ENTER_SUBBLOCK includes the entered block's length in 32-bit words, and array and blob operands are prefixed with their element counts.

Abbreviations and BLOCKINFO

As mentioned above: every abbreviation has a corresponding abbreviation definition, the location of which in the stream determines its corresponding abbreviation ID.

In other words: abbreviation IDs are never explicitly defined anywhere. They’re implicitly defined as part of the order of appearance of DEFINE_ABBREV records that apply to the current block scope.

So, how do we determine which abbreviations apply to the current block scope? We apply three rules:

  1. Regardless of our current scope, abbreviations 0 through 3 are defined by the bitstream container itself: they’re END_BLOCK, ENTER_SUBBLOCK, DEFINE_ABBREV, and UNABBREV_RECORD, as mentioned above.
  2. Next, starting with an initial counter state of 4, we check the special BLOCKINFO block. The BLOCKINFO (essentially) contains a mapping of block IDs to abbreviation records — we assign each of the records that matches our current scope’s block ID a sequential abbreviation ID.
  3. Finally, we count all DEFINE_ABBREV records present in our current scope. These also receive sequential abbreviation IDs.

These rules can be difficult to wrap your head around in the abstract, so consider the following symbolic representation of an LLVM IR module in the bitstream container format (I've given the DEFINE_ABBREVs symbolic names for comprehension):

    MODULE_BLOCK {
        BLOCKINFO {
            SETBID(MODULE_BLOCK)
            DEFINE_ABBREV(mod_abbrev_1)
            SETBID(FUNCTION_BLOCK)
            DEFINE_ABBREV(func_abbrev_1)
        }

        DEFINE_ABBREV(mod_abbrev_2)

        FUNCTION_BLOCK {
            DEFINE_ABBREV(func_abbrev_2)
            ...
        }
    }

Here, we have three potential block scopes (the top-level MODULE_BLOCK, with a nested BLOCKINFO and FUNCTION_BLOCK). The non-default abbreviation IDs for each are:

  - MODULE_BLOCK: mod_abbrev_1 (#4, via the BLOCKINFO), then mod_abbrev_2 (#5)
  - BLOCKINFO: none — its DEFINE_ABBREVs apply to the blocks named by SETBID, not to BLOCKINFO itself
  - FUNCTION_BLOCK: func_abbrev_1 (#4, via the BLOCKINFO), then func_abbrev_2 (#5)

Observe that there’s some subtlety here: the MODULE_BLOCK is specified as the top-level block and the BLOCKINFO is nested within it, but the BLOCKINFO can (and does in this case) contain abbreviation definitions that we must interpret in order to correctly parse the MODULE_BLOCK. As such, we potentially need to jump forwards in the bitstream to interpret BLOCKINFO before we can correctly interpret our actual starting point10.

I just want to ~~grill~~ interpret IR!

All this text, and we're seemingly no closer to actually getting IR semantics out of a bitcode file generated by LLVM. Whoops.

Everything here is important background for interpreting the bitstream container, but getting actual IR semantics out of it is an entirely separate beast. I intend to dedicate a subsequent post (give it about a month) to just that: taking the primitives described here and using them to build a Rust library that can interpret bitcode modules11. That crate, in turn, will become the substrate of a higher-level crate that exposes pleasant APIs for actually analyzing LLVM IR. All without needing to build or link to your system's build of LLVM!

The first step of this (long) process is already done: I’m open-sourcing an initial version of llvm-bitcursor, which will provide a bit-granular cursor over a bitstream container. It already has support for VBRn decoding, too.

  1. With apologies to Juvenal. 

  2. The easy example of this is LLVM’s implementation of LTO: the “completeness” (w.r.t. information) of any re-ingested LLVM IR is necessary for the correctness of any link-time optimizations. 

  3. Interestingly, applying naive compression (just gzip) narrows the gap significantly: sqlite3.ll.gz is only 1.9MB, while sqlite3.bc.gz is 1.2MB. There are probably reasons why LLVM’s authors went with a relatively complex bitcode format instead of a simpler one (which would also compress even better than the current format), but I’m not privy to them. 

  4. I’ve saved the commands used here. clang is clang-10, version 10.0.0-4ubuntu1. 

  5. This is where LLVM’s bitstream container begins to look a little bit like BER, which is not a pleasant similarity. 

  6. More on each of these (and all of the actual IR-specific ones) in a subsequent post. 

  7. Meaning that neither the parent nor child blocks of a particular block can use that abbreviation. 

  8. One of many places where it’s appropriate to think of an LLVM bitstream interpreter as a cute little stack machine. 

  9. To make this intuitive: if you have a space of values {0..15} with a distribution bias towards the range {0..6}, then you expect the vast majority of your values to fit in a single VBR3 chunk. The ones that don’t fit should be a minority and their slightly larger representation (two chunks instead of one) is negligible. 

  10. From my understanding, anyways. The docs make it explicit that MODULE_BLOCK is the top-level block when interpreting a single LLVM IR module, and that it can contain a BLOCKINFO. There seems to be no restriction on a BLOCKINFO containing SETBID records that target MODULE_BLOCK (block ID #8), which (according to the specification) should apply unconditionally to any matching blocks, including the parent. 

  11. The library is going to take longer than the blog post. 
