ENOSUCHBLOG

Programming, philosophy, pedaling.


LLVM internals, part 2: parsing the bitstream

Aug 10, 2021     Tags: llvm, rust     Series: llvm-internals

This post is at least a year old.

Preword

Previously: LLVM internals, part 1.

In the last post, I performed a high-level overview of LLVM’s bitcode format (and underlying bitstream container representation). The end result of that post was a release announcement for llvm-bitcursor, which provides the critical first abstraction(s) for a pure-Rust bitcode parser.

This post will be a more concrete walkthrough of the process of parsing actual LLVM bitstreams, motivated by another release announcement: llvm-bitstream.

Put together, the llvm-bitcursor and llvm-bitstream crates get us two thirds-ish of the way to a pure-Rust parser for LLVM IR. The only remaining major component is a “mapper” from the block and record representations in the bitstream to actual IR-level representations (corresponding to llvm::Module, llvm::Function, &c in the official C++ API).

Recap: the bitstream format

To quickly recap: LLVM’s bitstream format has two kinds of entities: blocks and records. Blocks provide a basic amount of state required to parse their constituent records (and sub-blocks); records are self-contained sequences of fields. At the bitstream level, the fields of a record are just unsigned integers1.

Every of block is identified by a “block ID” that’s globally unique (i.e., globally defined for the kind of bitstream being parsed), while every record is identified by a “record code” that’s only unique to the kind of block that the record is found in2.

Neither blocks nor records are byte-aligned (hence the “bit” in bitstream), and neither is guaranteed to have a consistent size within a given bitstream: the emitter is allowed to vary its use of variable-bit integers for compression purposes, and consumers are expected to handle this.

Every entity in a bitstream (records and blocks!) has an abbreviation ID, which tells the parser exactly how it should go about parsing the entity that immediately follows the ID. The abbreviation ID falls into one of two groups (reserved or defined), which can be modeled with the following pretty Rust enums:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
/// Abbreviation IDs that are reserved by LLVM.
#[repr(u64)]
pub enum ReservedAbbrevId {
    /// Identifies an `END_BLOCK` record.
    EndBlock = 0,
    /// Identifies an `ENTER_SUBBLOCK` record.
    EnterSubBlock,
    /// Identifies a `DEFINE_ABBREV` record.
    DefineAbbrev,
    /// Identifies an `UNABBREV_RECORD` record.
    UnabbrevRecord,
}

/// An abbreviation ID, whether reserved or defined by the stream itself.
pub enum AbbrevId {
    /// A reserved abbreviation ID.
    Reserved(ReservedAbbrevId),
    /// An abbreviation ID that's been defined within the stream.
    Defined(u64),
}

Reserved abbreviation IDs can be thought of as “bootstrapping” markers — they provide the minimal amount of structure required for scoping (ENTER_SUBBLOCK and END_BLOCK) and record definition (DEFINE_ABBREV and UNABBREV_RECORD). DEFINE_ABBREV in particular is where the “bootstrapping” analogy shines through: it signals an “abbreviation definition” record that results in a defined abbreviation ID for a particular scope. More on record formatting and scoping rules in a moment.

Parsing a bitstream

To successfully and correctly parse an LLVM bitstream, we need to handle three discrete aspects of the container format:

  1. The format(s) that individual records can appear in;
  2. The scoping rules for determining how to parse individual records;
  3. The parser state machine and advancement mechanism.

Let’s do each, one-by-one.

Record formats

As mentioned above: all records start with an abbreviation ID. More concretely, all records start with either UNABBREV_RECORD (i.e., ID 3) or a defined abbreviation ID that refers back to a DEFINE_ABBREV record that applies to the scope that the record is in.

UNABBREV_RECORDs are called unabbreviated records, while all others are abbreviated records (since, as above, their format is defined by an abbreviation definition elsewhere in the stream). Unabbreviated records are simpler, so we’ll start with them.

Unabbreviated records

The UNABBREV_RECORD format is an inefficient, general purpose format for contexts where custom record definitions would be overly complicated or would not yield meaningful size improvements (such as within the BLOCKINFO block). Records in the UNABBREV_RECORD format look like this:

1
[code:VBR6, numops:VBR6, op0:VBR6, op1:VBR6, ...]

(Here and below: the :VBR<N> and :<N> suffixes indicate variable-width and fixed-width integers, respectively. See the LLVM documentation and the previous post for more details on VBRs.)

To tie it together:

By way of example, consider the following string representation of an UNABBREV_RECORD:

1
[0b000100, 0b000011, 0b010000, 0b011111, 0b000001, 0b100000]

This defines a new record of code 4 (0b000100) with three fields:

Note that field 2 is 12 bits instead of 6, since the value 32 isn’t representable in a single VBR6 block!

DEFINE_ABBREV and abbreviated records

Now that we know how to parse unabbreviated records, let’s move into the heart of the bitstream: abbreviation definitions and records that are abbreviated using them.

As mentioned in the last post, LLVM’s bitstream container is fundamentally a self-describing format: emitters are strongly encouraged to use DEFINE_ABBREV to produce compact bitstreams instead of relying on the more general, less compact UNABBREV_RECORD encoding.

To parse an abbreviated record, we need to get its corresponding abbreviation definition, i.e. the appropriately scoped DEFINE_ABBREV that defines the structure we’re about to parse. The actual scoping rules for acquiring this DEFINE_ABBREV are described immediately below, so we’ll just assume them for this section.

Once we actually have the DEFINE_ABBREV, it looks like this:

1
[numabbrevops:VBR5, abbrevop0, abbrevop1, ...]

Looks a lot like the UNABBREV_RECORD structure above! Some salient differences:

Parsing the abbreviation operands happens in a loop over numabbrevops, just like with UNABBREV_RECORD. Each operand has the following bit-form:

1
[encoded:1, ...]

…where encoded is a 1-bit field that indicates whether the operand is “encoded”. When encoded is 1, the operand is a “literal” operand of the following form:

1
[litvalue:VBR8]

“Literal” operands are the exception to the conceptual “concrete vs symbolic” rule from earlier3: when a DEFINE_ABBREV has a literal operand, that operand’s record field has the same value in all records that use the same abbreviation. The obvious application of this is record codes, since we expect most records that share an abbreviation to have the same code. However, LLVM doesn’t actually enforce that only the first operand can be literal (or must be literal), so there’s nothing stopping a bitstream emitter from using literals for other record fields.

By contrast, when encoded is 0, the parser elaborates into one of the following forms:

1
2
[encoding:3]
[encoding:3, value:VBR5]

The elaborated form is determined by the value of encoding, chiefly:

Note that encoding forms 0, 6, and 7 are not defined. LLVM could presumably add new encodings in a future release, but current parsers should probably error on encountering these5.

When fully parsed, each DEFINE_ABBREV tells us exactly how to parse any abbreviated record that uses it. For example, consider the following symbolic DEFINE_ABBREV:

1
[DEFINE_ABBREV, 0b00010, 0b1, 0b00001000, 0b0, 0b010, 0b01110]

Broken down:

To understand how this DEFINE_ABBREV relates to an actual abbreviated record, let’s assume that its abbreviation ID is 16 and that the current abbreviation ID width is 6. Then, the following bitstring:

1
[0b010000, 0b00000011111111]

Parses as:

…which amounts to just 20 bits per record in the trivial case, instead of over 24 bits in the UNABBREV_RECORD form7.

DEFINE_ABBREV and abbreviated records are complicated! But most of the complexity is in the concept, not the implementation: under the hood, they use the exact same primitives (VBRs and fixed-width fields) as the rest of the bitstream. The only complexity is in the indirection.

Scoping rules and BLOCKINFO

When talking about abbreviation definitions, I glazed over an important detail: actually mapping a defined abbreviation ID to its corresponding abbreviation definition. This is where the bitstream’s scoping rules come into play.

In short, the abbreviation ID that determines which DEFINE_ABBREV an abbreviated record is mapped to is calculated as follows:

The only other scoping rule is scope closure: the abbreviations (and abbreviation IDs) defined for a block are only applicable in exactly that block — neither its parent nor its children can use the same abbreviation definition list. In other words: each block must compute its own set of effective abbreviation definitions (and their IDs).

Oh, but wait: there’s one slightly ugly exception: BLOCKINFO itself. When in BLOCKINFO, DEFINE_ABBREV does not apply to the BLOCKINFO block itself; instead, it applies to whichever block ID was last set by the special SETBID record. This means that we have to use a little state machine to collect the abbreviation definitions provided by a BLOCKINFO block:

The BLOCKINFO parsing state machine.

(There are other records in the BLOCKINFO block, but SETBID and DEFINE_ABBREV are the only ones currently needed for correct bitstream parsing.)

The bitstream state machine

Now that we know how to interpret individual records and appropriately associate abbreviation definitions with abbreviated record forms, we can think about the overarching task of actually parsing an entire bitstream.

While working on my own bitstream parser, I found it useful to think of the bitstream as a state machine with 3 pieces of top-level internal state:

Given this state, we can define the initialization mechanism for the bitstream parser:

  1. Before our first parse, push an initial scope onto S:

    1
    
     { abbrev_width: 2, block_id: None, blockinfo_block_id: None, abbrevs: [] }
    
  2. Advance C by S.top.abbrev_width bits. The resulting value is the first abbreviation ID, and must be a ENTER_SUBBBLOCK; any other ID represents an invalid state.

    Use the ENTER_SUBBBLOCK format to populate a new scope, and push it onto S:

    1
    
     [blockid:VBR8, newabbrevlen:VBR4, <align32bits>, blocklen:32]
    

    In particular, the new scope contains:

    1
    
     { abbrev_width: newabbrevlen, block_id: Some(blockid), blockinfo_block_id: None, abbrevs: [] }
    

Then, the primary advancement mechanism:

  1. Fill the S.top.abbrevs for the current scope with any abbreviation definitions that match S.top.block_id in B.

  2. If we are in a BLOCKINFO block, use the BLOCKINFO-specific state machine to (further) populate B8.

  3. Advance by S.top.abbrev_width:

    1. If our next entry is a record, parse it with the appropriate strategy (abbreviated or unabbreviated). Go to 3.

    2. If our next entry is a new block (ENTER_SUBBLOCK), populate a new scope and push it onto S as we do in the initialization phase. Go to 1.

    3. If our next entry is a block end (END_BLOCK), pop the current scope top. Go to 3.

Visualized, roughly (ignoring scope state management and BLOCKINFO):

The main bitstream parsing state machine.

Putting it all together

Here’s where I talk about the actual implementation. To the best of my knowledge, it’s only the third LLVM bitstream parser publicly available, and only the second implementation that’s entirely independent from the LLVM codebase (the other being Galois’s llvm-pretty-bc-parser)9.

I’ve done all of the above in a new Rust library: llvm-bitstream.

You can try it out today by building the dump-bitstream example program that it comes with:

1
2
3
$ git clone https://github.com/woodruffw/mollusc && cd mollusc
$ cargo build --examples
$ cargo run --example dump-bitstream some-input.bc

Yields something like this (excerpted):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Entered bitstream; magic: 0xDEC04342
BLOCK 13 {
  RECORD { code: 1, values: [Array([Char6(37), Char6(37), Char6(47), Char6(38), Char6(53), Char6(52), Char6(62), Char6(52), Char6(62), Char6(52)])] }
  RECORD { code: 2, values: [Unsigned(0)] }
}
BLOCK 8 {
  RECORD { code: 1, values: [Unsigned(2)] }
  BLOCK 17 {
    RECORD { code: 1, values: [Unsigned(6)] }
    RECORD { code: 7, values: [Unsigned(32)] }
    RECORD { code: 21, values: [Unsigned(0), Array([Unsigned(0), Unsigned(0), Unsigned(0)])] }
    RECORD { code: 8, values: [Unsigned(1), Unsigned(0)] }
    RECORD { code: 16, values: [] }
    RECORD { code: 8, values: [Unsigned(0), Unsigned(0)] }
    RECORD { code: 2, values: [] }
  }

... snip ...

The output isn’t very pretty or useful at the moment, but it demonstrates full consumption of the stream itself, including internal interpretation of the BLOCKINFO block and abbreviation handling. Those parse details aren’t surfaced to the public API, meaning that users don’t need to worry about them when consuming a bitstream — all blocks and records appear above their extra bit representation, preventing any accidental dependencies on particular layout decisions made by a bitstream emitter. All in only about 600 lines of Rust!

From here, there’s a great deal of work to be done:

Up next: each of the above. Be on the lookout for another post in this series, probably next month!


  1. This isn’t exactly true, since abbreviated records specify higher level semantic and structural information (such as when a particular field is a Char6 instead of a 6-bit fixed-width integer). But this is effectively unnecessary for writing a correct bitstream parser; it only tells us a bit more about the eventual expected record structure. 

  2. Put another way: block ID #8 always refers to the same “kind” of block within a particular bitstream, but record code #4 will refer to a different kind of record depending on the ID of the enclosing block. 

  3. One of several places where LLVM’s bitstream format is conceptually very dense, for arguably little gain. 

  4. Another of these places where the bitstream format is very complicated. Why do array operands “steal” their following operand, instead of using another encoding form that embeds the operand itself? I assume the LLVM developers had a good reason for doing this, but it confused the bejeezus out of me when writing my parser. 

  5. A good example: it’d be great if signed VBRs had their own encoding code, since this would make their use in records actually self-describing rather than a matter of knowing that a particular record code in a particular block happens to use them instead of “normal” VBRs. As-is, the code that will eventually be responsible for mapping to IR objects will need to special-case these. But again, I’m sure there’s a good historical or engineering reason why the bitstream format doesn’t include this. 

  6. Which, again, is context-specific: a code of 8 means something in one block ID, and another thing in another block ID. 

  7. The UNABBREV_RECORD form would be 6 bits for the code, 6 for the operand count, and then multiple 6-bit VBR blocks for each concrete operand (since we can’t use the optimal 14-bit VBR form). In the naive case we’d expect this to be around 24 to 30 bits per record, depending on the value distribution. 

  8. As far as I can tell, there’s no rule against a bitstream having multiple BLOCKINFO blocks. I can’t think of any reason why you’d want that, but it doesn’t seem to be forbidden and has well-ish defined semantics (abbreviations would presumably be inserted in visitation order). 

  9. I would have loved to use this implementation as a cross-reference, but my Haskell is nowhere near good enough. 


Discussions: Reddit