LLVM internals, part 3: from bitcode to IR

The last two posts of this series have focused on the nitty-gritty of LLVM’s bitstream format:

This post marks a turning point: now that we have reasonable abstractions for the bitstream container itself, we can focus on mapping it into a form that resembles LLVM’s IR. We’ll cover some of the critical steps in that process¹ below, introducing a new crate (llvm-mapper) in the process.

Also, critically: this post is the first in the series where our approach to parsing and interpreting LLVM’s bitcode differs significantly from how LLVM’s own codebase does things. The details of the post should still be interesting to anyone who wants to learn the details of how an IR-level LLVM module is constructed, but will not reflect how LLVM itself does that construction².

Step 1: Unrolling the bitstream

1
2
3
4
5
6
7
8
// `from` parses both "raw" bitstreams and ones that have a bitcode wrapper,
// so discard the wrapper as _ if we see one.
let (_, stream) = Bitstream::from(raw_bitcode)?;

for entry in stream {
  // This is a lazy iterator, so each entry's parse can fail.
  println!("{:?}", entry?);
}

This produces a linear sequence of stream entries, each of which can be one of the core bitstream types: a new block, a record, or the end of a block:

1
[BLOCK(n), RECORD, RECORD, BLOCK(k), RECORD, RECORD, END_BLOCK, RECORD, ..., END_BLOCK]

This is easy to emit, but it isn’t easy to consume! Instead, we want a hierarchical structure that reflects the owning relationship between records and blocks, like so:

1
2
3
4
5
6
7
8
9
10
BLOCK(n)
  RECORD
  RECORD
  BLOCK(k)
    RECORD
    RECORD
  END_BLOCK
  RECORD
  ...
END_BLOCK

To get to the latter, we have to “unroll” the bitstream into a hierarchy of blocks and records. More accurately, we have to conform to LLVM’s understanding of the minimal context associated with every IR module:

Oh, and a bitstream can contain multiple IR modules to boot (thanks to llvm-cat⁴ and LTO). So our data flow and model really look like this:

1
2
3
Bitstream -> UnrolledBitstream
UnrolledBitstream -> UnrolledBitcode
UnrolledBitcode -> [BitcodeModule]

The code for actually performing this unrolling is a little too long to paste inline, but you can view it here. To summarize:

The end result: a mapped hierarchy of blocks and records, corresponding to one or more LLVM IR modules.

Step 2: From unrolled blocks to IR models

In the process above, one of our end results is a BitcodeModule for each appropriate sequence of top-level blocks in the bitstream. That BitcodeModule looks something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
pub struct BitcodeModule {
    /// The identification block associated with this module.
    pub identification: Identification,

    /// The module block associated with this module.
    pub module: Module,

    /// The string table associated with this module.
    pub strtab: Strtab,

    /// The symbol table associated with this module, if it has one.
    pub symtab: Option<Symtab>,
}

But wait: where did Identification, Module, &c come from? We’ve skipped a critical step!

As hinted in the reification step during unrolling, we have to map various features in the bitstream into their corresponding IR-level concepts. This process comes with its own subtleties:

Each of these considerations requires significant, dirty-handed engineering, particularly in the context of a pure Rust implementation: we can’t⁸ leverage god-like global context objects the way LLVM’s C++ internals can.

Mapping individual blocks and records

The heart of our mapping abstraction in llvm-mapper is the Mappable trait, which (currently) looks something like this:

1
2
3
4
5
6
7
/// A trait for mapping some raw `T` into a model type.
pub(crate) trait Mappable<T>: Sized {
    type Error;

    /// Attempt to map `T` into `Self` using the given [`MapCtx`](MapCtx).
    fn try_map(raw: &T, ctx: &mut MapCtx) -> Result<Self, Self::Error>;
}

That’s it! All it does is take an input T by reference, and (fallibly) produce the implementing type from it, using the associated error type when the operation fails. The associated MapCtx holds the ugly context details that I mentioned above: copies of the module-level string table, version information, and anything else that might influence mapping decisions⁹.

This abstraction ends up being reasonable for both blocks and records: thanks to unrolling, blocks now contain (and own) all of their constituent records. Records in turn should be independent of their adjacent records, with sparing exceptions handled by the MapCtx.

Because blocks are uniquely identified by their block ID (and all block IDs share the same namespace), we use an additional IrBlock trait with a blanket implementation to strongly enforce that the block ID is checked:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
/// A trait implemented by all blocks that correspond to IR models, allowing them
/// to be mapped into their corresponding model.
pub(crate) trait IrBlock: Sized {
    /// The `IrBlockId` that corresponds to this IR model.
    const BLOCK_ID: IrBlockId;

    /// Attempt to map the given block to the implementing type, returning an error if mapping fails.
    ///
    /// This is an interior trait that shouldn't be used directly.
    fn try_map_inner(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, BlockMapError>;
}

impl<T: IrBlock> Mappable<UnrolledBlock> for T {
    type Error = BlockMapError;

    fn try_map(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, Self::Error> {
        if block.id != BlockId::Ir(T::BLOCK_ID) {
            return Err(BlockMapError::BadBlockMap(format!(
                "can't map {:?} into {:?}",
                block.id,
                Identification::BLOCK_ID
            )));
        }

        IrBlock::try_map_inner(block, ctx)
    }
}

In plain English: individual blocks don’t implement Mappable<UnrolledBlock>. Instead, they implement IrBlock, which ensures that the blanket block ID check is always performed.

The simplest example of this is the IDENTIFICATION_BLOCK, corresponding to the Identification struct that we saw referenced earlier:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
/// Models the `IDENTIFICATION_BLOCK` block.
#[non_exhaustive]
#[derive(Debug)]
pub struct Identification {
    /// The name of the "producer" for this bitcode.
    pub producer: String,
    /// The compatibility epoch.
    pub epoch: u64,
}

impl IrBlock for Identification {
    const BLOCK_ID: IrBlockId = IrBlockId::Identification;

    fn try_map_inner(block: &UnrolledBlock, _ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let producer = {
            let producer = block.one_record(IdentificationCode::ProducerString as u64)?;

            producer.try_string(0)?
        };

        let epoch = {
            let epoch = block.one_record(IdentificationCode::Epoch as u64)?;

            epoch.get_field(0)?
        };

        Ok(Self { producer, epoch })
    }
}

…and rinse and repeat for the dozen plus other blocks that can be present in a valid bitcode file. Plus their internal records, which vary widely in terms of needed context and parsing strategy¹⁰. Correctly mapping each of these (and testing them) is going to be a task on the order of months, if not years. LLVM is big!

Next steps

As of this post, here’s what the various mollusc crates are capable of producing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
$ ./target/debug/examples/unroll-bitstream ./sum.bc

UnrolledBitcode {
    modules: [
        BitcodeModule {
            identification: Identification {
                producer: "LLVM10.0.0",
                epoch: 0,
            },
            module: Module {
                version: 2,
                triple: "x86_64-pc-linux-gnu",
                datalayout: DataLayout {
                    endianness: Little,
                    natural_stack_alignment: Some(Align { byte_align: 16 }),
                    program_address_space: AddressSpace(0),
                    global_variable_address_space: AddressSpace(0),
                    alloca_address_space: AddressSpace(0),
                    type_alignments: TypeAlignSpecs(
                        [
                            TypeAlignSpec {
                                aligned_type: Aggregate,
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                            },
                            /* many more */
                        ],
                    ),
                    pointer_alignments: PointerAlignSpecs(
                        [
                            PointerAlignSpec {
                                address_space: AddressSpace(0),
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                                pointer_size: 64,
                                index_size: 64,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(270),
                                abi_alignment: Align { byte_align: 4 },
                                preferred_alignment: Align { byte_align: 4 },
                                pointer_size: 32,
                                index_size: 32,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(271),
                                abi_alignment: Align { byte_align: 4 },
                                preferred_alignment: Align { byte_align: 4 },
                                pointer_size: 32,
                                index_size: 32,
                            },
                            PointerAlignSpec {
                                address_space: AddressSpace(272),
                                abi_alignment: Align { byte_align: 8 },
                                preferred_alignment: Align { byte_align: 8 },
                                pointer_size: 64,
                                index_size: 64,
                            },
                        ],
                    ),
                    aggregate_alignment: Align { byte_align: 1 },
                    function_pointer_alignment: None,
                    mangling: Some(Elf),
                    native_integer_widths: [ 8, 16, 32, 64 ],
                    non_integral_address_spaces: [],
                },
                asm: [],
                deplibs: [],
                type_table: TypeTable(
                    [
                        Integer(IntegerType { bit_width: 32 }),
                        Function(
                            FunctionType {
                                return_type: Integer(IntegerType { bit_width: 32 }),
                                param_types: [
                                    Integer(IntegerType { bit_width: 32 }),
                                    Integer(IntegerType { bit_width: 32 }),
                                ],
                                is_vararg: false,
                            },
                        ),
                        Pointer(
                            PointerType {
                                pointee: Function(
                                    FunctionType {
                                        return_type: Integer(IntegerType { bit_width: 32 }),
                                        param_types: [
                                            Integer(IntegerType { bit_width: 32 }),
                                            Integer(IntegerType { bit_width: 32 }),
                                        ],
                                        is_vararg: false,
                                    },
                                ),
                                address_space: AddressSpace(0),
                            },
                        ),
                        Pointer(
                            PointerType {
                                pointee: Integer(IntegerType { bit_width: 32 }),
                                address_space: AddressSpace(0),
                            },
                        ),
                        Metadata,
                        Void,
                    ],
                ),
            },
            strtab: Strtab( [ /* lots of bytes */ ], ),
            symtab: Some(Symtab( [ /* lots of bytes */ ])),
        },
    ],
}

That’s a lot, and we still haven’t even gotten to functions, blocks, and instructions yet!

I’m not 100% sure what the next post in this series will look like, or even if there will be one: the path forwards is less about LLVM’s internals and more a series of grueling engineering efforts to bring the various mollusc crates into compatibility with LLVM’s ability to parse its own bitcode. There are lots of details involved in that, but they boil down to the same fundamental mapping steps discussed above.

Longer term, I don’t really have a goal for mollusc. It’s not intended to be a drop-in replacement for various LLVM internals, and it’s not intended to be a replacement for LLVM itself. Maybe it’ll be useful for doing static analysis in pure Rust; I don’t know! To whit: if the work here interests you, please help me with it! But don’t confuse this for the RIIR to top all RIIRs — it is not and will never be a rewrite of LLVM in Rust.

In other news, I’ve been accumulating LLVM patchsets throughout this work. To highlight a couple:

All of these need review, but I’m a complete novice when it comes to the LLVM review process. If any readers would be kind enough to point me in the right direction for getting them reviewed and merged, I would be extremely grateful!

Critically, our process, not LLVM’s. LLVM’s bitcode reader is much lazier (and in some ways, much lower level) than the one I’ve implemented; you can find it here. ↩
Maybe in the future. ↩
I’m still not 100% clear on these: LLVM’s code makes it seem like they were introduced for speeding up LTO, but the STRTAB_BLOCK at the very least seems to now be the standard string table representation. SYMTAB_BLOCK appears to be a serialized representation of a C structure that provides indirect references into the corresponding string table, which I’m sure will be absolutely ~~painful~~ delightful to model in Rust. ↩ ↩²
Which, curiously, is increasingly hard to find references to in LLVM’s documentation. But it’s still installed with my distribution’s build of LLVM. ↩
Why partial? Because we want to make invalid states (e.g., a module that’s missing an IDENTIFICATION_BLOCK) unrepresentable in the final API that users are expected to consume. Rust makes us jump through some extra hoops to do that, but the end result that that every BitcodeModule is correct by construction. ↩
This reification step includes mapping into IR models. More on that in the next section of the post. ↩
Presumably LTO reasons. ↩
Of course, we actually can. But it wouldn’t be very idiomatic of us. ↩
Well, not quite: I haven’t fully figured out whether this is the best abstraction to use, particularly when only a small subset of blocks need context from other blocks. This may change in the future, but any changes will be API-internal ones. ↩
My current woe being LLVM’s type table (i.e., TYPE_BLOCK_ID_NEW). ↩

ENOSUCHBLOG

Programming, philosophy, pedaling.

LLVM internals, part 3: from bitcode to IR

Sep 14, 2021 Tags: llvm, rust Series: llvm-internals

Preword

Step 1: Unrolling the bitstream

Step 2: From unrolled blocks to IR models

Mapping individual blocks and records

Next steps