Nov 29, 2021 Tags: llvm, rust Series: llvm-internals
This is the last post I plan to do on parsing LLVM’s bitcode, at least for a while. I’ll keep working on the various libraries that I’ve started under the mollusc umbrella, but they won’t be accompanied by regular posts anymore — I’d like to refocus on writing smaller, less dense (read: lower-pressure) posts for a while.
Also, as a sort of update: all four of the patchsets1 that I mentioned in the previous post have been fully merged into LLVM, including a couple of documentation and bitcode parsing bugfixes. Many thanks to the LLVM maintainers for reviewing and approving my changes!
LLVM has no less2 than three different ways to represent some of the metadata that gets stored in its intermediate representation of a program: metadata, attributes, and intrinsics. All three are represented differently in the bitcode format, and this post will focus only on attributes, but it’s important to understand the semantic difference between the three.
Using this excellent 2016 LLVM developers’ meeting presentation as a reference:
Intrinsics are internal functions whose semantics are well-defined by and well-known to LLVM. Common examples of these include the `@llvm.memcpy.*` family, with semantics similar to the `memcpy(3)` libc routine, as well as various mathematical intrinsics corresponding roughly to libm routines (such as `@llvm.ceil` for `ceil(3)`). Generation of these intrinsics by language frontends (such as `clang`) is vital to LLVM’s ability to perform a wide variety of memory and arithmetic optimizations3, and `clang` will not hesitate to produce them even for unoptimized runs:
```c
int main(void) {
  char x[1024];
  char y[1024];

  if (rand()) {
    memcpy(x, y, 1024);
  } else {
    memcpy(y, x, 1024);
  }

  return 0;
}
```
yields:
```llvm
define dso_local i32 @main() #0 !dbg !9 {
  %1 = alloca i32, align 4
  %2 = alloca [1024 x i8], align 16
  %3 = alloca [1024 x i8], align 16
  store i32 0, i32* %1, align 4
  call void @llvm.dbg.declare(metadata [1024 x i8]* %2, metadata !14, metadata !DIExpression()), !dbg !19
  call void @llvm.dbg.declare(metadata [1024 x i8]* %3, metadata !20, metadata !DIExpression()), !dbg !21
  %4 = call i32 @rand() #4, !dbg !22
  %5 = icmp ne i32 %4, 0, !dbg !22
  br i1 %5, label %6, label %9, !dbg !24

6:                                                ; preds = %0
  %7 = getelementptr inbounds [1024 x i8], [1024 x i8]* %2, i64 0, i64 0, !dbg !25
  %8 = getelementptr inbounds [1024 x i8], [1024 x i8]* %3, i64 0, i64 0, !dbg !25
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %7, i8* align 16 %8, i64 1024, i1 false), !dbg !25
  br label %12, !dbg !27

9:                                                ; preds = %0
  %10 = getelementptr inbounds [1024 x i8], [1024 x i8]* %3, i64 0, i64 0, !dbg !28
  %11 = getelementptr inbounds [1024 x i8], [1024 x i8]* %2, i64 0, i64 0, !dbg !28
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %10, i8* align 16 %11, i64 1024, i1 false), !dbg !28
  br label %12

12:                                               ; preds = %9, %6
  %13 = load i32, i32* %1, align 4, !dbg !30
  ret i32 %13, !dbg !30
}
```
(View it on Godbolt.)
Consequently, LLVM’s intrinsics are correctness-bearing in IR programs: it is not safe, in the general case, to remove calls to intrinsic functions without providing an adequate substitute function4.
Attributes are markers that define special properties on a small subset of program features: functions (and their callsites), individual parameters, and return values.
In the textual representation of LLVM’s IR, attributes are associated with program features using one of two syntaxes. For parameters and return values, attributes appear inline at their definition sites. Using a `memcpy` intrinsic variant as an example:
```llvm
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #3
```
Here, we have the following parameter attributes:

- Both `i8*` parameters are `noalias` and `nocapture`, meaning that (1) memory accesses through each pointer are exclusive, and (2) neither pointer is “captured” within the function, e.g. by storing its value to a global or other memory location that outlives the function call.
- Our first `i8*` is `writeonly` and our second is `readonly`, which enforce precisely what they say on the tin: the function may only write or read through these respective pointers.
- The final `i1` parameter is marked `immarg`, meaning that it must be an immediate value (along with some other constraints): anything that would have to be further loaded or executed to produce a value is invalid.

Similarly for a declaration of `malloc`:
```llvm
declare dso_local noalias align 16 i8* @malloc(i64) #2
```
Here our single parameter has no attributes of its own, but our return value has two: `noalias` and `align 16`.
Our second syntax is also visible in the snippets above: the `#N` anchor ties each function (or function callsite) to a list of attributes that apply to the entire function or callsite. For example, the `memcpy` intrinsic is associated with `#3`, which is:
```llvm
attributes #3 = { argmemonly nofree nounwind willreturn }
```
(We’re getting into the weeds now, but a whirlwind tour: `argmemonly` asserts that the function only accesses memory through the pointer arguments it’s given; `nofree` asserts that the function does not call a `free`-type function on its pointer argument(s); `nounwind` asserts that the function never raises an exception; `willreturn` asserts that the function either contains UB or eventually returns to its callsite in some call stack.)
Each of these, again, has important effects on the program’s correctness, so naively removing them5 is not safe. They also appear on many callsites, e.g.:
```llvm
%5 = call i32 @rand() #4, !dbg !24

; ... snip ...

attributes #4 = { nounwind }
```
Metadata is a lot of things. Among them:
Source-level debugging information that LLVM is responsible for lowering into a format like DWARF or CodeView. Virtually everything that ends up in one of these binary debugging formats corresponds to one or more metadata entries (indicated textually as `!N`) on LLVM instructions, functions, &c. For example, here’s how LLVM relates an IR-level `%struct` back to its original C-level layout with metadata entries:
```c
typedef struct {
  int x;
  char y;

  struct bar {
    long z0;
  } z;
} foo;

int main(void) {
  foo x;
  foo y;
  /* ... snip ... */
}
```
Here, the `x` and `y` declarations get `@llvm.dbg.declare` intrinsics with associated metadata entries (comments added):
```llvm
%1 = alloca i32, align 4
%2 = alloca %struct.foo, align 8 ; 'foo x'
%3 = alloca %struct.foo, align 8 ; 'foo y'
store i32 0, i32* %1, align 4
call void @llvm.dbg.declare(metadata %struct.foo* %2, metadata !12, metadata !DIExpression()), !dbg !24
call void @llvm.dbg.declare(metadata %struct.foo* %3, metadata !25, metadata !DIExpression()), !dbg !26
```
Following `metadata !12`, we can see that it accurately describes both `x` and its type, by way of subsequent metadata entries:
```llvm
!12 = !DILocalVariable(name: "x", scope: !7, file: !8, line: 13, type: !13)
!13 = !DIDerivedType(tag: DW_TAG_typedef, name: "foo", file: !8, line: 10, baseType: !14)
!14 = distinct !DICompositeType(tag: DW_TAG_structure_type, file: !8, line: 4, size: 128, elements: !15)
!15 = !{!16, !17, !19}
!16 = !DIDerivedType(tag: DW_TAG_member, name: "x", scope: !14, file: !8, line: 5, baseType: !11, size: 32)
!17 = !DIDerivedType(tag: DW_TAG_member, name: "y", scope: !14, file: !8, line: 6, baseType: !18, size: 8, offset: 32)
!18 = !DIBasicType(name: "char", size: 8, encoding: DW_ATE_signed_char)
!19 = !DIDerivedType(tag: DW_TAG_member, name: "z", scope: !14, file: !8, line: 9, baseType: !20, size: 64, offset: 64)
!20 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "bar", file: !8, line: 7, size: 64, elements: !21)
!21 = !{!22}
!22 = !DIDerivedType(tag: DW_TAG_member, name: "z0", scope: !20, file: !8, line: 8, baseType: !23, size: 64)
!23 = !DIBasicType(name: "long int", size: 64, encoding: DW_ATE_signed)
```
That’s a lot of metadata!
Optional optimization assistance whose presence or absence does not affect program correctness. When present, this metadata can be used to progressively enhance some optimizations. For example, the `!range` metadatum can be attached to a `load`, `call`, or `invoke` instruction to indicate that the value produced by the instruction can only fall within one of the associated ranges. Adapted from the LLVM LangRef:
```llvm
; `%a` can only be {0, 1, 3, 4}
%a = load i8, i8* %x, align 1, !range !0

; ... snip ...

; valid ranges: [0, 2), [3, 5)
!0 = !{ i8 0, i8 2, i8 3, i8 5 }
```
Another great example is `!callees`, which helps LLVM perform indirect call promotion (which, in turn, decreases pressure on the indirect branch predictor at runtime).
So, to wrap things up: a correct bitcode parser needs to accurately handle intrinsics and attributes, while metadata can be more or less ignored until a brave masochistic soul needs it for their own purposes.
Let’s get to it.
For reasons that are unclear to me, LLVM describes all attributes as “parameter attributes” at the bitcode/bitstream level, even when said attributes refer to entire functions or function return values. Similarly confusingly, LLVM splits said “parameter attributes” into two separate bitstream-level blocks: `PARAMATTR_BLOCK` and `PARAMATTR_GROUP_BLOCK`6. The former references the latter (by way of indices), so the latter needs to be parsed first. As such, we’ll start with it.
## PARAMATTR_GROUP_BLOCK
Here’s what LLVM’s bitcode docs have to say about this block:

> The `PARAMATTR_GROUP_BLOCK` block (id 10) contains a table of entries describing the attribute groups present in the module. These entries can be referenced within `PARAMATTR_CODE_ENTRY` entries.
The `PARAMATTR_GROUP_BLOCK` block can only contain one kind of record: `PARAMATTR_GRP_CODE_ENTRY`. Each `PARAMATTR_GRP_CODE_ENTRY` record looks like this:
```
[ENTRY, grpid, paramidx, attr0, attr1, ...]
```
…where `grpid` is a unique numeric identifier for this group of attributes, and `paramidx` identifies one of the following:

- `0`: This group contains attributes for the return value (i.e., call-site?7) of the function that references it (by `grpid`).
- `0xFFFFFFFF`: This group contains attributes for the function itself.
- `[1, 0xFFFFFFFF)`: This group contains attributes for the `N`th function parameter, starting at 1.

We can represent this with a tidy `enum` and a total mapping from `u32`:
```rust
#[derive(Clone, Copy, Debug)]
pub enum AttributeGroupDisposition {
    Return,
    Parameter(u32),
    Function,
}

impl From<u32> for AttributeGroupDisposition {
    fn from(value: u32) -> Self {
        match value {
            u32::MAX => Self::Function,
            0 => Self::Return,
            _ => Self::Parameter(value),
        }
    }
}
```
(Docs link.)
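As a quick sanity check, the disposition mapping can be exercised directly. This is a standalone sketch (the type is redeclared here, with extra `PartialEq`/`Eq` derives, so the snippet compiles on its own):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AttributeGroupDisposition {
    Return,
    Parameter(u32),
    Function,
}

impl From<u32> for AttributeGroupDisposition {
    fn from(value: u32) -> Self {
        match value {
            u32::MAX => Self::Function,
            0 => Self::Return,
            _ => Self::Parameter(value),
        }
    }
}

fn main() {
    // paramidx == 0: attributes apply to the return value.
    assert_eq!(AttributeGroupDisposition::from(0), AttributeGroupDisposition::Return);
    // paramidx == 0xFFFFFFFF: attributes apply to the whole function.
    assert_eq!(AttributeGroupDisposition::from(u32::MAX), AttributeGroupDisposition::Function);
    // Anything in [1, 0xFFFFFFFF): attributes apply to the Nth parameter, 1-based.
    assert_eq!(AttributeGroupDisposition::from(3), AttributeGroupDisposition::Parameter(3));
}
```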
Each `attrN` has, in turn, even more internal structure:

```
{ kind, key [, ...], [value [, ...]] }
```
…where `kind` indicates the layout of `key` and `value`:

- `0`: `key` is an integer indicating a “well-known” attribute, such as `noalias` (`9`). These are sometimes referred to as “enum attributes” in the LLVM source code. These attributes have no associated value; all necessary information is wholly present in the key itself.
- `1`: `key` is an integer indicating a “well-known” attribute, and `value` is also an integer, representing a value associated with the attribute. For example, `dereferenceable(<n>)` (`41`) takes `<n>` via the `value` field, representing the number of bytes known to be dereferenceable through the annotated pointer.
- `3`: `key` is a null-terminated string, indicating a not-well-known “string” attribute. This string can be free-form and isn’t accompanied by a value. The absence of well-known string attributes means that there’s no single source of truth for values that are understood by various parts of LLVM. That being said, passing `-fsplit-stack` to `clang` 13 seems to reliably introduce the `"split-stack"` string attribute on any function that isn’t marked with `__attribute__((no_split_stack))`:
```llvm
define dso_local i32 @main() #2 !dbg !9 {
  ; ... snip ...
}

; ... snip ...
attributes #2 = { noinline nounwind optnone uwtable "split-stack" }
```
(View it on Godbolt.)
- `4`: `key` is a null-terminated string, and `value` is also a null-terminated string. As with the key-only variant, neither string is well-known and both are free-form. These are comparatively common, and we can see LLVM translate C-level `__attribute__`s into them:

```c
__attribute__((__target__("sse4.2"))) int add1(int x, int y);
__attribute__((__target__("sse4.1"))) int add2(int x, int y);
```
These become (simplified):
```llvm
attributes #0 = { "target-features"="+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
attributes #2 = { "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" }
```
(View it on Godbolt.)
Observe that `#2` doesn’t include `sse4.2` in its `target-features` value, since we’ve capped it at `sse4.1`.
The astute will notice that `kind=2` isn’t specified. Why? Beats me8!
Once again, we can model this with a relatively tidy `enum`:
```rust
// Each variant's value is exhaustively enumerated in turn, where possible.
// For example, `EnumAttribute` is not just a `u32` newtype but an exhaustive
// `enum` of all currently known "enum"-kinded LLVM attributes.
pub enum Attribute {
    Enum(EnumAttribute),
    Int(IntAttribute),
    Str(String),
    StrKeyValue(String, String),
}
```
(Docs link.)
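To make the `kind` rules concrete, here’s a sketch of decoding a single attribute from a record’s raw `u64` fields. This is a hypothetical simplification, not mollusc’s actual `Attribute::from_record`: keys are kept as raw integers instead of exhaustive enums, strings are assumed to be stored one byte per field, and only kinds 0, 1, 3, and 4 are handled:

```rust
// Simplified attribute model: raw keys instead of exhaustive enums.
#[derive(Debug, PartialEq)]
enum RawAttribute {
    Enum(u64),                   // kind 0: well-known key, no value
    Int(u64, u64),               // kind 1: well-known key + integer value
    Str(String),                 // kind 3: free-form string key
    StrKeyValue(String, String), // kind 4: free-form string key and value
}

// Consume one NUL-terminated string starting at `idx` (assuming one byte
// per record field); returns the string and the index past the terminator.
fn take_cstr(fields: &[u64], idx: usize) -> Option<(String, usize)> {
    let end = fields.get(idx..)?.iter().position(|&f| f == 0)?;
    let s = fields[idx..idx + end].iter().map(|&f| f as u8 as char).collect();
    Some((s, idx + end + 1)) // skip the NUL terminator
}

// Parse one attribute at `idx`; returns (fields consumed, attribute).
fn attr_from_fields(fields: &[u64], idx: usize) -> Option<(usize, RawAttribute)> {
    match *fields.get(idx)? {
        0 => Some((2, RawAttribute::Enum(*fields.get(idx + 1)?))),
        1 => Some((3, RawAttribute::Int(*fields.get(idx + 1)?, *fields.get(idx + 2)?))),
        3 => {
            let (key, next) = take_cstr(fields, idx + 1)?;
            Some((next - idx, RawAttribute::Str(key)))
        }
        4 => {
            let (key, next) = take_cstr(fields, idx + 1)?;
            let (value, next) = take_cstr(fields, next)?;
            Some((next - idx, RawAttribute::StrKeyValue(key, value)))
        }
        _ => None, // kind 2 is unspecified; kinds 5 and 6 exist but are undocumented
    }
}

fn main() {
    // kind 0, key 9: the well-known `noalias` enum attribute.
    assert_eq!(attr_from_fields(&[0, 9], 0), Some((2, RawAttribute::Enum(9))));

    // kind 3, key "split-stack" (NUL-terminated, one byte per field).
    let mut fields = vec![3];
    fields.extend("split-stack".bytes().map(u64::from));
    fields.push(0);
    let (consumed, attr) = attr_from_fields(&fields, 0).unwrap();
    assert_eq!(consumed, fields.len());
    assert_eq!(attr, RawAttribute::Str("split-stack".into()));
}
```

The "(fields consumed, attribute)" return shape is what lets a caller advance `fieldidx` through a record, exactly as the parsing procedure below requires.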
…and to tie it all together, our model for each “attribute group” in `PARAMATTR_GROUP_BLOCK`:
```rust
#[derive(Clone, Debug)]
pub struct AttributeGroup {
    disposition: AttributeGroupDisposition,
    attributes: Vec<Attribute>,
}
```
(Docs link.)
And our parsing procedure for `PARAMATTR_GROUP_BLOCK`:

1. Create an empty mapping of `grpid -> AttributeGroup`.
2. For each `PARAMATTR_GRP_CODE_ENTRY` in `PARAMATTR_GROUP_BLOCK`:
   1. Parse the `grpid` and `paramidx` fields, which are always present and are always the first two in the record. Convert the `paramidx` into its corresponding `AttributeGroupDisposition`.
   2. Initialize a `fieldidx` as `2`, indicating that we’ve already consumed `grpid` and `paramidx`.
   3. Parse `Attribute`s from the record using the `kind` rules above, starting at `fieldidx` and increasing `fieldidx` by the number of record fields consumed at each parse step. Complete when `fieldidx == record.fields().len()`.
   4. Create an `AttributeGroup` whose disposition and attributes are those just parsed.
   5. Add `grpid` and its `AttributeGroup` to the mapping.

…and here’s what that looks like, in Rust:
```rust
impl IrBlock for AttributeGroups {
    const BLOCK_ID: IrBlockId = IrBlockId::ParamAttrGroup;

    fn try_map_inner(block: &UnrolledBlock, _ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let mut groups = HashMap::new();

        for record in block.all_records() {
            let code = AttributeCode::try_from(record.code()).map_err(AttributeError::from)?;
            if code != AttributeCode::GroupCodeEntry {
                return Err(AttributeError::WrongBlock(code).into());
            }

            if record.fields().len() < 3 {
                return Err(RecordMapError::BadRecordLayout(format!(
                    "too few fields in {:?}, expected {} >= 3",
                    code,
                    record.fields().len()
                ))
                .into());
            }

            let group_id = record.fields()[0] as u32;
            let disposition: AttributeGroupDisposition = (record.fields()[1] as u32).into();

            let mut fieldidx = 2;
            let mut attributes = vec![];
            while fieldidx < record.fields().len() {
                let (count, attr) = Attribute::from_record(fieldidx, record)?;
                attributes.push(attr);
                fieldidx += count;
            }

            if fieldidx != record.fields().len() {
                return Err(RecordMapError::BadRecordLayout(format!(
                    "under/overconsumed fields in attribute group record ({} fields, {} consumed)",
                    record.fields().len(),
                    fieldidx,
                ))
                .into());
            }

            groups.insert(
                group_id,
                AttributeGroup {
                    disposition,
                    attributes,
                },
            );
        }

        Ok(AttributeGroups(groups))
    }
}
```
That leaves us with the final product of mapping the `PARAMATTR_GROUP_BLOCK` block: a mapping of `grpid -> AttributeGroup`. Let’s see how the `PARAMATTR_BLOCK` uses this mapping.
## PARAMATTR_BLOCK
The other half of the attributes equation is the `PARAMATTR_BLOCK` block, which is documented by LLVM as follows:

> The `PARAMATTR_BLOCK` block (id 9) contains a table of entries describing the attributes of function parameters. These entries are referenced by 1-based index in the `paramattr` field of module block `FUNCTION` records, or within the `attr` field of function block `INST_INVOKE` and `INST_CALL` records.
>
> Entries within `PARAMATTR_BLOCK` are constructed to ensure that each is unique (i.e., no two indices represent equivalent attribute lists).
There are two valid record codes in the `PARAMATTR_BLOCK`:

- `PARAMATTR_CODE_ENTRY_OLD`: This is a legacy record, emitted by LLVM 3.2 and earlier. We don’t bother handling it, since modern versions of LLVM won’t emit it at all and 3.2 is nearly a decade old at this point. However, for completeness, it looks like this:

```
[ENTRY, paramidx0, attr0, paramidx1, attr1, ...]
```

Look familiar? That’s because it’s nearly identical to the `PARAMATTR_GRP_CODE_ENTRY` record format that we just comprehensively walked through: the only difference is the absence of the “group” concept. LLVM (presumably) deprecated this format when someone realized that many parameters and whole functions share entire groups of attributes, allowing for deduplication by referencing the groups rather than listing all attributes every time.

- `PARAMATTR_CODE_ENTRY`: This is the referee record mentioned in the `PARAMATTR_GROUP_BLOCK` documentation. Each `PARAMATTR_CODE_ENTRY` looks like this:

```
[ENTRY, attrgrp0, attrgrp1, ...]
```
…where `attrgrpN` is an attribute group index. And the circle is complete: we use our mapping of the `PARAMATTR_GROUP_BLOCK` to fetch the list of attributes corresponding to each group index, and append them all together into one big list for each `PARAMATTR_CODE_ENTRY`. These lists, in turn, are referenced by 1-based indices in other blocks and records throughout the bitcode representation.
That ends up being nice and simple:
```rust
impl IrBlock for Attributes {
    const BLOCK_ID: IrBlockId = IrBlockId::ParamAttr;

    fn try_map_inner(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let mut entries = vec![];

        for record in block.all_records() {
            let code = AttributeCode::try_from(record.code()).map_err(AttributeError::from)?;
            match code {
                AttributeCode::Entry => {
                    let mut groups = vec![];
                    for group_id in record.fields() {
                        let group_id = *group_id as u32;
                        log::debug!("group id: {}", group_id);
                        groups.push(
                            ctx.attribute_groups()?
                                .get(group_id)
                                .ok_or(AttributeError::BadAttributeGroup(group_id))?
                                .clone(),
                        );
                    }
                    entries.push(AttributeEntry(groups));
                }
                AttributeCode::GroupCodeEntry => {
                    // This is a valid attribute code, but it isn't valid in this block.
                    return Err(AttributeError::WrongBlock(code).into());
                }
                _ => {
                    return Err(BlockMapError::Unsupported(format!(
                        "unsupported attribute block code: {:?}",
                        code,
                    )))
                }
            }
        }

        Ok(Attributes(entries))
    }
}
```
Parsing the blocks responsible for LLVM’s attributes was moderately troublesome: nowhere near as annoying as the type table9, but not as easy as the identification block or the string table. All told, the current implementation requires slightly under 900 lines of code, much of which is documentation and `enum` variants.

The end result of it all can be seen in the debug logs of the `unroll-bitstream` example provided by the `llvm-mapper` crate:
```sh
RUST_LOG=debug ./target/debug/examples/unroll-bitstream some-input.bc
```
…which, amidst a great deal of other output, should yield some messages like this (formatted for readability):
```
[2021-11-29T06:54:51Z DEBUG llvm_mapper::block::module] attributes:
Some(Attributes([
    AttributeEntry(
        [
            AttributeGroup {
                disposition: Function,
                attributes: [
                    Enum(NoInline),
                    Enum(NoUnwind),
                    Enum(OptimizeNone),
                    Enum(UwTable),
                    StrKeyValue("correctly-rounded-divide-sqrt-fp-math", "false"),
                    StrKeyValue("disable-tail-calls", "false"),
                    StrKeyValue("frame-pointer", "all"),
                    StrKeyValue("less-precise-fpmad", "false"),
                    StrKeyValue("min-legal-vector-width", "0"),
                    StrKeyValue("no-infs-fp-math", "false"),
                    StrKeyValue("no-jump-tables", "false"),
                    StrKeyValue("no-nans-fp-math", "false"),
                    StrKeyValue("no-signed-zeros-fp-math", "false"),
                    StrKeyValue("no-trapping-math", "false"),
                    StrKeyValue("stack-protector-buffer-size", "8"),
                    StrKeyValue("target-cpu", "x86-64"),
                    StrKeyValue("target-features", "+cx8,+fxsr,+mmx,+sse,+sse2,+x87"),
                    StrKeyValue("unsafe-fp-math", "false"),
                    StrKeyValue("use-soft-float", "false")
                ]
            }
        ]
    )
]))
```
This information isn’t exposed anywhere in mapped LLVM modules yet: it’s kept purely as state within the `MapCtx`. Future mapping work (e.g., for IR-level functions, blocks, instructions, &c.) will access that state to correctly associate themselves with their attributes.
As I’ve said in previous posts: mollusc still has no particular end goal or state in mind, other than my broad goal of being able to perform some amount of static analysis of LLVM IR in pure Rust. The ~~beatings~~ development will continue until the ~~masochism~~ curiosity abates.
And possibly more; it’s a big project. In particular, I have no clue how “operand bundles” work or what they do. ↩
Why does LLVM need these intrinsics, rather than unconditionally inserting a call to e.g. an extremely optimized memcpy(3) implementation? Because even that sometimes isn’t enough: sometimes you want to inline the call entirely, or to call a slightly different optimized `memcpy` implementation. Or for even simpler reasons: having an intrinsic here gives LLVM a fuller picture of the program than an external call, even a well-understood one, would normally allow. ↩
There are, however, some intrinsics that can be safely deleted without compromising program correctness. The `@llvm.dbg.*` family is an example of this, but the intricacies of how these intrinsics interact with LLVM’s metadata facilities are well outside the scope of this post. Perhaps another time. ↩
Or failing to interpret them during code generation, or optimization, &c. ↩
This part would be less confusing if the `PARAMATTR_BLOCK` and `PARAMATTR_GROUP_BLOCK` blocks didn’t share a record code namespace. But they do, so it’s not clear why they’re separated at all, especially when they have a tight dependency relationship. ↩
I think? ↩
LLVM’s `BitcodeReader` doesn’t mention `kind=2` at all; see here. It does, however, mention `kind=5` and `kind=6`, despite these not being documented. I haven’t seen these in any bitcode samples yet, so I haven’t bothered digging into them, but they’re apparently “type” attribute formats. ↩
Which had to be rewritten entirely after I published the last post in this series, due to unavoidable lifetime issues with the previous approach. ↩