Nov 29, 2021 Tags: llvm, rust Series: llvm-internals
This is the last post I plan to do on parsing LLVM’s bitcode, at least for a while. I’ll keep working on the various libraries that I’ve started under the mollusc umbrella, but they won’t be accompanied by regular posts anymore — I’d like to refocus on writing smaller, less dense (read: lower-pressure) posts for a while.
Also, as a sort of update: all four of the patchsets1 that I mentioned in the previous post have been fully merged into LLVM, including a couple of documentation and bitcode parsing bugfixes. Many thanks to the LLVM maintainers for reviewing and approving my changes!
LLVM has no less2 than three different ways to represent some of the metadata that gets stored in its intermediate representation of a program: metadata, attributes, and intrinsics. All three are represented differently in the bitcode format, and this post will focus only on attributes, but it’s important to understand the semantic difference between the three.
Using this excellent 2016 LLVM developers’ meeting presentation as a reference:
Intrinsics are internal functions whose semantics are well-defined by and well-known to LLVM. Common examples of these include the `@llvm.memcpy.*` family, with semantics similar to the `memcpy(3)` libc routine, as well as various mathematical intrinsics corresponding roughly to libm routines (such as `@llvm.ceil` for `ceil(3)`). Generation of these intrinsics by language frontends (such as `clang`) is vital to LLVM’s ability to perform a wide variety of memory and arithmetic optimizations3, and `clang` will not hesitate to produce them even for unoptimized runs:
```c
int main(void) {
  char x[1024];
  char y[1024];

  if (rand()) {
    memcpy(x, y, 1024);
  } else {
    memcpy(y, x, 1024);
  }

  return 0;
}
```
yields:
```llvm
define dso_local i32 @main() #0 !dbg !9 {
  %1 = alloca i32, align 4
  %2 = alloca [1024 x i8], align 16
  %3 = alloca [1024 x i8], align 16
  store i32 0, i32* %1, align 4
  call void @llvm.dbg.declare(metadata [1024 x i8]* %2, metadata !14, metadata !DIExpression()), !dbg !19
  call void @llvm.dbg.declare(metadata [1024 x i8]* %3, metadata !20, metadata !DIExpression()), !dbg !21
  %4 = call i32 @rand() #4, !dbg !22
  %5 = icmp ne i32 %4, 0, !dbg !22
  br i1 %5, label %6, label %9, !dbg !24

6:                                                ; preds = %0
  %7 = getelementptr inbounds [1024 x i8], [1024 x i8]* %2, i64 0, i64 0, !dbg !25
  %8 = getelementptr inbounds [1024 x i8], [1024 x i8]* %3, i64 0, i64 0, !dbg !25
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %7, i8* align 16 %8, i64 1024, i1 false), !dbg !25
  br label %12, !dbg !27

9:                                                ; preds = %0
  %10 = getelementptr inbounds [1024 x i8], [1024 x i8]* %3, i64 0, i64 0, !dbg !28
  %11 = getelementptr inbounds [1024 x i8], [1024 x i8]* %2, i64 0, i64 0, !dbg !28
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %10, i8* align 16 %11, i64 1024, i1 false), !dbg !28
  br label %12

12:                                               ; preds = %9, %6
  %13 = load i32, i32* %1, align 4, !dbg !30
  ret i32 %13, !dbg !30
}
```
(View it on Godbolt.)
Consequently, LLVM’s intrinsics are correctness-bearing in IR programs: it is not safe, in the general case, to remove calls to intrinsic functions without providing an adequate substitute function4.
Attributes are markers that define special properties on a small subset of program features: functions (and their callsites), individual parameters, and return values.
In the textual representation of LLVM’s IR, attributes are associated with program features using one of two syntaxes. For parameters and return values, attributes appear inline at their definition sites. Using a `memcpy` intrinsic variant as an example:
```llvm
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #3
```
Here, we have the following parameter attributes:

- Both `i8*` parameters are `noalias` and `nocapture`, meaning that (1) memory accesses through each pointer are exclusive, and (2) neither pointer is “captured” within the function, e.g. by storing its value to a global or other memory location that outlives the function call.
- Our first `i8*` is `writeonly` and our second is `readonly`, which enforce precisely what they say on the tin: the function may only write or read through these respective pointers.
- The final `i1` parameter is marked `immarg`, meaning that it must be an immediate value (along with some other constraints): anything that would have to be further loaded or executed to produce a value is invalid.

Similarly for a declaration of `malloc`:
```llvm
declare dso_local noalias align 16 i8* @malloc(i64) #2
```
Here our single parameter has no attributes of its own, but our return value has two: `noalias` and `align 16`.
Our second syntax is also visible in the snippets above: the `#N` anchor ties each function (or function callsite) to a list of attributes that apply to the entire function or callsite. For example, the `memcpy` intrinsic is associated with `#3`, which is:
```llvm
attributes #3 = { argmemonly nofree nounwind willreturn }
```
(We’re getting into the weeds now, but a whirlwind tour: `argmemonly` asserts that the function only accesses memory through the pointer arguments it’s given; `nofree` asserts that the function does not call a `free`-type function on its pointer argument(s); `nounwind` asserts that the function never raises an exception; `willreturn` asserts that the function either contains UB or eventually returns to its callsite in some call stack.)
Each of these, again, has important effects on the program’s correctness, so naively removing them5 is not safe. They also appear on many callsites, e.g.:
```llvm
%5 = call i32 @rand() #4, !dbg !24

; ... snip ...

attributes #4 = { nounwind }
```
Metadata is a lot of things. Among them:
Source-level debugging information that LLVM is responsible for lowering into a format like DWARF or CodeView. Virtually everything that ends up in one of these binary debugging formats corresponds to one or more metadata entries (indicated textually as `!N`) on LLVM instructions, functions, &c. For example, here’s how LLVM relates an IR-level `%struct` back to its original C-level layout with metadata entries:
```c
typedef struct {
  int x;
  char y;

  struct bar {
    long z0;
  } z;
} foo;

int main(void) {
  foo x;
  foo y;
  /* ... snip ... */
}
```
Here, the `x` and `y` declarations get `@llvm.dbg.declare` intrinsics with associated metadata entries (comments added):
```llvm
%1 = alloca i32, align 4
%2 = alloca %struct.foo, align 8 ; 'foo x'
%3 = alloca %struct.foo, align 8 ; 'foo y'
store i32 0, i32* %1, align 4
call void @llvm.dbg.declare(metadata %struct.foo* %2, metadata !12, metadata !DIExpression()), !dbg !24
call void @llvm.dbg.declare(metadata %struct.foo* %3, metadata !25, metadata !DIExpression()), !dbg !26
```
Following `metadata !12`, we can see that it accurately describes both `x` and its type, by way of subsequent metadata entries:
```llvm
!12 = !DILocalVariable(name: "x", scope: !7, file: !8, line: 13, type: !13)
!13 = !DIDerivedType(tag: DW_TAG_typedef, name: "foo", file: !8, line: 10, baseType: !14)
!14 = distinct !DICompositeType(tag: DW_TAG_structure_type, file: !8, line: 4, size: 128, elements: !15)
!15 = !{!16, !17, !19}
!16 = !DIDerivedType(tag: DW_TAG_member, name: "x", scope: !14, file: !8, line: 5, baseType: !11, size: 32)
!17 = !DIDerivedType(tag: DW_TAG_member, name: "y", scope: !14, file: !8, line: 6, baseType: !18, size: 8, offset: 32)
!18 = !DIBasicType(name: "char", size: 8, encoding: DW_ATE_signed_char)
!19 = !DIDerivedType(tag: DW_TAG_member, name: "z", scope: !14, file: !8, line: 9, baseType: !20, size: 64, offset: 64)
!20 = distinct !DICompositeType(tag: DW_TAG_structure_type, name: "bar", file: !8, line: 7, size: 64, elements: !21)
!21 = !{!22}
!22 = !DIDerivedType(tag: DW_TAG_member, name: "z0", scope: !20, file: !8, line: 8, baseType: !23, size: 64)
!23 = !DIBasicType(name: "long int", size: 64, encoding: DW_ATE_signed)
```
That’s a lot of metadata!
Optional optimization assistance whose presence or absence does not affect program correctness. When present, this metadata can be used to progressively enhance some optimizations. For example, the `!range` metadatum can be attached to a `load`, `call`, or `invoke` instruction to indicate that the value produced by the instruction can only fall within one of the associated ranges. Adapted from the LLVM LangRef:
```llvm
; `%a` can only be {0, 1, 3, 4}
%a = load i8, i8* %x, align 1, !range !0

; ... snip ...

; valid ranges: [0, 2), [3, 5)
!0 = !{ i8 0, i8 2, i8 3, i8 5 }
```
Another great example is `!callees`, which helps LLVM perform indirect call promotion (which, in turn, decreases pressure on the indirect branch predictor at runtime).
So, to wrap things up: a correct bitcode parser needs to accurately handle intrinsics and attributes, while metadata can be more or less ignored until a brave masochistic soul needs it for their own purposes.
Let’s get to it.
For reasons that are unclear to me, LLVM describes all attributes as “parameter attributes” at the bitcode/bitstream level, even when said attributes refer to entire functions or function return values. Similarly confusingly, LLVM splits said “parameter attributes” into two separate bitstream-level blocks: `PARAMATTR_BLOCK` and `PARAMATTR_GROUP_BLOCK`6. The former references the latter (by way of indices), so the latter needs to be parsed first. As such, we’ll start with it.
## PARAMATTR_GROUP_BLOCK
Here’s what LLVM’s bitcode docs have to say about this block:

> The `PARAMATTR_GROUP_BLOCK` block (id 10) contains a table of entries describing the attribute groups present in the module. These entries can be referenced within `PARAMATTR_CODE_ENTRY` entries.
The `PARAMATTR_GROUP_BLOCK` block can only contain one kind of record: `PARAMATTR_GRP_CODE_ENTRY`. Each `PARAMATTR_GRP_CODE_ENTRY` record looks like this:
```
[ENTRY, grpid, paramidx, attr0, attr1, ...]
```
…where `grpid` is a unique numeric identifier for this group of attributes, and `paramidx` identifies one of the following:

- `0`: This group contains attributes for the return value (i.e., call-site?7) of the function that references it (by `grpid`).
- `0xFFFFFFFF`: This group contains attributes for the function itself.
- `[1, 0xFFFFFFFF)`: This group contains attributes for the `N`th function parameter, starting at 1.

We can represent this with a tidy `enum` and a total mapping from `u32`:
```rust
#[derive(Clone, Copy, Debug)]
pub enum AttributeGroupDisposition {
    Return,
    Parameter(u32),
    Function,
}

impl From<u32> for AttributeGroupDisposition {
    fn from(value: u32) -> Self {
        match value {
            u32::MAX => Self::Function,
            0 => Self::Return,
            _ => Self::Parameter(value),
        }
    }
}
```
(Docs link.)
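As a quick sanity check, the disposition mapping can be exercised directly. This is a standalone sketch (the type is redeclared here, with extra `PartialEq`/`Eq` derives, so the snippet compiles on its own):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AttributeGroupDisposition {
    Return,
    Parameter(u32),
    Function,
}

impl From<u32> for AttributeGroupDisposition {
    fn from(value: u32) -> Self {
        match value {
            u32::MAX => Self::Function,
            0 => Self::Return,
            _ => Self::Parameter(value),
        }
    }
}

fn main() {
    // paramidx == 0: attributes apply to the return value.
    assert_eq!(AttributeGroupDisposition::from(0), AttributeGroupDisposition::Return);
    // paramidx == 0xFFFFFFFF: attributes apply to the whole function.
    assert_eq!(AttributeGroupDisposition::from(u32::MAX), AttributeGroupDisposition::Function);
    // Anything in [1, 0xFFFFFFFF): attributes apply to the Nth parameter, 1-based.
    assert_eq!(AttributeGroupDisposition::from(3), AttributeGroupDisposition::Parameter(3));
}
```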
Each `attrN` has, in turn, even more internal structure:

```
{ kind, key [, ...], [value [, ...]] }
```
…where `kind` indicates the layout of `key` and `value`:

- `0`: `key` is an integer indicating a “well-known” attribute, such as `noalias` (`9`). These are sometimes referred to as “enum attributes” in the LLVM source code. These attributes have no associated value; all necessary information is wholly present in the key itself.
- `1`: `key` is an integer indicating a “well-known” attribute, and `value` is also an integer, representing a value associated with the attribute. For example, `dereferenceable(<n>)` (`41`) takes `<n>` via the `value` field, representing the number of bytes known to be dereferenceable through the annotated pointer.
- `3`: `key` is a null-terminated string, indicating a not-well-known “string” attribute. This string can be free-form and isn’t accompanied by a value. The absence of well-known string attributes means that there’s no single source of truth for values that are understood by various parts of LLVM. That being said, passing `-fsplit-stack` to `clang` 13 seems to reliably introduce the `"split-stack"` string attribute on any function that isn’t marked with `__attribute__((no_split_stack))`:
```llvm
define dso_local i32 @main() #2 !dbg !9 {
  ; ... snip ...
}

; ... snip ...
attributes #2 = { noinline nounwind optnone uwtable "split-stack" }
```
(View it on Godbolt.)
- `4`: `key` is a null-terminated string, and `value` is also a null-terminated string. As with the key-only variant, neither string is well-known and both are free-form. These are comparatively common, and we can see LLVM translate C-level `__attribute__`s into them:

```c
__attribute__((__target__("sse4.2"))) int add1(int x, int y);
__attribute__((__target__("sse4.1"))) int add2(int x, int y);
```
These become (simplified):
```llvm
attributes #0 = { "target-features"="+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87" }
attributes #2 = { "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" }
```
(View it on Godbolt.)
Observe that `#2` doesn’t include `sse4.2` in its `target-features` value, since we’ve capped it at `sse4.1`.
The astute will notice that `kind=2` isn’t specified. Why? Beats me8!
Once again, we can model this with a relatively tidy `enum`:
```rust
// Each variant's value is exhaustively enumerated in turn, where possible.
// For example, `EnumAttribute` is not just a `u32` newtype but an exhaustive
// `enum` of all currently known "enum"-kinded LLVM attributes.
pub enum Attribute {
    Enum(EnumAttribute),
    Int(IntAttribute),
    Str(String),
    StrKeyValue(String, String),
}
```
(Docs link.)
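To make the `kind` rules concrete, here’s a sketch of decoding a single attribute from a record’s raw `u64` fields. This is a hypothetical simplification, not mollusc’s actual `Attribute::from_record`: keys are kept as raw integers instead of exhaustive enums, strings are assumed to be stored one byte per field, and only kinds 0, 1, 3, and 4 are handled:

```rust
// Simplified attribute model: raw keys instead of exhaustive enums.
#[derive(Debug, PartialEq)]
enum RawAttribute {
    Enum(u64),                   // kind 0: well-known key, no value
    Int(u64, u64),               // kind 1: well-known key + integer value
    Str(String),                 // kind 3: free-form string key
    StrKeyValue(String, String), // kind 4: free-form string key and value
}

// Consume one NUL-terminated string starting at `idx` (assuming one byte
// per record field); returns the string and the index past the terminator.
fn take_cstr(fields: &[u64], idx: usize) -> Option<(String, usize)> {
    let end = fields.get(idx..)?.iter().position(|&f| f == 0)?;
    let s = fields[idx..idx + end].iter().map(|&f| f as u8 as char).collect();
    Some((s, idx + end + 1)) // skip the NUL terminator
}

// Parse one attribute at `idx`; returns (fields consumed, attribute).
fn attr_from_fields(fields: &[u64], idx: usize) -> Option<(usize, RawAttribute)> {
    match *fields.get(idx)? {
        0 => Some((2, RawAttribute::Enum(*fields.get(idx + 1)?))),
        1 => Some((3, RawAttribute::Int(*fields.get(idx + 1)?, *fields.get(idx + 2)?))),
        3 => {
            let (key, next) = take_cstr(fields, idx + 1)?;
            Some((next - idx, RawAttribute::Str(key)))
        }
        4 => {
            let (key, next) = take_cstr(fields, idx + 1)?;
            let (value, next) = take_cstr(fields, next)?;
            Some((next - idx, RawAttribute::StrKeyValue(key, value)))
        }
        _ => None, // kind 2 is unspecified; kinds 5 and 6 exist but are undocumented
    }
}

fn main() {
    // kind 0, key 9: the well-known `noalias` enum attribute.
    assert_eq!(attr_from_fields(&[0, 9], 0), Some((2, RawAttribute::Enum(9))));

    // kind 3, key "split-stack" (NUL-terminated, one byte per field).
    let mut fields = vec![3];
    fields.extend("split-stack".bytes().map(u64::from));
    fields.push(0);
    let (consumed, attr) = attr_from_fields(&fields, 0).unwrap();
    assert_eq!(consumed, fields.len());
    assert_eq!(attr, RawAttribute::Str("split-stack".into()));
}
```

The "(fields consumed, attribute)" return shape is what lets a caller advance `fieldidx` through a record, exactly as the parsing procedure below requires.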
…and to tie it all together, our model for each “attribute group” in `PARAMATTR_GROUP_BLOCK`:
```rust
#[derive(Clone, Debug)]
pub struct AttributeGroup {
    disposition: AttributeGroupDisposition,
    attributes: Vec<Attribute>,
}
```
(Docs link.)
And our parsing procedure for `PARAMATTR_GROUP_BLOCK`:

1. Create an empty mapping of `grpid -> AttributeGroup`.
2. For each `PARAMATTR_GRP_CODE_ENTRY` in `PARAMATTR_GROUP_BLOCK`:
   1. Parse the `grpid` and `paramidx` fields, which are always present and are always the first two in the record. Convert the `paramidx` into its corresponding `AttributeGroupDisposition`.
   2. Initialize a `fieldidx` as `2`, indicating that we’ve already consumed `grpid` and `paramidx`.
   3. Parse `Attribute`s from the record using the `kind` rules above, starting at `fieldidx` and increasing `fieldidx` by the number of record fields consumed at each parse step. Complete when `fieldidx == record.fields().len()`.
   4. Create an `AttributeGroup` whose disposition and attributes are those just parsed.
   5. Add `grpid` and its `AttributeGroup` to the mapping.

…and here’s what that looks like, in Rust:
```rust
impl IrBlock for AttributeGroups {
    const BLOCK_ID: IrBlockId = IrBlockId::ParamAttrGroup;

    fn try_map_inner(block: &UnrolledBlock, _ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let mut groups = HashMap::new();

        for record in block.all_records() {
            let code = AttributeCode::try_from(record.code()).map_err(AttributeError::from)?;
            if code != AttributeCode::GroupCodeEntry {
                return Err(AttributeError::WrongBlock(code).into());
            }

            if record.fields().len() < 3 {
                return Err(RecordMapError::BadRecordLayout(format!(
                    "too few fields in {:?}, expected {} >= 3",
                    code,
                    record.fields().len()
                ))
                .into());
            }

            let group_id = record.fields()[0] as u32;
            let disposition: AttributeGroupDisposition = (record.fields()[1] as u32).into();

            let mut fieldidx = 2;
            let mut attributes = vec![];
            while fieldidx < record.fields().len() {
                let (count, attr) = Attribute::from_record(fieldidx, record)?;
                attributes.push(attr);
                fieldidx += count;
            }

            if fieldidx != record.fields().len() {
                return Err(RecordMapError::BadRecordLayout(format!(
                    "under/overconsumed fields in attribute group record ({} fields, {} consumed)",
                    record.fields().len(),
                    fieldidx,
                ))
                .into());
            }

            groups.insert(
                group_id,
                AttributeGroup {
                    disposition,
                    attributes,
                },
            );
        }

        Ok(AttributeGroups(groups))
    }
}
```
That leaves us with the final product of mapping the `PARAMATTR_GROUP_BLOCK` block: a mapping of `grpid -> AttributeGroup`. Let’s see how the `PARAMATTR_BLOCK` uses this mapping.
## PARAMATTR_BLOCK
The other half of the attributes equation is the `PARAMATTR_BLOCK` block, which is documented by LLVM as follows:

> The `PARAMATTR_BLOCK` block (id 9) contains a table of entries describing the attributes of function parameters. These entries are referenced by 1-based index in the `paramattr` field of module block `FUNCTION` records, or within the `attr` field of function block `INST_INVOKE` and `INST_CALL` records.
>
> Entries within `PARAMATTR_BLOCK` are constructed to ensure that each is unique (i.e., no two indices represent equivalent attribute lists).
There are two valid record codes in the `PARAMATTR_BLOCK`:

- `PARAMATTR_CODE_ENTRY_OLD`: This is a legacy record, emitted by LLVM 3.2 and earlier. We don’t bother handling it, since modern versions of LLVM won’t emit it at all and 3.2 is nearly a decade old at this point. However, for completeness, it looks like this:

```
[ENTRY, paramidx0, attr0, paramidx1, attr1, ...]
```

Look familiar? That’s because it’s nearly identical to the `PARAMATTR_GRP_CODE_ENTRY` record format that we just comprehensively walked through: the only difference is the absence of the “group” concept. LLVM (presumably) deprecated this format when someone realized that many parameters and whole functions share entire groups of attributes, allowing for deduplication by referencing the groups rather than listing all attributes every time.

- `PARAMATTR_CODE_ENTRY`: This is the referee record mentioned in the `PARAMATTR_GROUP_BLOCK` documentation. Each `PARAMATTR_CODE_ENTRY` looks like this:

```
[ENTRY, attrgrp0, attrgrp1, ...]
```
…where `attrgrpN` is an attribute group index. And the circle is complete: we use our mapping of the `PARAMATTR_GROUP_BLOCK` to fetch the list of attributes corresponding to each group index, and append them all together into one big list for each `PARAMATTR_CODE_ENTRY`. These lists, in turn, are referenced by 1-based indices in other blocks and records throughout the bitcode representation.
That ends up being nice and simple:
```rust
impl IrBlock for Attributes {
    const BLOCK_ID: IrBlockId = IrBlockId::ParamAttr;

    fn try_map_inner(block: &UnrolledBlock, ctx: &mut MapCtx) -> Result<Self, BlockMapError> {
        let mut entries = vec![];

        for record in block.all_records() {
            let code = AttributeCode::try_from(record.code()).map_err(AttributeError::from)?;
            match code {
                AttributeCode::Entry => {
                    let mut groups = vec![];
                    for group_id in record.fields() {
                        let group_id = *group_id as u32;
                        log::debug!("group id: {}", group_id);
                        groups.push(
                            ctx.attribute_groups()?
                                .get(group_id)
                                .ok_or(AttributeError::BadAttributeGroup(group_id))?
                                .clone(),
                        );
                    }
                    entries.push(AttributeEntry(groups));
                }
                AttributeCode::GroupCodeEntry => {
                    // This is a valid attribute code, but it isn't valid in this block.
                    return Err(AttributeError::WrongBlock(code).into());
                }
                _ => {
                    return Err(BlockMapError::Unsupported(format!(
                        "unsupported attribute block code: {:?}",
                        code,
                    )))
                }
            }
        }

        Ok(Attributes(entries))
    }
}
```
Parsing the blocks responsible for LLVM’s attributes was moderately troublesome: nowhere near as annoying as the type table9, but not as easy as the identification block or the string table. All told, the current implementation requires slightly under 900 lines of code, much of which is documentation and `enum` variants.

The end result of it all can be seen in the debug logs of the `unroll-bitstream` example provided by the `llvm-mapper` crate:
```sh
RUST_LOG=debug ./target/debug/examples/unroll-bitstream some-input.bc
```
…which, amidst a great deal of other output, should yield some messages like this (formatted for readability):
```
[2021-11-29T06:54:51Z DEBUG llvm_mapper::block::module] attributes:
Some(Attributes([
    AttributeEntry(
        [
            AttributeGroup {
                disposition: Function,
                attributes: [
                    Enum(NoInline),
                    Enum(NoUnwind),
                    Enum(OptimizeNone),
                    Enum(UwTable),
                    StrKeyValue("correctly-rounded-divide-sqrt-fp-math", "false"),
                    StrKeyValue("disable-tail-calls", "false"),
                    StrKeyValue("frame-pointer", "all"),
                    StrKeyValue("less-precise-fpmad", "false"),
                    StrKeyValue("min-legal-vector-width", "0"),
                    StrKeyValue("no-infs-fp-math", "false"),
                    StrKeyValue("no-jump-tables", "false"),
                    StrKeyValue("no-nans-fp-math", "false"),
                    StrKeyValue("no-signed-zeros-fp-math", "false"),
                    StrKeyValue("no-trapping-math", "false"),
                    StrKeyValue("stack-protector-buffer-size", "8"),
                    StrKeyValue("target-cpu", "x86-64"),
                    StrKeyValue("target-features", "+cx8,+fxsr,+mmx,+sse,+sse2,+x87"),
                    StrKeyValue("unsafe-fp-math", "false"),
                    StrKeyValue("use-soft-float", "false")
                ]
            }
        ]
    )
]))
```
This information isn’t exposed anywhere in mapped LLVM modules yet: it’s kept purely as state within the `MapCtx`. Future mapping work (e.g., for IR-level functions, blocks, instructions, &c.) will access that state to correctly associate themselves with their attributes.
As I’ve said in previous posts: mollusc still has no particular end goal or state in mind, other than my broad goal of being able to perform some amount of static analysis of LLVM IR in pure Rust. The ~~beatings~~ development will continue until the ~~masochism~~ curiosity abates.
And possibly more; it’s a big project. In particular, I have no clue how “operand bundles” work or what they do. ↩
Why does LLVM need these intrinsics, rather than unconditionally inserting a call to e.g. an extremely optimized memcpy(3) implementation? Because even that sometimes isn’t enough: sometimes you want to inline the call entirely, or to call a slightly different optimized `memcpy` implementation. Or for even simpler reasons: having an intrinsic here gives LLVM a fuller picture of the program than an external call, even a well-understood one, would normally allow. ↩
There are, however, some intrinsics that can be safely deleted without compromising program correctness. The `@llvm.dbg.*` family is an example of this, but the intricacies of how these intrinsics interact with LLVM’s metadata facilities are well outside the scope of this post. Perhaps another time. ↩
Or failing to interpret them during code generation, or optimization, &c. ↩
This part would be less confusing if the `PARAMATTR_BLOCK` and `PARAMATTR_GROUP_BLOCK` blocks didn’t share a record code namespace. But they do, so it’s not clear why they’re separated at all, especially when they have a tight dependency relationship. ↩
I think? ↩
LLVM’s `BitcodeReader` doesn’t mention `kind=2` at all; see here. It does, however, mention `kind=5` and `kind=6`, despite these not being documented. I haven’t seen these in any bitcode samples yet, so I haven’t bothered digging into them, but they’re apparently “type” attribute formats. ↩
Which had to be rewritten entirely after I published the last post in this series, due to unavoidable lifetime issues with the previous approach. ↩