LLVM's getelementptr, by example

I’ve been working inside of codebases built on LLVM for a little over two years. I’m used to (and very comfortable) reading LLVM IR.

Despite all that, I still need to stop and think about getelementptr’s semantics whenever I see it. Thus: this post will serve to help myself (and maybe you, reader) feel more comfortable with getelementptr.

Background

The LLVM Project has many parts, but is perhaps best known by ordinary developers for being the home of Clang, an optimizing C/C++ compiler¹.

Beyond being an impressive piece of compiler engineering, LLVM’s greatest claim to fame is its Intermediate Representation, or IR. Having a (well-designed, mature, stable) IR confers several advantages to an optimizing compiler:

Broadly speaking, LLVM’s IR is divided into four (increasingly narrow) scopes:

Instructions, in turn, take Values. Functions, basic blocks, and instructions are themselves kinds of values² (among many other kinds), completing the picture.

First steps

Here’s how the LLVM Language Reference defines and describes getelementptr as of LLVM 10:

1
2
3
<result> = getelementptr <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr inbounds <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr <ty>, <ptr vector> <ptrval>, [inrange] <vector index type> <idx>

In other words: it’s a souped up LLVM version of x86’s lea: it computes the effective address of some field without actually touching any memory. I say “souped up” because it can do a few things that (a single) lea can’t:

I’ve read this official overview dozens of times, and I basically understand it. Nonetheless, my eyes still swim whenever I actually read a getelementptr in real IR. LLVM’s developers are aware of how confusing getelementptr is: they have an entire separate page dedicated to explaining its syntax, why it doesn’t touch memory, and the various constraints on constructing a valid one.

Unfortunately, what that page doesn’t have is a ton of side-by-side examples with real C code. So that’s what I’m going to do today. All examples will use global variables and -fno-discard-value-names to make the generated IR a bit easier to read.

Basic addressing

Let’s take a look at a trivial C function that just returns the first element in a global array:

1
2
3
4
5
6
7
8
@nums = dso_local global i64* inttoptr (i64 1 to i64*), align 8

define dso_local i64 @index_first() #0{
  %0 = load i64*, i64** @nums, align 8
  %arrayidx = getelementptr inbounds i64, i64* %0, i64 0
  %1 = load i64, i64* %arrayidx, align 8
  ret i64 %1
}

That’s not too bad at all! getelementptr behaves exactly like the simple index operation that it models.

What if we change nums to an array? Arrays decompose to pointers in C so it should be the same, right?

Wrong! LLVM IR isn’t C, and doesn’t play by C’s rules. In particular, it has sized array types that can be cast to pointers, but that don’t decompose on their own:

1
2
3
4
5
6
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16

define dso_local i64 @index_first() #0 {
  %0 = load i64, i64* getelementptr inbounds ([3 x i64], [3 x i64]* @nums, i64 0, i64 0), align 16
  ret i64 %0
}

This is simpler in some ways and more complicated in others: getelementptr is now invoked inline within the load⁴, but we’ve also saved a load and two SSA variables.

As best I can tell, this is a quirk of the new type of nums: it’s not a pointer anymore, but an array. However, because we’re accessing it through a pointer (because that’s how getelementptr is defined), we actually need two 0 indexes: the first to piece the pointer, and the second to index the array itself. Same operation, slightly more indirection within the IR⁵.

More indirection

1
2
3
4
5
6
7
8
9
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16
@i = common dso_local global i64 0, align 8

define dso_local i64 @index_i() #0 {
  %0 = load i64, i64* @i, align 8
  %arrayidx = getelementptr inbounds [3 x i64], [3 x i64]* @nums, i64 0, i64 %0
  %1 = load i64, i64* %arrayidx, align 8
  ret i64 %1
}

It’s nearly identical to our second trivial example! The only thing that’s changed is our last index, going from 0 to %0 (which corresponds to our loaded i). This in turn confirms what we thought about the two i64 0s above: that the first (still present) simply pierces the pointer, while the second performs the “actual” indexing.

1
2
3
4
5
6
7
8
9
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8

define dso_local i64 @index_i2() local_unnamed_addr #0 {
  %0 = load i64, i64* @i, align 8
  %arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %0
  %1 = load i64, i64* %arrayidx1, align 8
  ret i64 %1
}

Exactly as expected: all we’re doing is indexing one layer deeper into the aggregate with the same index (%0) and, sure, enough, our getelementptr has one additional argument.

Adding a little bit of math makes it clear that the order of the indexes is as we expect:

1
2
3
4
5
6
7
8
9
10
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8

define dso_local i64 @index_i2() local_unnamed_addr #0 {
  %0 = load i64, i64* @i, align 8
  %add = add nsw i64 %0, 1
  %arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %add
  %1 = load i64, i64* %arrayidx1, align 8
  ret i64 %1
}

Structures and other composites

We’ve been doing pointers and arrays so far, but getelementptr is well defined for any aggregate, including structures.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
typedef struct foo {
  struct {
    long field1;
    struct {
      long field2;
      long field3;
      union {
        long field4;
        char field5[32];
      } quux;
      long field6;
      long field7;
    } baz;
    long field8;
  } bar;
} foo;

foo chunky;

char take_field(void) {
  return chunky.bar.baz.quux.field5[17];
}

1
2
3
4
5
6
7
8
9
10
11
%struct.foo = type { %struct.anon }
%struct.anon = type { i64, %struct.anon.0, i64 }
%struct.anon.0 = type { i64, i64, %union.anon, i64, i64 }
%union.anon = type { i64, [24 x i8] }

@chunky = common dso_local local_unnamed_addr global %struct.foo zeroinitializer, align 8

define dso_local signext i8 @take_field() local_unnamed_addr #0 {
  %0 = load i8, i8* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2, i32 1, i64 9), align 1
  ret i8 %0
}

We’re off to a good start again: we’re addressing into a foo, and the foo that we’re addressing into is the chunky global. Then, in sequence:

First of all, it’s not actually a union in LLVM land. LLVM does not have its own union type.

However, if we look more closely, we’ll see that LLVM has engaged in some cleverness: %union.anon is just another structure, but its second field (our field5) is 24 bytes instead of 32 — the first 8 are in field4, just as the C abstract machine expects. Thus, LLVM has correctly modeled C’s union semantics without unions of its own.

So, whence index 9? Well, remember that we’re dealing with a [24 x i8], i.e. the last 24 bytes of field5. In other words, by virtue of our previous i32 1 index, we’ve already moved past the first sizeof(i64) == 8 bytes of field5, and 8 + 9 == 17! Our index! LLVM hasn’t let us down.

But wait: what happens if we try to grab a part of field5 that overlaps with field4, like field5[6]?

1
%0 = load i8, i8* getelementptr inbounds ([32 x i8], [32 x i8]* bitcast (%union.anon* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2) to [32 x i8]*), i64 0, i64 6), align 2

So, we have a nested getelementptr expression, but it’s not nearly as bad as it looks:

In other words: we sometimes need two (or more) getelementptrs⁷ to compute the effective addresses of union fields.

Vectorized address computation

To my embarrassment, the first structure example above took me about 30 minutes to understand. So why stop the ~~beating~~ education there?

As mentioned all the way at the beginning, getelementptr can take and return entire vectors of addresses and indexes.

I spent about an hour trying to get LLVM to emit a vectorized getelementptr, and failed miserably. So I’m just going to contrive a few.

First, let’s look at a basic case: indexing every address in a vector with a constant scalar value:

Observe that the first argument is not a vector: it’s still just the type of thing we’re computing addresses for.

The second is where we introduce our vector: 8 i64s. More precisely, it’s a pointer in this context: even with vectors, getelementptr expects its base address to be a pointer.

Finally, our constant scalar 0. This gets multiplexed across every member of %tbl, amounting to:

If you understood the above, then this one shouldn’t be too big of a next step: instead of always grabbing index 0 in %tbl, we’re grabbing whatever value is in %offset. In effect:

First, observe that %offsets is the same size as %tbl: they’re both <8 x i64>. This is an invariant of the vectorized form: you can’t use differently sized vectors in getelementptr.

This is because getelementptr is simple: it doesn’t do anything matrix-y, it just maps each member of the base (or intermediate) vector to its corresponding index:

To cap things off: remember that there’s no limit on the number of index arguments that a getelementptr may have, and that they don’t have to be in any particular order:

1
2
3
4
5
6
7
8
9
long *foo[8];
long tbl[8], offsets[8];
long x;

foo[0] = &tbl[offsets[0][x][0][6]];
foo[1] = &tbl[offsets[1][x][1][6]];
foo[2] = &tbl[offsets[2][x][2][6]];
// ...
foo[7] = &tbl[offsets[7][x][3][6]];

Other bits

I haven’t talked at all about some of the possible keywords on the getelementptr instruction, like inbounds and inrange.

Wrapup

All told: getelementptr is a little unintuitive and verbose, but not nearly as bad as I thought it was.

It’s also extremely precise: getelementptr tells us exactly what kind of address it’s calculating, where it’s beginning, and the exact “path” through the aggregate that results in a final effective address. It’s generic and flexible, just like LLVM itself.

And, by a significant margin, the only serious open-source competitor against GCC. ↩
Very literally: llvm::Function et al. are derived from llvm::Value. ↩
You’ll probably notice that we load from an i64**, despite nums being just an i64*. This is more LLVM semantic funniness: load takes a <ty>* for its source, so loading an i64* requires i64**. ↩
If you’re a stickler, you’ll point out that these inline GEPs are not technically instructions in LLVM-land, but actually constant expressions (i.e. llvm::GetElementPtrConstantExpr instead of llvm::GetElementPtrInst). But the semantics of the two are identical for my explanatory purposes, so sue me. ↩
Sure enough, the LLVM docs explain this. ↩
I have no clue as to why LLVM switches between i64 and i32 in this GEP. Most LLVM documentation seems to suggest that it just doesn’t matter. ↩
But again, for the pedants: these are constant expressions instead of “real” GEPs, so they’re essentially free. ↩
Without inbounds, getelementptr will always return some result address, which may or may not be valid. ↩

ENOSUCHBLOG

Programming, philosophy, pedaling.