ENOSUCHBLOG

Programming, philosophy, pedaling.


LLVM's getelementptr, by example

Sep 19, 2020     Tags: llvm, programming    

This post is at least a year old.

(With apologies to Branson Reese.)

Preword

I’ve been working inside of codebases built on LLVM for a little over two years. I’m used to (and very comfortable) reading LLVM IR.

Despite all that, I still need to stop and think about getelementptr’s semantics whenever I see it. Thus: this post will serve to help myself (and maybe you, reader) feel more comfortable with getelementptr.

Background

First of all, there’s this thing called LLVM.

LLVM originally stood Low Level Virtual Machine, but no longer stands for anything.

The LLVM Project has many parts, but is perhaps best known by ordinary developers for being the home of Clang, an optimizing C/C++ compiler1.

Beyond being an impressive piece of compiler engineering, LLVM’s greatest claim to fame is its Intermediate Representation, or IR. Having a (well-designed, mature, stable) IR confers several advantages to an optimizing compiler:

Broadly speaking, LLVM’s IR is divided into four (increasingly narrow) scopes:

Instructions, in turn, take Values. Functions, basic blocks, and instructions are themselves kinds of values2 (among many other kinds), completing the picture.

This blog post concerns a single LLVM instruction: getelementptr.

First steps

Here’s how the LLVM Language Reference defines and describes getelementptr as of LLVM 10:

1
2
3
<result> = getelementptr <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr inbounds <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr <ty>, <ptr vector> <ptrval>, [inrange] <vector index type> <idx>

The ‘getelementptr’ instruction is used to get the address of a subelement of an aggregate data structure. It performs address calculation only and does not access memory. The instruction can also be used to calculate a vector of such addresses.

In other words: it’s a souped up LLVM version of x86’s lea: it computes the effective address of some field without actually touching any memory. I say “souped up” because it can do a few things that (a single) lea can’t:

I’ve read this official overview dozens of times, and I basically understand it. Nonetheless, my eyes still swim whenever I actually read a getelementptr in real IR. LLVM’s developers are aware of how confusing getelementptr is: they have an entire separate page dedicated to explaining its syntax, why it doesn’t touch memory, and the various constraints on constructing a valid one.

Unfortunately, what that page doesn’t have is a ton of side-by-side examples with real C code. So that’s what I’m going to do today. All examples will use global variables and -fno-discard-value-names to make the generated IR a bit easier to read.

Basic addressing

Let’s take a look at a trivial C function that just returns the first element in a global array:

1
2
3
4
long *nums = {1, 2, 3};
long index_first(void) {
    return nums[0];
}

and the generated IR (cleaned up a bit):

1
2
3
4
5
6
7
8
@nums = dso_local global i64* inttoptr (i64 1 to i64*), align 8

define dso_local i64 @index_first() #0{
  %0 = load i64*, i64** @nums, align 8
  %arrayidx = getelementptr inbounds i64, i64* %0, i64 0
  %1 = load i64, i64* %arrayidx, align 8
  ret i64 %1
}

(View it on Godbolt.)

Now, step by step:

  1. We load our global nums into %0, with type i64*3
  2. We use getelementptr to calculate the address of an i64 (the first argument), using:

    …and we store our calculated address into %arrayidx.

  3. We load an i64 from our calculated address (%arrayidx) into %1
  4. We ret our loaded value (%1)

That’s not too bad at all! getelementptr behaves exactly like the simple index operation that it models.

What if we change nums to an array? Arrays decompose to pointers in C so it should be the same, right?

1
2
3
4
long nums[] = {1, 2, 3};
long index_first(void) {
    return nums[0];
}

Wrong! LLVM IR isn’t C, and doesn’t play by C’s rules. In particular, it has sized array types that can be cast to pointers, but that don’t decompose on their own:

1
2
3
4
5
6
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16

define dso_local i64 @index_first() #0 {
  %0 = load i64, i64* getelementptr inbounds ([3 x i64], [3 x i64]* @nums, i64 0, i64 0), align 16
  ret i64 %0
}

(View it on Godbolt.)

This is simpler in some ways and more complicated in others: getelementptr is now invoked inline within the load4, but we’ve also saved a load and two SSA variables.

Breaking the getelementptr down:

But wait, there are two i64 0 indexes instead of just one! What gives?

As best I can tell, this is a quirk of the new type of nums: it’s not a pointer anymore, but an array. However, because we’re accessing it through a pointer (because that’s how getelementptr is defined), we actually need two 0 indexes: the first to piece the pointer, and the second to index the array itself. Same operation, slightly more indirection within the IR5.

More indirection

What happens when we add a level of source indirection, like an index variable?

1
2
3
4
5
long nums[] = {1, 2, 3};
long i;
long index_i(void) {
  return nums[i];
}
1
2
3
4
5
6
7
8
9
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16
@i = common dso_local global i64 0, align 8

define dso_local i64 @index_i() #0 {
  %0 = load i64, i64* @i, align 8
  %arrayidx = getelementptr inbounds [3 x i64], [3 x i64]* @nums, i64 0, i64 %0
  %1 = load i64, i64* %arrayidx, align 8
  ret i64 %1
}

(View it on Godbolt.)

It’s nearly identical to our second trivial example! The only thing that’s changed is our last index, going from 0 to %0 (which corresponds to our loaded i). This in turn confirms what we thought about the two i64 0s above: that the first (still present) simply pierces the pointer, while the second performs the “actual” indexing.

Let’s try it with more flavor dimensionality:

1
2
3
4
5
long nums[3][3] = { {1, 2, 3}, {2, 3, 4}, {3, 4, 5} };
long i;
long index_i2(void) {
  return nums[i][i];
}

yields:

1
2
3
4
5
6
7
8
9
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8

define dso_local i64 @index_i2() local_unnamed_addr #0 {
  %0 = load i64, i64* @i, align 8
  %arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %0
  %1 = load i64, i64* %arrayidx1, align 8
  ret i64 %1
}

(View it on Godbolt.)

Exactly as expected: all we’re doing is indexing one layer deeper into the aggregate with the same index (%0) and, sure, enough, our getelementptr has one additional argument.

Adding a little bit of math makes it clear that the order of the indexes is as we expect:

1
2
3
4
5
long nums[3][3] = { {1, 2, 3}, {2, 3, 4}, {3, 4, 5} };
long i;
long index_i2(void) {
  return nums[i][i + 1];
}

becomes:

1
2
3
4
5
6
7
8
9
10
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8

define dso_local i64 @index_i2() local_unnamed_addr #0 {
  %0 = load i64, i64* @i, align 8
  %add = add nsw i64 %0, 1
  %arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %add
  %1 = load i64, i64* %arrayidx1, align 8
  ret i64 %1
}

(View it on Godbolt.)

%add boils down to i + 1, as expected.

Structures and other composites

We’ve been doing pointers and arrays so far, but getelementptr is well defined for any aggregate, including structures.

Let’s go all in:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
typedef struct foo {
  struct {
    long field1;
    struct {
      long field2;
      long field3;
      union {
        long field4;
        char field5[32];
      } quux;
      long field6;
      long field7;
    } baz;
    long field8;
  } bar;
} foo;

foo chunky;

char take_field(void) {
  return chunky.bar.baz.quux.field5[17];
}

Once again, LLVM blesses us with a single getelementptr:

1
2
3
4
5
6
7
8
9
10
11
%struct.foo = type { %struct.anon }
%struct.anon = type { i64, %struct.anon.0, i64 }
%struct.anon.0 = type { i64, i64, %union.anon, i64, i64 }
%union.anon = type { i64, [24 x i8] }

@chunky = common dso_local local_unnamed_addr global %struct.foo zeroinitializer, align 8

define dso_local signext i8 @take_field() local_unnamed_addr #0 {
  %0 = load i8, i8* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2, i32 1, i64 9), align 1
  ret i8 %0
}

(View it on Godbolt)

We’re off to a good start again: we’re addressing into a foo, and the foo that we’re addressing into is the chunky global. Then, in sequence:

What just happened? Where’s our 17? Where the hell did 9 come from?

Let’s look at our union type more closely:

1
%union.anon = type { i64, [24 x i8] }

First of all, it’s not actually a union in LLVM land. LLVM does not have its own union type.

However, if we look more closely, we’ll see that LLVM has engaged in some cleverness: %union.anon is just another structure, but its second field (our field5) is 24 bytes instead of 32 — the first 8 are in field4, just as the C abstract machine expects. Thus, LLVM has correctly modeled C’s union semantics without unions of its own.

So, whence index 9? Well, remember that we’re dealing with a [24 x i8], i.e. the last 24 bytes of field5. In other words, by virtue of our previous i32 1 index, we’ve already moved past the first sizeof(i64) == 8 bytes of field5, and 8 + 9 == 17! Our index! LLVM hasn’t let us down.

But wait: what happens if we try to grab a part of field5 that overlaps with field4, like field5[6]?

1
%0 = load i8, i8* getelementptr inbounds ([32 x i8], [32 x i8]* bitcast (%union.anon* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2) to [32 x i8]*), i64 0, i64 6), align 2

So, we have a nested getelementptr expression, but it’s not nearly as bad as it looks:

In other words: we sometimes need two (or more) getelementptrs7 to compute the effective addresses of union fields.

Vectorized address computation

To my embarrassment, the first structure example above took me about 30 minutes to understand. So why stop the beating education there?

As mentioned all the way at the beginning, getelementptr can take and return entire vectors of addresses and indexes.

I spent about an hour trying to get LLVM to emit a vectorized getelementptr, and failed miserably. So I’m just going to contrive a few.

First, let’s look at a basic case: indexing every address in a vector with a constant scalar value:

1
%foo = getelementptr i64, <8 x i64*> %tbl, i64 0

Observe that the first argument is not a vector: it’s still just the type of thing we’re computing addresses for.

The second is where we introduce our vector: 8 i64s. More precisely, it’s a pointer in this context: even with vectors, getelementptr expects its base address to be a pointer.

Finally, our constant scalar 0. This gets multiplexed across every member of %tbl, amounting to:

1
2
3
4
5
6
7
8
long *foo[8];
long tbl[8];

foo[0] = &tbl[0];
foo[1] = &tbl[1];
foo[2] = &tbl[2];
// ...
foo[7] = &tbl[7];

Voilà! 8 address computations in one!

Now let’s do it with a non-constant scalar:

1
%bar = getelementptr i64, <8 x i64*> %tbl, i64 %offset

If you understood the above, then this one shouldn’t be too big of a next step: instead of always grabbing index 0 in %tbl, we’re grabbing whatever value is in %offset. In effect:

1
2
3
4
5
6
7
8
9
long *foo[8];
long tbl[8];
long offset;

foo[0] = &tbl[offset];
foo[1] = &tbl[offset];
foo[2] = &tbl[offset];
// ...
foo[7] = &tbl[offset];

So that’s scalars indexes. What about other vectors?

1
%bar = getelementptr i64, <8 x i64*> %tbl, <8 x i64> %offsets

First, observe that %offsets is the same size as %tbl: they’re both <8 x i64>. This is an invariant of the vectorized form: you can’t use differently sized vectors in getelementptr.

This is because getelementptr is simple: it doesn’t do anything matrix-y, it just maps each member of the base (or intermediate) vector to its corresponding index:

1
2
3
4
5
6
7
8
long *foo[8];
long tbl[8], offsets[8];

foo[0] = &tbl[offsets[0]];
foo[1] = &tbl[offsets[1]];
foo[2] = &tbl[offsets[2]];
// ...
foo[7] = &tbl[offsets[7]];

To cap things off: remember that there’s no limit on the number of index arguments that a getelementptr may have, and that they don’t have to be in any particular order:

1
%bar = getelementptr i64, <8 x i64*> %tbl, <8 x i64> %offsets, i64 %x, <8 x i64> %more_offsets, i64 6

corresponds to:

1
2
3
4
5
6
7
8
9
long *foo[8];
long tbl[8], offsets[8];
long x;

foo[0] = &tbl[offsets[0][x][0][6]];
foo[1] = &tbl[offsets[1][x][1][6]];
foo[2] = &tbl[offsets[2][x][2][6]];
// ...
foo[7] = &tbl[offsets[7][x][3][6]];

Other bits

I haven’t talked at all about some of the possible keywords on the getelementptr instruction, like inbounds and inrange.

In brief:

Wrapup

All told: getelementptr is a little unintuitive and verbose, but not nearly as bad as I thought it was.

It’s also extremely precise: getelementptr tells us exactly what kind of address it’s calculating, where it’s beginning, and the exact “path” through the aggregate that results in a final effective address. It’s generic and flexible, just like LLVM itself.

Now I just need to understand the SelectionDAG.


  1. And, by a significant margin, the only serious open-source competitor against GCC

  2. Very literally: llvm::Function et al. are derived from llvm::Value

  3. You’ll probably notice that we load from an i64**, despite nums being just an i64*. This is more LLVM semantic funniness: load takes a <ty>* for its source, so loading an i64* requires i64**

  4. If you’re a stickler, you’ll point out that these inline GEPs are not technically instructions in LLVM-land, but actually constant expressions (i.e. llvm::GetElementPtrConstantExpr instead of llvm::GetElementPtrInst). But the semantics of the two are identical for my explanatory purposes, so sue me. 

  5. Sure enough, the LLVM docs explain this

  6. I have no clue as to why LLVM switches between i64 and i32 in this GEP. Most LLVM documentation seems to suggest that it just doesn’t matter. 

  7. But again, for the pedants: these are constant expressions instead of “real” GEPs, so they’re essentially free. 

  8. Without inbounds, getelementptr will always return some result address, which may or may not be valid. 


Discussions: Reddit