Oct 23, 2020 Tags: llvm, programming
With thanks to Niki Carroll, winny, and kurufu for their invaluable proofreading and advice.
By popular demand, I’m doing another LLVM post. This time, it’s static single assignment (or SSA) form, a common feature in the intermediate representations of optimizing compilers.
Like the last one, SSA is a topic in compiler and IR design that I mostly understand but could benefit from some self-guided education on. So here we are.
At the highest level, a compiler’s job is singular: to turn some source language input into some machine language output. Internally, this breaks down into a sequence of clearly delineated1 tasks, culminating in the generation of machine code.
In a single-pass compiler, that final code generation task is monolithic: machine code is generated as the compiler walks the AST, with no revisiting of previously generated code. This is extremely fast (in terms of compiler performance) in exchange for a few significant limitations:
Optimization potential: because machine code is generated in a single pass, it can’t be revisited for optimizations. Single-pass compilers tend to generate extremely slow and conservative machine code.
By way of example: the
System V ABI
(used by Linux and macOS) defines
a special 128-byte region beyond the
current stack pointer (%rsp
) that can be used by leaf functions whose stack frames fit within
it. This, in turn, saves a few stack management instructions in the function prologue and
epilogue.
A single-pass compiler will struggle to take advantage of this ABI-supplied optimization: it needs to emit a stack slot for each automatic variable as they’re visited, and cannot revisit its function prologue for erasure if all variables fit within the red zone. (A small sketch of such a leaf function appears a bit further below.)
Language limitations: single-pass compilers struggle with common language design decisions, like allowing use of identifiers before their declaration or definition. For example, the following is valid C++:
class Rect {
public:
int area() { return width() * height(); }
int width() { return 5; }
int height() { return 5; }
};
C and C++ generally require pre-declaration and/or definition for identifiers, but
member function bodies may reference the entire class scope. This will frustrate a
single-pass compiler, which expects Rect::width
and Rect::height
to already exist
in some symbol lookup table for call generation.
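To make the red zone point from the first limitation concrete, here is a minimal sketch (the function and its names are mine, purely for illustration) of the kind of leaf function that benefits: its only local fits comfortably within the 128 bytes below %rsp, so an ABI-aware optimizing compiler can address it without ever adjusting the stack pointer, while a single-pass compiler has already committed to a full prologue by the time the frame size is known:
/* Hypothetical leaf function: it calls nothing, and its frame fits in the red zone. */
static int sum_small(int a, int b, int c) {
    int partial = a + b; /* can live below %rsp (or simply in a register) */
    return partial + c;
}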
Consequently, (virtually) all modern compilers are multi-pass.
“multi-pass” — Leeloo Dallas
Multi-pass compilers break the translation phase down even more: the AST is first lowered into an intermediate representation (IR), optimization passes are run over that IR, and the optimized IR is finally lowered to machine code.
So, we want an IR that’s easy to correctly transform and that’s amenable to optimization. Let’s talk about why IRs that have the static single assignment property fill that niche.
At its core, the SSA form of any source program introduces only one new constraint: all variables are assigned (i.e., stored to) exactly once.
By way of example: the following (not actually very helpful) function is not in a valid SSA form
with respect to the flags
variable:
int helpful_open(char *fname) {
int flags = O_RDWR;
if (!access(fname, F_OK)) {
flags |= O_CREAT;
}
int fd = open(fname, flags, 0644);
return fd;
}
Why? Because flags
is stored to twice: once for initialization, and (potentially) again inside
the conditional body.
As programmers, we could rewrite helpful_open
to only ever store once to each automatic variable:
int helpful_open(char *fname) {
if (!access(fname, F_OK)) {
int flags = O_RDWR | O_CREAT;
return open(fname, flags, 0644);
} else {
int flags = O_RDWR;
return open(fname, flags, 0644);
}
}
But this is clumsy and repetitive: we essentially need to duplicate every chain of uses that follow any variable that is stored to more than once. That’s not great for readability, maintainability, or code size.
So, we do what we always do: make the compiler do the hard work for us. Fortunately, there exists a transformation from every valid program into an equivalent SSA form, governed by just two simple rules.
Rule #1: Whenever we see a store to an already-stored variable, we replace it with a brand new “version” of that variable.
Using rule #1 and the example above, we can rewrite flags
using _N
suffixes to indicate versions:
int helpful_open(char *fname) {
int flags_0 = O_RDWR;
// Declared up here to avoid dealing with C scopes.
int flags_1;
if (!access(fname, F_OK)) {
flags_1 = flags_0 | O_CREAT;
}
int fd = open(fname, flags_1, 0644);
return fd;
}
But wait a second: we’ve made a mistake!
open(..., flags_1, ...) is incorrect: it unconditionally assigns O_CREAT, which wasn’t in the original function semantics. open(..., flags_0, ...) is also incorrect: it never assigns O_CREAT, and thus is wrong for the same reason.
So, what do we do? We use rule #2!
Rule #2: Whenever we need to choose a variable based on control flow, we use the Phi function (φ) to introduce a new variable based on our choice.
Using our example once more:
int helpful_open(char *fname) {
int flags_0 = O_RDWR;
// Declared up here to avoid dealing with C scopes.
int flags_1;
if (!access(fname, F_OK)) {
flags_1 = flags_0 | O_CREAT;
}
int flags_2 = φ(flags_0, flags_1);
int fd = open(fname, flags_2, 0644);
return fd;
}
Our quandary is resolved: open
always takes flags_2
, where flags_2
is a fresh SSA variable
produced by applying φ to flags_0
and flags_1
.
Observe, too, that φ is a symbolic function: compilers that use SSA forms internally do not emit real φ functions in generated code5. φ exists solely to reconcile rule #1 with the existence of control flow.
As such, it’s a little bit silly to talk about SSA forms with C examples (since C and other high-level languages are what we’re translating from in the first place). Let’s dive into how LLVM’s IR actually represents them.
First of all, let’s see what happens when we run our very first helpful_open
through clang
with no optimizations:
define dso_local i32 @helpful_open(i8* %fname) #0 {
entry:
%fname.addr = alloca i8*, align 8
%flags = alloca i32, align 4
%fd = alloca i32, align 4
store i8* %fname, i8** %fname.addr, align 8
store i32 2, i32* %flags, align 4
%0 = load i8*, i8** %fname.addr, align 8
%call = call i32 @access(i8* %0, i32 0) #4
%tobool = icmp ne i32 %call, 0
br i1 %tobool, label %if.end, label %if.then
if.then: ; preds = %entry
%1 = load i32, i32* %flags, align 4
%or = or i32 %1, 64
store i32 %or, i32* %flags, align 4
br label %if.end
if.end: ; preds = %if.then, %entry
%2 = load i8*, i8** %fname.addr, align 8
%3 = load i32, i32* %flags, align 4
%call1 = call i32 (i8*, i32, ...) @open(i8* %2, i32 %3, i32 420)
store i32 %call1, i32* %fd, align 4
%4 = load i32, i32* %fd, align 4
ret i32 %4
}
(View it on Godbolt.)
So, we call open
with %3
, which comes from…a load
from an i32*
named %flags
?
Where’s the φ?
This is something that consistently slips me up when reading LLVM’s IR: only values,
not memory, are in SSA form. Because we’ve compiled with optimizations disabled, %flags
is
just a stack slot that we can store
into as many times as we please, and that’s exactly what
LLVM has elected to do above.
As such, LLVM’s SSA-based optimizations aren’t all that useful when passed IR that makes direct use of stack slots. We want to maximize our use of SSA variables, whenever possible, to make future optimization passes as effective as possible.
This is where mem2reg
comes in:
This file (optimization pass) promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.
(Parenthetical mine.)
mem2reg
gets run at -O1
and higher, so let’s do exactly that:
define dso_local i32 @helpful_open(i8* nocapture readonly %fname) local_unnamed_addr #0 {
entry:
%call = call i32 @access(i8* %fname, i32 0) #4
%tobool.not = icmp eq i32 %call, 0
%spec.select = select i1 %tobool.not, i32 66, i32 2
%call1 = call i32 (i8*, i32, ...) @open(i8* %fname, i32 %spec.select, i32 420) #4, !dbg !22
ret i32 %call1, !dbg !23
}
Foiled again! Our stack slots are gone thanks to mem2reg
, but LLVM has actually optimized
too far: it figured out that our flags value is wholly dependent on the return value of our
access
call and erased the conditional entirely.
Instead of a φ node, we got this select
:
%spec.select = select i1 %tobool.not, i32 66, i32 2
which the LLVM Language Reference describes concisely:
The ‘select’ instruction is used to choose one value based on a condition, without IR-level branching.
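In C terms, the effect is roughly a conditional expression in place of our if (a loose analogy of mine, not something clang prints back; unlike a C ternary, a select is branch-free by definition at the IR level):
/* Loose C-level analogy; assumes <fcntl.h> and <unistd.h>, like the examples above. */
int helpful_open_selected(char *fname) {
    int flags = (access(fname, F_OK) == 0) ? (O_RDWR | O_CREAT) : O_RDWR;
    return open(fname, flags, 0644);
}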
So we need a better example. Let’s do something that LLVM can’t trivially optimize into a
select
(or sequence of select
s), like adding an else if
with a function that we’ve only
provided the declaration for:
size_t filesize(char *);
int helpful_open(char *fname) {
int flags = O_RDWR;
if (!access(fname, F_OK)) {
flags |= O_CREAT;
} else if (filesize(fname) > 0) {
flags |= O_TRUNC;
}
int fd = open(fname, flags, 0644);
return fd;
}
produces:
define dso_local i32 @helpful_open(i8* %fname) local_unnamed_addr #0 {
entry:
%call = call i32 @access(i8* %fname, i32 0) #5
%tobool.not = icmp eq i32 %call, 0
br i1 %tobool.not, label %if.end4, label %if.else
if.else: ; preds = %entry
%call1 = call i64 @filesize(i8* %fname) #5
%cmp.not = icmp eq i64 %call1, 0
%spec.select = select i1 %cmp.not, i32 2, i32 514
br label %if.end4
if.end4: ; preds = %if.else, %entry
%flags.0 = phi i32 [ 66, %entry ], [ %spec.select, %if.else ]
%call5 = call i32 (i8*, i32, ...) @open(i8* %fname, i32 %flags.0, i32 420) #5
ret i32 %call5
}
(View it on Godbolt.)
That’s more like it! Here’s our magical φ:
%flags.0 = phi i32 [ 66, %entry ], [ %spec.select, %if.else ]
LLVM’s phi
is slightly more complicated than the φ(flags_0, flags_1)
that I made up before, but
not by much: it takes a list of pairs (two, in this case), with each pair containing a possible
value and that value’s originating basic block (which, by construction, is always a predecessor
block in the context of the φ node).
The Language Reference backs us up:
<result> = phi [fast-math-flags] <ty> [ <val0>, <label0>], ...
The type of the incoming values is specified with the first type field. After this, the ‘phi’ instruction takes a list of pairs as arguments, with one pair for each predecessor basic block of the current block. Only values of first class type may be used as the value arguments to the PHI node. Only labels may be used as the label arguments.
There must be no non-phi instructions between the start of a basic block and the PHI instructions: i.e. PHI instructions must be first in a basic block.
Observe, too, that LLVM is still being clever: one of our φ choices is a computed
select
(%spec.select
), so LLVM still managed to partially erase the original control flow.
So that’s cool. But there’s a piece of control flow that we’ve conspicuously ignored.
What about loops?
int do_math(int count, int base) {
for (int i = 0; i < count; i++) {
base += base;
}
return base;
}
yields:
define dso_local i32 @do_math(i32 %count, i32 %base) local_unnamed_addr #0 {
entry:
%cmp5 = icmp sgt i32 %count, 0
br i1 %cmp5, label %for.body, label %for.cond.cleanup
for.cond.cleanup: ; preds = %for.body, %entry
%base.addr.0.lcssa = phi i32 [ %base, %entry ], [ %add, %for.body ]
ret i32 %base.addr.0.lcssa
for.body: ; preds = %entry, %for.body
%i.07 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%base.addr.06 = phi i32 [ %add, %for.body ], [ %base, %entry ]
%add = shl nsw i32 %base.addr.06, 1
%inc = add nuw nsw i32 %i.07, 1
%exitcond.not = icmp eq i32 %inc, %count
br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !26
}
Not one, not two, but three φs! In order of appearance:
Because we supply the loop bounds via count
, LLVM has no way to ensure that we actually enter
the loop body. Consequently, our very first φ selects between the initial %base
and %add
.
LLVM’s phi
syntax helpfully tells us that %base
comes from the entry block and %add
from
the loop, just as we expect. I have no idea why LLVM selected such a hideous name for the resulting
value (%base.addr.0.lcssa
).
Our index variable is initialized once and then updated with each for
iteration, so it also
needs a φ. Our selections here are %inc
(which each body computes from %i.07
) and the 0
literal (i.e., our initialization value).
Finally, the heart of our loop body: we need to get base
, where base
is either the initial
base value (%base
) or the value computed as part of the prior loop (%add
). One last φ gets
us there.
The rest of the IR is bookkeeping: we need separate SSA variables to compute the addition (%add
),
increment (%inc
), and exit check (%exitcond.not
) with each loop iteration.
(View it on Godbolt.)
So now we know what an SSA form is, and how LLVM represents it6. Why should we care?
As I briefly alluded to early in the post, it comes down to optimization potential: the SSA forms of programs are particularly suited to a number of effective optimizations.
Let’s go through a select few of them.
One of the simplest things that an optimizing compiler can do is remove code that cannot possibly be executed. This makes the resulting binary smaller (and usually faster, since more of it can fit in the instruction cache).
“Dead” code falls into several categories7, but a common one is assignments that cannot affect program behavior, like redundant initialization:
int main(void) {
int x = 100;
if (rand() % 2) {
x = 200;
} else if (rand() % 2) {
x = 300;
} else {
x = 400;
}
return x;
}
Without an SSA form, an optimizing compiler would need to check whether any use of x
reaches
its original definition (x = 100
). Tedious. In SSA form, the impossibility of that is obvious:
int main(void) {
int x_0 = 100;
// Just ignore the scoping. Computers aren't real life.
if (rand() % 2) {
int x_1 = 200;
} else if (rand() % 2) {
int x_2 = 300;
} else {
int x_3 = 400;
}
return φ(x_1, x_2, x_3);
}
And sure enough, LLVM eliminates the initial assignment of 100
entirely:
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
%call = call i32 @rand() #3
%0 = and i32 %call, 1
%tobool.not = icmp eq i32 %0, 0
br i1 %tobool.not, label %if.else, label %if.end6
if.else: ; preds = %entry
%call1 = call i32 @rand() #3
%1 = and i32 %call1, 1
%tobool3.not = icmp eq i32 %1, 0
%. = select i1 %tobool3.not, i32 400, i32 300
br label %if.end6
if.end6: ; preds = %if.else, %entry
%x.0 = phi i32 [ 200, %entry ], [ %., %if.else ]
ret i32 %x.0
}
(View it on Godbolt.)
Compilers can also optimize a program by substituting uses of a constant variable for the constant value itself. Let’s take a look at another blob of C:
int some_math(int x) {
int y = 7;
int z = 10;
int a;
if (rand() % 2) {
a = y + z;
} else if (rand() % 2) {
a = y + z;
} else {
a = y - z;
}
return x + a;
}
As humans, we can see that y
and z
are trivially assigned and never modified8. For the compiler,
however, this is a variant of the reaching definition problem from above: before it can
replace y
and z
with 7
and 10
respectively, it needs to make sure that y
and z
are never
assigned a different value.
Let’s do our SSA reduction:
int some_math(int x) {
int y_0 = 7;
int z_0 = 10;
int a_0;
if (rand() % 2) {
int a_1 = y_0 + z_0;
} else if (rand() % 2) {
int a_2 = y_0 + z_0;
} else {
int a_3 = y_0 - z_0;
}
int a_4 = φ(a_1, a_2, a_3);
return x + a_4;
}
This is virtually identical to our original form, but with one critical difference:
the compiler can now see that every load of y
and z
is the original assignment. In other
words, they’re all safe to replace!
int some_math(int x) {
int y = 7;
int z = 10;
int a_0;
if (rand() % 2) {
int a_1 = 7 + 10;
} else if (rand() % 2) {
int a_2 = 7 + 10;
} else {
int a_3 = 7 - 10;
}
int a_4 = φ(a_1, a_2, a_3);
return x + a_4;
}
So we’ve gotten rid of a few potential register operations, which is nice. But here’s the really critical part: we’ve set ourselves up for several other optimizations:
Now that we’ve propagated some of our constants, we can do some trivial constant folding: 7 + 10 becomes 17, and so forth (the sketch below shows where this leaves us).
In SSA form, it’s trivial to observe that only x
and a_{1..4}
can affect the program’s behavior.
So we can apply our dead code elimination from above and delete y
and z
entirely!
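Putting those two observations together, here is roughly where the example lands (my sketch, written in the same C-with-φ pseudocode as before rather than anything a compiler literally emits):
int some_math(int x) {
  // y and z are dead and deleted; their uses have been folded into constants.
  if (rand() % 2) {
    int a_1 = 17;
  } else if (rand() % 2) {
    int a_2 = 17;
  } else {
    int a_3 = -3;
  }
  int a_4 = φ(a_1, a_2, a_3);
  return x + a_4;
}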
This is the real magic of an optimizing compiler: each individual optimization is simple and largely independent, but together they produce a virtuous cycle that can be repeated until gains diminish.
(Just one potential optimization cycle.)
Register allocation (alternatively: register scheduling) is less of an optimization itself, and more of an unavoidable problem in compiler engineering: it’s fun to pretend to have access to an infinite number of addressable variables, but the compiler eventually insists that we boil our operations down to a small, fixed set of CPU registers.
The constraints and complexities of register allocation vary by architecture: x86 (prior to AMD64) is notoriously starved for registers9 (only 8 full general purpose registers, of which 6 might be usable within a function’s scope10), while RISC architectures typically employ larger numbers of registers to compensate for the lack of register-memory operations.
Just as above, reductions to SSA form have both indirect and direct advantages for the register allocator:
Indirectly: Eliminations of redundant loads and stores reduces the overall pressure on the register allocator, allowing it to avoid expensive spills (i.e., having to temporarily transfer a live register to main memory to accommodate another instruction).
Directly: Compilers have historically lowered φs into copies before register allocation, meaning that register allocators traditionally haven’t benefited from the SSA form itself11. There is, however, (semi-)recent research on direct application of SSA forms to both linear and coloring allocators1213. (A toy illustration of the copy-based lowering follows below.)
A concrete example: modern JavaScript engines use JITs to accelerate program evaluation. These JITs frequently use linear register allocators for their tradeoff between register selection speed (linear, as the name suggests) and acceptable register scheduling. Converting out of SSA form is a time-consuming operation of its own, so linear allocation on the SSA representation itself is appealing in JITs and other contexts where compile time is part of execution time.
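To make the copy-based lowering of φs concrete, here is a toy, hand-waved illustration (plain C of my own, not LLVM’s actual lowering machinery) using the earlier flags example. The φ becomes an ordinary variable that receives a copy on each incoming path, which the register allocator can then treat like any other value:
int helpful_open_lowered(char *fname) {
    int flags_0 = O_RDWR;
    int flags_2;                     /* destination of the former φ */
    if (!access(fname, F_OK)) {
        int flags_1 = flags_0 | O_CREAT;
        flags_2 = flags_1;           /* copy inserted on this incoming edge */
    } else {
        flags_2 = flags_0;           /* copy inserted on the other edge */
    }
    return open(fname, flags_2, 0644);
}
In practice, a copy-coalescing pass will usually fold most of these copies away again before final code is emitted.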
There are many things about SSA that I didn’t cover in this post: dominance frontiers, tradeoffs between “pruned” and less optimal SSA forms, and feedback mechanisms between the SSA form of a program and the compiler’s decision to cease optimizing, among others. Each of these could be its own blog post, and maybe will be in the future!
In the sense that each task is conceptually isolated and has well-defined inputs and outputs. Individual compilers have some flexibility with respect to whether they combine or further split the tasks. ↩
The distinction between an AST and an intermediate representation is hazy: Rust converts its AST to HIR early in the compilation process, and languages can be designed to have ASTs that are amenable to analyses that would otherwise be best done on an IR. ↩
This can be broken up into lexical validation (e.g. use of an undeclared identifier) and semantic validation (e.g. incorrect initialization of a type). ↩
This is what LLVM does: LLVM IR is lowered to MIR (not to be confused with Rust’s MIR), which is subsequently lowered to machine code. ↩
Not because they can’t: the SSA form of a program can be executed by evaluating φ with concrete control flow. ↩
We haven’t talked at all about minimal or pruned SSAs, and I don’t plan on doing so in this post. The TL;DR of them: naïve SSA form generation can lead to lots of unnecessary φ nodes, impeding analyses. LLVM (and GCC, and anything else that uses SSAs probably) will attempt to translate any initial SSA form into one with a minimally viable number of φs. For LLVM, this is tied directly to the rest of mem2reg
. ↩
Including removing code that has undefined behavior in it, since “doesn’t run at all” is a valid consequence of invoking UB. ↩
And are also function scoped, meaning that another translation unit can’t address them. ↩
x86 makes up for this by not being a load-store architecture: many instructions can pay the price of a memory round-trip in exchange for saving a register. ↩
Assuming that %esp
and %ebp
are being used by the compiler to manage the function’s frame. ↩
LLVM, for example, lowers all φs as one of its very first preparations for register allocation. See this 2009 LLVM Developers’ Meeting talk. ↩
Wimmer 2010a: “Linear Scan Register Allocation on SSA Form” (PDF) ↩
Hack 2005: “Towards Register Allocation for Programs in SSA-form” (PDF) ↩