Static calls in Linux 5.10

Dec 16, 2020     Tags: c, curiosity, programming, security, x86    


I was reading the Linux 5.10 release summary on KernelNewbies, and a section stood out to me:

1.6. Static calls for improved post-Spectre performance

Static calls are a replacement for global function pointers. They use code patching to allow direct calls to be used instead of indirect calls. They give the flexibility of function pointers, but with improved performance. This is especially important for cases where retpolines would otherwise be used, as retpolines can significantly impact performance.

I’ve spent a lot of time looking at the Linux kernel, but never directly at its indirect call setup or post-Spectre mitigations. These changes sound very cool, so I’m going to use this post to try to understand and explain them (both to myself and to others).

Update: One of the original authors of the patchset has emailed me with some corrections and answers to the questions that I ask below. I’ve marked each with either “Correction” or “Update.” Thanks, Peter!

Background: indirect calls, Spectre, and retpolines

Indirect calls

Indirect calls are one of C’s most powerful language features, and are critical for writing higher-order code without a supplementary object or function/method dispatch system.

Most C programmers are familiar with the basics of indirect calls, thanks to standard and POSIX functions like qsort and pthread_create: each takes a function pointer, which it then calls internally to complete the functionality of the surrounding call:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* qsort_strcmp is just the normal stdlib strcmp, with a bit of extra parameter
 * munging to match qsort's API.
 */
static int qsort_strcmp(const void *a, const void *b) {
    return strcmp(*(const char **)a, *(const char **)b);
}

int main(void) {
    const char *strings[] = {"foo", "bar", "baz"};

    /* qsort is a generic sorting function:
     * you give it a pointer to the base address of the things to sort,
     * their number and individual sizes, and a *function* that can compare
     * any two members and provide an ordering between them.
     *
     * in this case, we tell qsort to sort an array of strings, using
     * `qsort_strcmp` for the ordering.
     */
    qsort(&strings, 3, sizeof(char *), qsort_strcmp);

    printf("%s %s %s\n", strings[0], strings[1], strings[2]);
    return 0;
}

(View it on Godbolt).

In this case, the indirect call occurs within qsort. But we can see it directly if we implement our own function that does an indirect call:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/random.h>

static uint32_t good_rand() {
    uint32_t x;
    getrandom(&x, sizeof(x), GRND_NONBLOCK);
    return x;
}

static uint32_t bad_rand() {
    return rand();
}

/* munge takes a function pointer, rand_func, which it calls
 * as part of its returned result.
 */
static uint32_t munge(uint32_t (*rand_func)(void)) {
    return rand_func() & 0xFF;
}

int main(void) {
    uint32_t x = munge(good_rand);
    uint32_t y = munge(bad_rand);

    printf("%ul, %ul\n", x, y);

    return 0;
}

where munge boils down to:

munge:
  push    rbp
  mov     rbp, rsp
  sub     rsp, 16
  mov     rbp, rsp
  sub     rsp, 16
  mov     qword ptr [rbp - 8], rdi  ; store rand_func on the stack
  call    qword ptr [rbp - 8]       ; call rand_func
  and     eax, 255
  add     rsp, 16
  pop     rbp
  ret

(View it on Godbolt).

Observe: our call goes through a memory or register operand ([rbp - 8])1 to get the target, instead of a direct target specified by the operand value itself (like, say, call 0xacabacab ; @good_rand). That’s what makes it indirect.

But we can go even further than this! Indeed, a common pattern in C is to declare entire structures of operations, using each to parametrize a lower level set of behaviors (for example, the core POSIX I/O APIs) over independent implementations.

This is exactly how FUSE works: every FUSE client creates its own fuse_operations:

struct fuse_operations {
  int (*getattr) (const char *, struct stat *, struct fuse_file_info *fi);
  int (*readlink) (const char *, char *, size_t);
  int (*mknod) (const char *, mode_t, dev_t);
  int (*mkdir) (const char *, mode_t);
  int (*unlink) (const char *);
  int (*rmdir) (const char *);
  int (*symlink) (const char *, const char *);
  int (*rename) (const char *, const char *, unsigned int flags);
  int (*link) (const char *, const char *);
  /* ... */
  int (*open) (const char *, struct fuse_file_info *);
  int (*read) (const char *, char *, size_t, off_t,
         struct fuse_file_info *);
  int (*write) (const char *, const char *, size_t, off_t,
          struct fuse_file_info *);
  int (*statfs) (const char *, struct statvfs *);
  /* ... */
};

Unsurprisingly, this technique isn’t limited to userspace: the Linux kernel itself makes aggressive use of indirect calls, particularly in architecture-agnostic interfaces (like the VFS and sub-specializations like procfs) and the architecture-specific internals of subsystems like perf_events.

So that’s neat. It’s so neat that CPU engineers got all hot in the pants trying to squeeze extra performance out of them2, and we ended up with Spectre v2.

Spectre (v2)

The Spectre logo

The exact mechanism that Spectre v2 (also known as CVE-2017-5715) exploits is slightly out of scope of this post, but at a high level:

  1. Modern (x86) CPUs contain an indirect branch predictor, which attempts to guess the target of an indirect call or jump.

    To actually speed things up, the CPU speculatively executes the predicted branch: it begins running the guessed target before the real target is resolved, and throws that work away if the guess turns out to be wrong.

    In other words: the CPU is responsible for rolling back any side effects associated with a misprediction and the subsequent speculation. Mis-speculation is a microarchitectural detail that should not manifest in architectural changes, like modified registers.

  2. Rolling back any mis-speculated state is a relatively expensive operation, with a lot of microarchitectural implications: cache lines and other bits of state need to be fixed up so that the actual program control flow isn’t tainted by failed speculations.

    In practice, rolling back the entire speculated state would undo most of the advantages of speculating in the first place. Instead, x86 and other ISAs will just mark (many of) the bits of speculated state (like cache lines) as stale.

  3. This fixup behavior (either reverting or marking speculated state) results in a side-channel: an attacker can train the branch predictor to speculatively execute a bit of code (not unlike a ROP gadget) that modifies a piece of microarchitectural state in a data-dependent manner, such as a cache entry whose address is dependent on a secret value that was speculatively fetched.

    The attacker can then probe that microarchitectural state by timing access to it: fast accesses indicate a speculatively modified state, disclosing the secret.

The original Spectre v2 attack focused on cache lines since they’re relatively easy to time, even from high level (and sandboxed!) languages that don’t have access to clflush or other cache-line primitives on x86. But the concept is a general one: it’s difficult to execute speculatively without leaking some information, and subsequent vulnerabilities (like MDS and ZombieLoad) have exposed information leaks in other microarchitectural features.
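
To make the “probe it by timing” step concrete, here’s a minimal userspace sketch of just that last step (not a working exploit): flush a probe array, pretend a speculative gadget has touched probe[secret * 4096], and then time each line to see which one comes back fast. The probe array, the 4096-byte stride, the cycle threshold, and the elided training/trigger step are all made up for illustration.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h> /* _mm_clflush, __rdtscp */

/* One cache line (padded out to a page) per possible secret byte value. */
static uint8_t probe[256 * 4096];

/* Time a single load; a fast load means the line was already cached,
 * i.e. something (speculatively) touched it. */
static uint64_t time_load(volatile uint8_t *addr) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;
    return __rdtscp(&aux) - start;
}

int main(void) {
    /* 1. Evict every probe line from the cache. */
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * 4096]);

    /* 2. (Elided) train the indirect branch predictor and trigger the
     *    victim's speculative gadget, which does something like
     *    probe[secret * 4096]. */

    /* 3. Time every line; the fast one leaks `secret`. */
    for (int i = 0; i < 256; i++) {
        uint64_t t = time_load(&probe[i * 4096]);
        if (t < 100) /* threshold is machine-dependent */
            printf("candidate secret byte: 0x%02x (%llu cycles)\n",
                   i, (unsigned long long)t);
    }

    return 0;
}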

This is bad news: an attacker running in one of the safest contexts (JavaScript or other managed code, in a sandbox, in userspace) can conceivably train the indirect branch predictor to speculatively execute a gadget in kernelspace, potentially disclosing kernel memory.

So, the kernel needed a new mitigation. That mitigation is retpolines.

Retpolines

To mitigate Spectre v2, the kernel needs to prevent the CPU from speculating on an attacker controlled indirect branch.

A retpoline (short for return trampoline) does exactly that: indirect jumps and calls are surrounded by a little thunk that effectively traps the speculated execution in an infinite loop, spinning it until the misprediction is resolved.

Intel’s Retpoline whitepaper has some helpful illustrations:

A visualization of speculative execution with and without a retpoline.

This works by converting the indirect control flow from an indirect branch to an indirect return3, hence the “ret” in retpoline. Returns are also predicted, but with an additional mechanism given priority: the Return Stack Buffer4. To ensure that the RSB can’t be maliciously trained away from the infinite loop, the retpoline begins with a direct CALL that primes the RSB to always5 predict the infinite loop.

Here’s what an indirect call retpoline actually looks like6, simplified significantly from the kernel source:

__x86_retpoline_rax:
  call .Ldo_rop_0
.Lspec_trap_0:
  pause
  lfence
  jmp .Lspec_trap_0
.Ldo_rop_0:
  mov [rsp], rax
  ret

…all of that to replace a simple call rax!
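
For context, here’s roughly what the call site itself looks like before and after the mitigation. some_ops is a made-up ops structure here, and the thunk’s exact name depends on who emits it: compilers use names like __x86_indirect_thunk_rax, while the kernel’s own copy is the __x86_retpoline_rax above.

  ; without retpolines: a plain indirect call, predicted via the BTB
  mov   rax, [some_ops + 8]   ; load the function pointer
  call  rax

  ; with retpolines: the indirect call becomes a *direct* call into the thunk
  mov   rax, [some_ops + 8]
  call  __x86_retpoline_rax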

Consequences

There are repercussions for this kind of trickery: every indirect call now has to detour through the thunk, and the CPU’s indirect branch prediction (the very thing the hardware added for speed) is deliberately thrown away, with mis-speculation spinning in that pause/lfence loop until the real target resolves. That’s why retpolines can significantly impact performance.

Let’s see what Linux 5.10 has done to remove some of this pain.

Static calls

Let’s check out this new “static call” technique. Here’s the API, direct from Josh Poimboeuf’s patch series:

/* declare or define a new static call as `name`,
 * initially associated with `func`
 */
DECLARE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL(name, func);

/* invoke `name` with `args` */
static_call(name)(args...);

/* invoke `name` with `args` if not NULL */
static_call_cond(name)(args...);

/* update the underlying function */
static_call_update(name, func);

and an (abbreviated and annotated) example in perf, pulled from the same series:

static void _x86_pmu_add(struct perf_event *event) { }

/* ... */
DEFINE_STATIC_CALL(x86_pmu_add, _x86_pmu_add);

static void x86_pmu_static_call_update(void) {
  /* ... */
  static_call_update(x86_pmu_add, x86_pmu.add);
  /* ... */
}

static int __init init_hw_perf_events(void) {
  /* ... */
  x86_pmu_static_call_update();
  /* ... */
}

To summarize: perf defines an x86_pmu_add static call whose default target is _x86_pmu_add (an empty stub), then points it at the real x86_pmu.add implementation via static_call_update during initialization.

As a result, a call that was previously:

if (x86_pmu.add) {
  x86_pmu.add(event);
}

is now:

static_call_cond(x86_pmu_add)(event);

That’s pretty clean, and it (apparently) avoids the need for the retpoline! Let’s find out why and how.

How it works

Using x86_pmu_add as our guinea pig:

Step 1: DEFINE_STATIC_CALL

On x86, DEFINE_STATIC_CALL(x86_pmu_add, _x86_pmu_add) expands roughly to this:

extern struct static_call_key __SCK__x86_pmu_add;
extern typeof(_x86_pmu_add) __SCT__x86_pmu_add_tramp;

struct static_call_key __SCK__x86_pmu_add = {
  .func = _x86_pmu_add,
};

ARCH_DEFINE_STATIC_CALL_TRAMP(x86_pmu_add, _x86_pmu_add);

What about ARCH_DEFINE_STATIC_CALL_TRAMP?

#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insns)      \
  asm(".pushsection .static_call.text, \"ax\"   \n" \
      ".align 4           \n" \
      ".globl " STATIC_CALL_TRAMP_STR(name) "   \n" \
      STATIC_CALL_TRAMP_STR(name) ":      \n" \
      insns "           \n" \
      ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \
      ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
      ".popsection          \n")

#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)     \
  __ARCH_DEFINE_STATIC_CALL_TRAMP(name, ".byte 0xe9; .long " #func " - (. + 4)")

Ah shit, here we go again.

Sigh. That expands (with reformatting) to:

  .pushsection .static_call.text, "ax"
  .align 4
  .globl "__SCT__x86_pmu_add_tramp"
"__SCT__x86_pmu_add_tramp":
  .byte 0xe9
  .long _x86_pmu_add - (. + 4)
  .type "__SCT__x86_pmu_add_tramp", @function
  .size "__SCT__x86_pmu_add_tramp", . - "__SCT__x86_pmu_add_tramp"
  .popsection

What’s .byte 0xe9? It’s a relative JMP!

Specifically, it’s a JMP to the address calculated by .long _x86_pmu_add - (. + 4), which is ugly GAS syntax for “the address of _x86_pmu_add, minus (the current address, indicated by ., plus 4)”.

Why the arithmetic? It’s a relative JMP, so the destination needs to be made relative to the RIP immediately following the JMP itself. The JMP is 1 byte but we’re using .align 4, so . + 4 ensures that we subtract both the current location and fix up the JMP and padding8.

Correction: It turns out I overthought this: the JMP is 5 bytes (1 byte of opcode, 4 bytes of displacement), so . + 4 gets us to the RIP immediately after the entire instruction. It has nothing to do with alignment.
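
As a sanity check on that arithmetic, here’s the same displacement computed by hand. Both addresses below are hypothetical; only the formula matters.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical addresses: where the trampoline's 0xe9 byte lives, and
     * where _x86_pmu_add lives. */
    uint64_t jmp_addr = 0xffffffff81000000ULL;
    uint64_t target   = 0xffffffff81001234ULL;

    /* A rel32 JMP's displacement is relative to the end of the 5-byte
     * instruction. In the assembly above, `.` sits just past the opcode
     * byte, so `. + 4` is exactly jmp_addr + 5. */
    int32_t disp = (int32_t)(target - (jmp_addr + 5));
    printf("jmp rel32 displacement: 0x%x\n", (unsigned int)disp); /* 0x122f */

    return 0;
}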

So that’s just the setup for our static call. Let’s see how we actually install a function with static_call_update.

Step 2: static_call_update.

Here’s how static_call_update(x86_pmu_add, x86_pmu.add) breaks down:

({
  BUILD_BUG_ON(!__same_type(*(x86_pmu.add), __SCT__x86_pmu_add_tramp));
  __static_call_update(&__SCK__x86_pmu_add, &__SCT__x86_pmu_add_tramp, x86_pmu.add);
})

We’re just going to ignore BUILD_BUG_ON and __same_type, since they’re a compile-time sanity check. Check out this SO post for BUILD_BUG_ON; it’s fun9.
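
(If you’d rather not click through: one classic version of the trick, not the kernel’s current definition, makes a failing condition produce a negative array size, which the compiler has to reject. A toy version:)

#include <stdio.h>

/* Toy BUILD_BUG_ON: a true `cond` yields char[-1], a compile-time error. */
#define MY_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

int main(void) {
    MY_BUILD_BUG_ON(sizeof(int) > sizeof(long));  /* false: compiles fine */
    /* MY_BUILD_BUG_ON(1); */                     /* true: build fails */
    printf("still here\n");
    return 0;
}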

Let’s look at __static_call_update:

static inline
void __static_call_update(struct static_call_key *key, void *tramp, void *func) {
  cpus_read_lock();
  WRITE_ONCE(key->func, func);
  arch_static_call_transform(NULL, tramp, func, false);
  cpus_read_unlock();
}

The fun never ends!

First things first: we’re just going to skip cpus_read_lock and cpus_read_unlock, since they do exactly what they say on the tin.

Here’s WRITE_ONCE:

#define __WRITE_ONCE(x, val)            \
do {                                    \
  *(volatile typeof(x) *)&(x) = (val);  \
} while (0)

#define WRITE_ONCE(x, val)              \
do {                                    \
  compiletime_assert_rwonce_type(x);    \
  __WRITE_ONCE(x, val);                 \
} while (0)

So, it’s really just a volatile-qualified assignment for types that are atomic10:

*(void * volatile *)&((&__SCK__x86_pmu_add)->func) = x86_pmu.add;

(I think I got that right.)11
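
For a feel of what this accomplishes, here’s a tiny userspace sketch of the same typeof trick (not the kernel’s definition, and the handler names are invented): the volatile-qualified lvalue forces the compiler to emit this store exactly once, right here, instead of eliding or reordering it.

#include <stdio.h>

/* A userspace imitation of the kernel macro, using the same GCC/Clang
 * typeof extension. */
#define MY_WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))

static int stub_handler(int x) { (void)x; return 0; }
static int real_handler(int x) { return x + 1; }

/* A stand-in for static_call_key.func. */
static int (*handler)(int) = stub_handler;

int main(void) {
    MY_WRITE_ONCE(handler, real_handler);
    printf("%d\n", handler(41)); /* 42 */
    return 0;
}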

Now for arch_static_call_transform:

void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
{
  mutex_lock(&text_mutex);

  if (tramp) {
    __static_call_validate(tramp, true);
    __static_call_transform(tramp, __sc_insn(!func, true), func);
  }

  if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) && site) {
    __static_call_validate(site, tail);
    __static_call_transform(site, __sc_insn(!func, tail), func);
  }

  mutex_unlock(&text_mutex);
}

More things to ignore: the text_mutex lock/unlock pair (it just serializes kernel text patching), and the whole CONFIG_HAVE_STATIC_CALL_INLINE branch, since site is NULL for our out-of-line configuration.

A thing to not ignore: __sc_insn, which packs two booleans (is the new function NULL? is this a tail call?) into one of four instruction types: CALL, NOP, JMP, or RET. Keep it in mind; it’ll matter again in a little bit.
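
Condensed, __sc_insn looks roughly like this (paraphrased from the same series; I’ve kept its little truth table, and added stdbool.h so the snippet also stands alone):

#include <stdbool.h>

enum insn_type { CALL, NOP, JMP, RET };

static inline enum insn_type __sc_insn(bool null, bool tail)
{
  /* branchless: 2*tail + null indexes straight into the enum above
   *
   *  tail  null  insn
   *  ----+------+------
   *    0 |   0  | CALL
   *    0 |   1  | NOP
   *    1 |   0  | JMP
   *    1 |   1  | RET
   */
  return 2 * tail + null;
}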

That brings us to __static_call_validate which (you guessed it) validates our trampoline:

static void __static_call_validate(void *insn, bool tail)
{
  u8 opcode = *(u8 *)insn;

  if (tail) {
    if (opcode == JMP32_INSN_OPCODE ||
        opcode == RET_INSN_OPCODE)
      return;
  } else {
    if (opcode == CALL_INSN_OPCODE ||
        !memcmp(insn, ideal_nops[NOP_ATOMIC5], 5))
      return;
  }

  /*
   * If we ever trigger this, our text is corrupt, we'll probably not live long.
   */
  WARN_ONCE(1, "unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
}

Remember: tail is always true for us, since we’re not CONFIG_HAVE_STATIC_CALL_INLINE. So insn is always tramp which, you’ll recall, is __SCT__x86_pmu_add_tramp with its cute little .byte 0xe9.

Finally, __static_call_transform (condensed slightly) is where the real magic happens:

static void __ref __static_call_transform(void *insn, enum insn_type type, void *func)
{
  int size = CALL_INSN_SIZE;
  const void *code;

  switch (type) {
  case CALL:
    code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
    break;
  case NOP:
    code = ideal_nops[NOP_ATOMIC5];
    break;
  case JMP:
    code = text_gen_insn(JMP32_INSN_OPCODE, insn, func);
    break;
  case RET:
    code = text_gen_insn(RET_INSN_OPCODE, insn, func);
    size = RET_INSN_SIZE;
    break;
  }

  if (memcmp(insn, code, size) == 0)
    return;

  if (unlikely(system_state == SYSTEM_BOOTING))
    return text_poke_early(insn, code, size);

  text_poke_bp(insn, code, size, NULL);
}

Remember __sc_insn (I told you to!): tail is always true for us, so our only options are JMP (when the new function is non-NULL) and RET (when it’s NULL). This (again) will matter in a little bit, but the code for both cases is nearly identical so we can ignore the difference.

In both cases we call text_gen_insn, which is the world’s simplest JIT12 (condensed a bit):

union text_poke_insn {
  u8 text[POKE_MAX_OPCODE_SIZE];
  struct {
    u8 opcode;
    s32 disp;
  } __attribute__((packed));
};

static __always_inline
void *text_gen_insn(u8 opcode, const void *addr, const void *dest)
{
  static union text_poke_insn insn; /* per instance */
  int size = text_opcode_size(opcode);

  insn.opcode = opcode;

  if (size > 1) {
    insn.disp = (long)dest - (long)(addr + size);
    if (size == 2) {
      BUG_ON((insn.disp >> 31) != (insn.disp >> 7));
    }
  }

  return &insn.text;
}

So, our dest (i.e., x86_pmu.add) becomes our relative displacement, just like in our compile-time-generated thunk.

The rest is mechanics: we call either text_poke_early or text_poke_bp based on the system state, but the effect is the same: our trampoline (__SCT__x86_pmu_add_tramp) is effectively rewritten in kernel memory from:

jmp _x86_pmu_add

to:

jmp x86_pmu.add

…or any other function we desire, types permitting!
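
If you want to play with the same idea outside the kernel, here’s a self-contained userspace sketch. Everything in it (tramp, patch_tramp, add_one, add_two) is invented for illustration, and it cheats by making the trampoline page writable and executable with mprotect, where the kernel instead patches read-only text via text_poke_bp. The moving part is the same, though: a single jmp rel32 that gets re-pointed between targets.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static int add_one(int x) { return x + 1; }
static int add_two(int x) { return x + 2; }

/* Our stand-in for the __SCT__ trampoline: one page inside the program
 * image, so a rel32 displacement can always reach our targets. */
__attribute__((aligned(4096)))
static uint8_t tramp[4096];

/* Rewrite the trampoline to `jmp func`: 0xe9 followed by a displacement
 * relative to the end of the 5-byte instruction, the same arithmetic as
 * text_gen_insn above. */
static void patch_tramp(void *func) {
    int32_t disp = (int32_t)((intptr_t)func - ((intptr_t)tramp + 5));
    tramp[0] = 0xe9;
    memcpy(&tramp[1], &disp, sizeof(disp));
}

int main(void) {
    if (mprotect(tramp, sizeof(tramp), PROT_READ | PROT_WRITE | PROT_EXEC))
        return 1;

    int (*call)(int) = (int (*)(int))(void *)tramp;

    patch_tramp(add_one);
    printf("%d\n", call(41)); /* 42: the trampoline tail-jumps to add_one */

    patch_tramp(add_two);
    printf("%d\n", call(41)); /* 43: same call site, new target */

    return 0;
}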

Step 3: static_call_cond

Finally, let’s figure out how we actually call this damn thing.

Our actual invocation is static_call_cond(x86_pmu_add)(event) which, as the name implies, should do a check before dispatching to the underlying trampoline, right?

#define static_call(name) __static_call(name)
#define static_call_cond(name)  (void)__static_call(name)

LOL WUT

Nope! Recall that little __sc_insn table generator again: if our underlying call is NULL, we’ve already generated and JITted a little RET into our trampoline.

Let’s see what our __static_call actually expands to:

({
  __ADDRESSABLE(__SCK__x86_pmu_add);
  &__SCT__x86_pmu_add_tramp;
})

(__ADDRESSABLE is just another bit of compiler trickery to ensure that __SCK__x86_pmu_add sticks around in the symbol table).

That’s everything: our actual static_call boils down to a call to __SCT__x86_pmu_add_tramp, which is either a trampoline that bounces us to the real call or just a RET, completing the call.

In other words, the assembly looks like one of two cases:

call __SCT__x86_pmu_add_tramp
; in call + trampoline
jmp x86_pmu.add
; out of trampoline, in target (x86_pmu.add)
; target eventually returns, completing the call

or:

call __SCT__x86_pmu_add_tramp
; in call + "trampoline"
ret ; "trampoline" returns, completing the call

And who said that kernel code was hard to follow13?

Wrapup

This post ended up being way longer than I originally intended, in part because I went further into the why of static calls than I planned.

Some observations:


  1. Observe that there’s a missed optimization here: [rbp - 8] is just rdi, i.e. the first argument to munge. Clang optimizes this to a register-indirect call rdi with -O1, saving us a memory fetch. 

  2. Indirect calls are slower than direct calls by necessity, at least on x86: they require (slightly) more pre-call setup than a normal CALL. They can also result in less-than-optimal code at compile time, since they aren’t (easily) inlinable and are more challenging to model for code size and performance optimizations. 

  3. That’s right: our Spectre v2 mitigation is itself a ROP gadget. 

  4. Which, as you might have guessed, is its own source of speculative vulnerabilities! See here, here, and here

  5. Not really: this is a microarchitectural detail, and Intel’s Skylake family in particular can be tricked into using the indirect branch predictor by emptying the RSB. See “Empty RSB Mitigation on Skylake-generation” in Intel’s retpoline whitepaper

  6. Specifically, this is an “out-of-line” construction that plays well with relocatable code. It’s nearly identical to the one Google offers here, which probably isn’t a coincidence. 

  7. “Indirect Branch Restricted Speculation” 

  8. This took me about 10 minutes to figure out. I didn’t notice the .align 4 originally and was driving myself crazy trying to justify it. 

  9. __same_type, by comparison, is surprisingly boring: it’s just a wrapper for GCC’s __builtin_types_compatible_p

  10. Or maybe not: compiletime_assert_rwonce_type allows 64-bit reads and writes on 32-bit hosts, relying on “a strong prevailing wind”

  11. Why void *? Because x is static_call_key.func, which is void *. No trickery there. 

  12. I’m sure I’m going to annoy someone by calling it a JIT. Mentally replace “JIT” with “Self-modifying code” if you need to. 

  13. This was very hard for me to follow. 

  14. And maybe more? I didn’t read this part of the code closely. 

  15. This was the first time I’d heard of “store-tearing.” Turns out it’s the same as store-splitting, i.e. breaking a particular store op into smaller constituent ops (and thereby potentially erasing some serialization, visibility, or consistency guarantee). 

  16. Or maybe a slightly modified version, since poking the .text requires us to twiddle CR0 first. Maybe that’s the reason for the extra locking. Correction: Apparently text_poke_bp doesn’t actually twiddle cr0 at all! There’s some “ungodly magic” at play that I’ll have to look into. 


Discussions: Reddit