ENOSUCHBLOG

Programming, philosophy, pedaling.


Locating a compiler bug with git bisection

May 7, 2020

Tags: programming, devblog

This is just a brief post on the steps I took to confirm, locate, and backport the fix for a compiler bug that was plaguing a (professional) project that I work on. It’s a straightforward application of git bisect, with a few observations:

  1. That even compilers have bugs (and that, depending on the context, the compiler might be the most buggy component)
  2. That git bisect is incredibly flexible and can work on extremely complex programs
  3. That backporting a fix isn’t always difficult or fraught

Background: the bug

I work on LLVM-based program analysis tools professionally, and currently work on a project that makes heavy use of a modified build of LLVM with some additional features.

Many of the components that I work on assume that LLVM faithfully emits debug intrinsics (i.e. call void llvm.dbg.whatever) and location expressions (i.e., DILocation, DebugLoc) for every source-level variable1.

This is normally a reasonable assumption, at least for unoptimized or less-optimized compilations, and allows us to make further assumptions about the information we find in DWARF after lowering.

Here’s a source file (minimized from the original) that, when compiled with -O1, violated the above assumption:

#include <stdio.h>
#include <stdlib.h>

struct foo {
  char buffer[32];
};

int bar(int index) {
  struct foo stuff = {};
  if (index >= sizeof(stuff.buffer)) {
      return -1;
  }
  return (int)stuff.buffer[index];
}

int main(int argc, char **argv) {
  return bar(atoi(argv[1]));
}

…and here’s the corresponding LLVM IR for bar:

; Function Attrs: nounwind readnone uwtable
define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !17, metadata !DIExpression()), !dbg !25
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !26
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !26
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !27
  %4 = icmp ugt i32 %0, 31, !dbg !28
  br i1 %4, label %10, label %5, !dbg !30

; <label>:5:                                      ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !31
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !32
  %8 = load i8, i8* %7, align 1, !dbg !32, !tbaa !33
  %9 = sext i8 %8 to i32, !dbg !36
  br label %10, !dbg !37

; <label>:10:                                     ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !38
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !39
  ret i32 %11, !dbg !39
}

Some things immediately stand out:

All told, this is extremely suspicious: we’re missing an intrinsic, a debug node is incorrectly marked as “retained”, and there’s no connection between our (orphaned) DILocalVariable and its DILocation.

Unsurprisingly, this variable/location disconnect results in broken DWARF information for stuff, specifically in the form of a missing DW_AT_location (which we would expect, based on the alloca, to reflect a stack location):

0x00000043:   DW_TAG_subprogram
                DW_AT_low_pc    (0x00000000004004d0)
                DW_AT_high_pc   (0x00000000004004f0)
                DW_AT_frame_base        (DW_OP_reg7 RSP)
                DW_AT_name      ("bar")
                DW_AT_decl_file ("/tmp/junk.c")
                DW_AT_decl_line (8)
                DW_AT_prototyped        (true)
                DW_AT_type      (0x0000002a "int")
                DW_AT_external  (true)

0x0000005c:     DW_TAG_formal_parameter
                  DW_AT_location        (DW_OP_reg5 RDI)
                  DW_AT_name    ("index")
                  DW_AT_decl_file       ("/tmp/junk.c")
                  DW_AT_decl_line       (8)
                  DW_AT_type    (0x0000002a "int")

0x00000069:     DW_TAG_variable
                  DW_AT_name    ("stuff")
                  DW_AT_decl_file       ("/tmp/junk.c")
                  DW_AT_decl_line       (9)
                  DW_AT_type    (0x000000ad "foo")

First steps: reproducing

Since we’re dealing with a modified LLVM 7 here, the very first step is to make sure that we’re not the cause of our own problem.

A quick trip to Godbolt settles that, showing the same missing intrinsic on a reference build of LLVM/Clang 7:

define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !27
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !39
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !40
  ret i32 %11, !dbg !40
}

(Permalink)

Finding the fix

At this point, the panic settles in: there’s a bug in the compiler that isn’t (directly) our fault, and we need to fix it in order to continue.

Compilers are complicated beasts, and we’d rather not find and fix bugs ourselves if we don’t have to. So, we return to Godbolt and see whether someone else has done the hard work for us sometime between LLVM 7 (released 09/2019) and the latest trunk build.

Indeed! Compiling with a trunk build (permalink) shows two debug intrinsics, as expected, with correctly attached location metadata on each call:

define dso_local i32 @bar(i32 %0) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #7, !dbg !27
  call void @llvm.dbg.declare(metadata %struct.foo* %2, metadata !19, metadata !DIExpression()), !dbg !28
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 dereferenceable(32) %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

5:                                                ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

10:                                               ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !26
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #7, !dbg !39
  ret i32 %11, !dbg !39
}

So let’s walk backwards.

LLVM 10 also emits two intrinsics, but LLVM 9 is identical to our buggy LLVM 7 IR:

define dso_local i32 @bar(i32) local_unnamed_addr #0 !dbg !13 {
  %2 = alloca %struct.foo, align 1
  call void @llvm.dbg.value(metadata i32 %0, metadata !18, metadata !DIExpression()), !dbg !26
  %3 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 0, !dbg !27
  call void @llvm.lifetime.start.p0i8(i64 32, i8* nonnull %3) #6, !dbg !27
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %3, i8 0, i64 32, i1 false), !dbg !28
  %4 = icmp ugt i32 %0, 31, !dbg !29
  br i1 %4, label %10, label %5, !dbg !31

5:                                                ; preds = %1
  %6 = sext i32 %0 to i64, !dbg !32
  %7 = getelementptr inbounds %struct.foo, %struct.foo* %2, i64 0, i32 0, i64 %6, !dbg !33
  %8 = load i8, i8* %7, align 1, !dbg !33, !tbaa !34
  %9 = sext i8 %8 to i32, !dbg !37
  br label %10, !dbg !38

10:                                               ; preds = %1, %5
  %11 = phi i32 [ %9, %5 ], [ -1, %1 ], !dbg !26
  call void @llvm.lifetime.end.p0i8(i64 32, i8* nonnull %3) #6, !dbg !39
  ret i32 %11, !dbg !39
}

So we know that someone fixed this bug, somewhere between the LLVM 9 (09/2019) and LLVM 10 (03/2020) releases. That’s reassuring, but doesn’t get us any closer to actually patching our build. More work is needed.

Finding the fix, for real

Knowing that our desired bugfix is somewhere in the changeset between LLVM 9 and LLVM 10 leaves us with a few naive strategies:

Neither of these is particularly appealing: the unstructured nature of commit messages makes it difficult to ensure that we aren’t over- or under-constraining our search, and individually compiling all 10,000+ commits between the two releases would take us weeks.

git bisect to the rescue

git bisect, unsurprisingly, is the “right” answer to exactly this problem. From the manpage:

This command uses a binary search algorithm to find which commit in your project’s history introduced a bug. You use it by first telling it a “bad” commit that is known to contain the bug, and a “good” commit that is known to be before the bug was introduced. Then git bisect picks a commit between those two endpoints and asks you whether the selected commit is “good” or “bad”. It continues narrowing down the range until it finds the exact commit that introduced the change.

We’re going forwards in history and looking for a bugfix rather than a regression, so the good and bad identifiers can be a little confusing. To avoid confusing ourselves too much, we use git bisect’s alternative old and new terms instead.

We also, of course, need to actually run something on each bisected revision to determine whether it’s old or new. The LLVM project build is, blessedly, straightforward:

$ mkdir build && cd build
$ cmake -G Ninja -DLLVM_TARGETS_TO_BUILD="X86" -DLLVM_ENABLE_PROJECTS="clang" ../llvm
$ cmake --build .
$ ./bin/clang --version
$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
# the -n flag filters DWARF info on DW_AT_name, which in our case is the "stuff" variable
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'

(We use Ninja and restrict our backends to x86 for faster compilations; there are other definitions we could add to prune the build but these were sufficient for my purposes.)

Finally, we actually perform the bisection. From the repo root, we begin with:

$ git checkout llvmorg-9.0.0
$ git bisect start
# we know that 9.0 has the bug and 10.0 doesn't, so we can tag them immediately
$ git bisect old
$ git bisect new llvmorg-10.0.0

then, perform our actual test steps on the current revision:

$ cd build
# leverage the CMake cache to speed things up, if it's still valid
$ cmake --build .
$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'
$ old_or_new=$([[ $? -eq 0 ]] && echo new || echo old)
$ cd ..
$ git bisect ${old_or_new}

…and repeat until no commits are left:

cbf1f3b771c8: [Debuginfo][SROA] Need to handle dbg.value in SROA pass.

diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index a1d1b3ec84f0..7242eb1d3de2 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -1378,7 +1378,12 @@ void llvm::ConvertDebugDeclareToDebugValue(DbgVariableIntrinsic *DII,
 /// Determine whether this alloca is either a VLA or an array.
 static bool isArray(AllocaInst *AI) {
   return AI->isArrayAllocation() ||
-    AI->getType()->getElementType()->isArrayTy();
+         (AI->getAllocatedType() && AI->getAllocatedType()->isArrayTy());
+}
+
+/// Determine whether this alloca is a structure.
+static bool isStructure(AllocaInst *AI) {
+  return AI->getAllocatedType() && AI->getAllocatedType()->isStructTy();
 }

 /// LowerDbgDeclare - Lowers llvm.dbg.declare intrinsics into appropriate set
@@ -1403,7 +1408,7 @@ bool llvm::LowerDbgDeclare(Function &F) {
     // stored on the stack, while the dbg.declare can only describe
     // the stack slot (and at a lexical-scope granularity). Later
     // passes will attempt to elide the stack slot.
-    if (!AI || isArray(AI))
+    if (!AI || isArray(AI) || isStructure(AI))
       continue;

     // A volatile load/store means that the alloca can't be elided anyway

To explain: the LLVM optimizer includes a pass named SRoA, short for “Scalar Replacement of Aggregates”.

SRoA is run by default on -O1 and higher, and attempts to replace aggregate values (like arrays and structures) with scalar SSA values2.

An effective SRoA run requires that the llvm.dbg.declare for a value be lowered to one or more llvm.dbg.values, as the SSA value may or may not reside at a memory location (it might be in a register, like index).

The bug occurs because of this lowering behavior: the relevant helper (LowerDbgDeclare) fails to check for one of the valid aggregate types (structures) and incorrectly elides the intrinsic without inserting any replacements3. The fix is fundamentally a single line; the changes to isArray and addition of isStructure only serve to make it more readable.

Five minutes later, we have a patched LLVM 7 with the correct behavior:

$ ./bin/clang --version
clang version 7.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /store/william/llvm-project/build/./bin

$ ./bin/clang -O1 -g /tmp/testcase.c -o /tmp/testcase
$ ./bin/llvm-dwarfdump -n stuff /tmp/testcase | grep 'DW_AT_location'
/tmp/testcase:  file format elf64-x86-64

0x00000069: DW_TAG_variable
              DW_AT_location    (DW_OP_fbreg -120)
              DW_AT_name        ("stuff")
              DW_AT_decl_file   ("/tmp/testcase.c")
              DW_AT_decl_line   (9)
              DW_AT_type        (0x00000078 "foo")
  1. As well as common “artificial” variables, like implicit *this in C++. 

  2. This is a little jargon heavy. John Regehr summarizes SRoA’s behavior in this post, but again at a high level: SRoA might take a stack allocation like int32_t x[2] and replace it with two independent SSA registers representing x[0] and x[1]

  3. Said replacements would be incorrect anyways in this case, since our aggregate can’t be SRoA’d.