ENOSUCHBLOG

Programming, philosophy, pedaling.


Basic disassembly with libopcodes

May 18, 2019

Tags: programming, reverse-engineering, security

This will be a brief post on using libopcodes to disassemble a raw (i.e., not in object format) buffer of machine code. The examples will all be for AMD64, but libopcodes should work with most bfd_arch_* and bfd_mach_*-specified machines.

Some background

For the unfamiliar, libopcodes is a part of the GNU binutils. Coupled with libbfd for object format parsing, it provides the core disassembly functionality used by tools like objdump.

It's also very old (the header in my dis-asm.h credits Cygnus Support and dates to 1993) and barely documented: outside of header file comments, the only real reference for it is this random page from 2009 on someone's self-hosted Wiki. opdis and xdisasm appear to use libopcodes, but both also appear to be unmaintained.

Why?

Real programmers use libopcodes.

Honestly, there aren't very many good reasons to use libopcodes: Intel's XED is almost certainly more correct, Capstone has a pretty nice API (including decent Python bindings), and Zydis boasts performance and no dependencies as project goals. LLVM also provides disassembler functionality via the MC subproject; Ray Wang has a great blog post on using it.

However, sometimes you just need to do something a particular way. In this case, I needed to use libopcodes. Since there were no other decent resources on it, I figured I'd share what I've learned.

To BFD or not to BFD

libopcodes uses many of libbfd's constants, but can also be populated with a bfd * directly.

This post is not going to cover usage with a BFD handle, since libbfd doesn't do anything for us when disassembling raw bytes directly from an in-memory buffer.

Getting started

Seeing how we're using libopcodes, you'll need to have it installed.

On Debian and Ubuntu, apt install binutils-dev will fetch everything for you.

The syntax for linking to libopcodes is identical to every other library: just pass -lopcodes to your linker.

All code samples below assume that dis-asm.h is included.

Creating a disassembler

Almost all libopcodes functionality revolves around two types: disassembler_ftype and struct disassemble_info.

disassembler_ftype is a typedef'd function pointer, which the user creates and then calls to disassemble a single instruction. dis-asm.h provides some forward declarations for predefined disassembler_ftypes as print_insn_*, but neglects to publicly expose the internal AMD64 disassembler_ftype. As such, we'll need to construct it ourselves.

disassemble_info provides the basic context for feeding data into the user's disassembler_ftype: the stream and callback(s) to use for disassembled output and error reporting, as well as a callback for feeding data into the disassembler.

disassemble_info

Creating a new disassemble_info is a multi-step process:

/* disassemble_info has quite a few fields, and we won't be populating all of them.
 *
 * We empty-initialize here so that the libopcodes routines won't try to use
 * garbage data.
 */
struct disassemble_info disasm_info = {};


/* init_disassemble_info takes three arguments:
 *  1. a pointer to our disassemble_info
 *  2. a void pointer to a "stream", which gets fed to...
 *  3. a function pointer to an fprintf-like function
 *     see fprintf_ftype in dis-asm.h for the exact prototype
 */
init_disassemble_info(&disasm_info, stdout, (fprintf_ftype) fprintf);

We'll replace stdout and fprintf above with our own stream and function in the full example at the end of the post, so that we can capture the disassembly instead of outputting it directly.
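To give a feel for what "our own stream and function" can look like, here is a minimal sketch of an fprintf-shaped callback that appends into a growable string. The text_sink type and the sink_printf and sink_reset functions are hypothetical names of mine, not part of libopcodes; the only thing borrowed from the library is the (void *stream, const char *fmt, ...) shape of the callback.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text sink; not part of libopcodes. */
typedef struct {
  char *data; /* NUL-terminated accumulated text, or NULL when empty */
  size_t len; /* number of characters stored, excluding the NUL */
} text_sink;

/* Same shape as an fprintf-style callback: (void *stream, const char *fmt, ...).
 * Appends the formatted text to the sink and returns the number of
 * characters added, or -1 on failure.
 */
static int sink_printf(void *stream, const char *fmt, ...) {
  text_sink *sink = (text_sink *)stream;
  assert(sink != NULL);

  /* First pass: measure how much space the formatted text needs. */
  va_list args;
  va_start(args, fmt);
  int needed = vsnprintf(NULL, 0, fmt, args);
  va_end(args);
  if (needed < 0) {
    return -1;
  }

  char *grown = realloc(sink->data, sink->len + (size_t)needed + 1);
  if (grown == NULL) {
    return -1;
  }
  sink->data = grown;

  /* Second pass: format directly onto the end of the existing text. */
  va_start(args, fmt);
  vsnprintf(sink->data + sink->len, (size_t)needed + 1, fmt, args);
  va_end(args);

  sink->len += (size_t)needed;
  return needed;
}

/* Empty the sink so that it can be reused for the next instruction. */
static void sink_reset(text_sink *sink) {
  free(sink->data);
  sink->data = NULL;
  sink->len = 0;
}
```

A zero-initialized text_sink would be passed as the stream argument and sink_printf as the function argument; after each decoded instruction, sink.data holds that instruction's text until sink_reset is called.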

Confusingly, init_disassemble_info is not enough to fully initialize our disassemble_info structure. We also need to fill in some fields manually, and call a separate initialization function:

/* Specify our disassembly target. These constants are also required when
 * we create the actual disassembly function later, so I'm not 100% sure
 * if/why they're necessary here.
 */
disasm_info.arch = bfd_arch_i386;
disasm_info.mach = bfd_mach_x86_64;

/* Optionally change the output format to Intel,
 * over the default of AT&T.
 */
disasm_info.disassembler_options = "intel-mnemonic";

/* Tell our disassembler how and where to get its raw bytes.
 * libopcodes provides the buffer_read_memory function;
 * the buffer and its length are our input.
 */
disasm_info.read_memory_func = buffer_read_memory;
disasm_info.buffer = input_buffer;
disasm_info.buffer_vma = 0;
disasm_info.buffer_length = input_buffer_length;

Observe that we set disasm_info.buffer_vma to 0. You can change that to whatever you want your starting VMA to be; just make sure that the program counters you later pass to the disassembler are offset by the same amount.
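The relationship between the VMA and the buffer is easier to see in code. The sketch below is my own model of the bookkeeping, not the libopcodes source: code_view and view_read are hypothetical names, and I'm assuming that a buffer-backed reader translates an address into a buffer offset by subtracting the starting VMA and rejects anything out of range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the buffer-related fields of disassemble_info. */
typedef struct {
  const uint8_t *buffer; /* raw machine code */
  uint64_t buffer_vma;   /* address that buffer[0] is treated as living at */
  size_t buffer_length;  /* number of bytes in buffer */
} code_view;

/* Copy `length` bytes starting at virtual address `addr` into `out`.
 * Returns 0 on success, or -1 if any part of the range is outside the buffer.
 */
static int view_read(const code_view *view, uint64_t addr, uint8_t *out,
                     size_t length) {
  if (addr < view->buffer_vma) {
    return -1; /* address is below the start of the buffer */
  }

  /* The address-to-offset translation: subtract the starting VMA. */
  uint64_t offset = addr - view->buffer_vma;
  if (offset > view->buffer_length || length > view->buffer_length - offset) {
    return -1; /* range runs past the end of the buffer */
  }

  memcpy(out, view->buffer + offset, length);
  return 0;
}
```

With a starting VMA of 0x401000, for example, a read at 0x401001 lands on buffer[1], while a read at address 0 fails. That is why the first program counter has to be the VMA itself rather than 0.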

Finally, we call one last function:

disassemble_init_for_target(&disasm_info);

Our disassemble_info is now ready for use.

disassembler_ftype

As mentioned above, disassembler_ftype is a typedef'd function pointer, one that we will call post-creation to disassemble our buffer instruction-by-instruction.

libopcodes provides a disassembler function that returns a suitable function:

disassembler_ftype disasm;

/* disassembler takes 4 arguments:
 *  1. The target architecture, same as disasm_info.arch
 *  2. The endianness (true = big, false = little)
 *  3. The target machine, same as disasm_info.mach
 *  4. An optional pointer to a BFD handle
 *
 * Returns NULL if libopcodes can't find a suitable disassembly function.
 */
disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);

disasm can now be called.

Disassembling an instruction

To disassemble a single instruction, we pass a program counter and our disassemble_info to our disasm function. Internally, this (presumably) causes libopcodes to call its read_memory_func with that program counter as the offset.

/* Our start pc. This should be adjusted per disasm_info.buffer_vma.
 */
bfd_vma pc = 0;

/* disasm() returns the number of bytes consumed during instruction decoding
 * as an int; a negative value indicates that decoding failed.
 */
int insn_size = disasm(pc, &disasm_info);

After a successful call, a string representation of the disassembled instruction will have been written to our stream (stdout, in the snippets above) via the fprintf-like function that we passed to init_disassemble_info. Note that no newline is appended; it's up to the programmer to separate the output of successive calls.
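Since a disassembler_ftype returns an int, a decoding loop should check the result before adding it to the program counter. Here's a toy model of that loop, using a made-up decoder in place of libopcodes: toy_info, toy_disasm_fn, toy_disasm and count_instructions are all hypothetical, and the toy only recognizes three single-byte AMD64 opcodes. The point is the shape of the loop, not the decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context, standing in for disassemble_info. */
typedef struct {
  const uint8_t *buffer;
  size_t buffer_length;
  uint64_t buffer_vma;
} toy_info;

/* Same shape as disassembler_ftype: an address and a context go in,
 * the instruction size (or a negative value on failure) comes out.
 */
typedef int (*toy_disasm_fn)(uint64_t pc, toy_info *info);

/* Toy decoder: knows push rbp (0x55), pop rbp (0x5d) and ret (0xc3),
 * and fails on everything else or on out-of-range addresses.
 */
static int toy_disasm(uint64_t pc, toy_info *info) {
  if (pc < info->buffer_vma) {
    return -1;
  }
  uint64_t offset = pc - info->buffer_vma;
  if (offset >= info->buffer_length) {
    return -1;
  }
  switch (info->buffer[offset]) {
  case 0x55:
  case 0x5d:
  case 0xc3:
    return 1;
  default:
    return -1;
  }
}

/* Walk the whole buffer, starting at its VMA. Returns the number of
 * instructions decoded, or -1 as soon as the decoder reports a failure.
 */
static int count_instructions(toy_disasm_fn disasm, toy_info *info) {
  uint64_t pc = info->buffer_vma;
  uint64_t end = info->buffer_vma + info->buffer_length;
  int count = 0;

  while (pc < end) {
    int insn_size = disasm(pc, info);
    if (insn_size <= 0) {
      return -1; /* never advance by a non-positive size */
    }
    pc += (uint64_t)insn_size;
    count++;
  }
  return count;
}
```

Storing the result in an unsigned type would turn a -1 into a huge positive size, so the check has to happen on the signed value.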

Putting it all together

Here's how we can disassemble a raw buffer into a string of assembly:

#define _GNU_SOURCE /* asprintf, vasprintf */

#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <dis-asm.h>

typedef struct {
  char *insn_buffer;
  bool reenter;
} stream_state;

/* This approach isn't very memory efficient or clear,
 * but it avoids external size/buffer tracking in this
 * example.
 */
static int dis_fprintf(void *stream, const char *fmt, ...) {
  stream_state *ss = (stream_state *)stream;

  va_list arg;
  va_start(arg, fmt);
  if (!ss->reenter) {
    vasprintf(&ss->insn_buffer, fmt, arg);
    ss->reenter = true;
  } else {
    char *tmp;
    vasprintf(&tmp, fmt, arg);

    char *tmp2;
    asprintf(&tmp2, "%s%s", ss->insn_buffer, tmp);
    free(ss->insn_buffer);
    free(tmp);
    ss->insn_buffer = tmp2;
  }
  va_end(arg);

  return 0;
}

char *disassemble_raw(uint8_t *input_buffer, size_t input_buffer_size) {
  char *disassembled = NULL;
  stream_state ss = {};

  disassemble_info disasm_info = {};
  init_disassemble_info(&disasm_info, &ss, dis_fprintf);
  disasm_info.arch = bfd_arch_i386;
  disasm_info.mach = bfd_mach_x86_64;
  disasm_info.read_memory_func = buffer_read_memory;
  disasm_info.buffer = input_buffer;
  disasm_info.buffer_vma = 0;
  disasm_info.buffer_length = input_buffer_size;
  disassemble_init_for_target(&disasm_info);

  disassembler_ftype disasm;
  disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);

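  bfd_vma pc = 0;
  while (pc < input_buffer_size) {
    /* disasm() returns an int; bail out on a non-positive size so that
     * a decoding failure can't stall or wrap the loop.
     */
    int insn_size = disasm(pc, &disasm_info);
    if (insn_size <= 0) {
      break;
    }
    pc += insn_size;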

    if (disassembled == NULL) {
      asprintf(&disassembled, "%s", ss.insn_buffer);
    } else {
      char *tmp;
      asprintf(&tmp, "%s\n%s", disassembled, ss.insn_buffer);
      free(disassembled);
      disassembled = tmp;
    }

    /* Reset the stream state after each instruction decode.
     */
    free(ss.insn_buffer);
    ss.reenter = false;
  }

  return disassembled;
}

int main(int argc, char const *argv[]) {
  uint8_t input_buffer[] = {
      0x55,             /* push rbp */
      0x48, 0x89, 0xe5, /* mov rbp, rsp */
      0x89, 0x7d, 0xfc, /* mov DWORD PTR [rbp-0x4], edi */
      0x8b, 0x45, 0xfc, /* mov eax, DWORD PTR [rbp-0x4] */
      0x0f, 0xaf, 0xc0, /* imul eax, eax */
      0x5d,             /* pop rbp */
      0xc3,             /* ret */
  };
  size_t input_buffer_size = sizeof(input_buffer);

  char *disassembled = disassemble_raw(input_buffer, input_buffer_size);
  puts(disassembled);
  free(disassembled);

  return 0;
}

Which, when compiled and run:

clang test.c -lopcodes -o test
./test

Should produce:

push   %rbp
mov    %rsp,%rbp
mov    %edi,-0x4(%rbp)
mov    -0x4(%rbp),%eax
imul   %eax,%eax
pop    %rbp
retq

Other stuff

The above covers the very basics of using libopcodes, but there's a lot of other stuff you can do via disassemble_info: