May 18, 2019 Tags: c, programming, reverse-engineering, security
This will be a brief post on using libopcodes to disassemble a raw (i.e., not in object
format) buffer of machine code. The examples will all be for AMD64, but libopcodes should
work with most bfd_arch_* and bfd_mach_*-specified machines.
For the unfamiliar, libopcodes is a part of GNU binutils. Coupled with libbfd for object
format parsing, it provides the core disassembly functionality used by tools like objdump.
It’s also very old (the header in my dis-asm.h credits Cygnus Support and dates to 1993)
and barely documented: outside of header file comments, the only real reference for it is
this random page from 2009 on someone’s self-hosted Wiki. opdis and xdisasm appear to use
libopcodes, but both also appear to be unmaintained.
Honestly, there aren’t very many good reasons to use libopcodes: Intel’s XED is almost
certainly more correct, Capstone has a pretty nice API (including decent Python bindings),
and Zydis boasts performance and no dependencies as project goals. LLVM also provides
disassembler functionality via the MC subproject; Ray Wang has a great blog post on using it.
However, sometimes you just need to do something a particular way. In this case, I needed
to use libopcodes. Since there were no other decent resources on it, I figured I’d share
what I’ve learned.
libopcodes uses many of libbfd’s constants, but can also be populated with a bfd * directly.
This post is not going to cover usage with a BFD handle, since libbfd doesn’t do anything
for us when disassembling raw bytes directly from an in-memory buffer.
Since we’re using libopcodes, you’ll need to have it installed. On Debian and Ubuntu,
apt install binutils-dev will fetch everything for you.
The syntax for linking to libopcodes is identical to every other library: just pass
-lopcodes to your linker.
All code samples below assume that dis-asm.h is included.
Almost all libopcodes functionality revolves around two types: disassembler_ftype and
struct disassemble_info.
disassembler_ftype is a typedef’d function pointer, which the user creates and then calls
to disassemble a single instruction. dis-asm.h provides some forward declarations for
predefined disassembler_ftypes as print_insn_*, but neglects to publicly expose the
internal AMD64 disassembler_ftype. As such, we’ll need to construct it ourselves.
disassemble_info provides the basic context for feeding data into the user’s
disassembler_ftype: the stream and callback(s) to use for disassembled output and error
reporting, as well as a callback for feeding data into the disassembler.
disassemble_info
Creating a new disassemble_info is a multi-step process:
/* disassemble_info has quite a few fields, and we won't be populating all of them.
 *
 * We zero-initialize here so that the libopcodes routines won't try to use
 * garbage data.
 */
struct disassemble_info disasm_info = {0};

/* init_disassemble_info takes three arguments:
 * 1. a pointer to our disassemble_info
 * 2. a void pointer to a "stream", which gets fed to...
 * 3. a function pointer to an fprintf-like function
 *    see fprintf_ftype in dis-asm.h for the exact prototype
 */
init_disassemble_info(&disasm_info, stdout, (fprintf_ftype) fprintf);
We’ll replace stdout and fprintf above with our own stream and function in the full example
at the end of the post, so that we can capture the disassembly instead of outputting it directly.
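
As a rough idea of what such a replacement can look like, here is a minimal sketch of an fprintf-shaped callback that appends to a growable heap buffer instead of writing to a FILE *. The text_sink type and the sink_printf and sink_reset names are made up for illustration and are not part of libopcodes; only the (stream, format, ...) shape is taken from the fprintf usage above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sink type: a heap buffer that grows as text is appended. */
typedef struct {
    char *data;  /* NUL-terminated text, or NULL before the first write */
    size_t len;  /* bytes used, not counting the terminator */
} text_sink;

/* Same shape as fprintf: (stream, format, ...). The "stream" is our sink.
 * Returns the number of characters appended, or -1 on failure. */
static int sink_printf(void *stream, const char *fmt, ...) {
    text_sink *sink = stream;
    va_list args, args_copy;

    assert(sink != NULL && fmt != NULL);

    va_start(args, fmt);
    va_copy(args_copy, args);
    /* First pass: measure how much space the formatted text needs. */
    int needed = vsnprintf(NULL, 0, fmt, args);
    va_end(args);

    if (needed < 0) {
        va_end(args_copy);
        return -1;
    }

    char *grown = realloc(sink->data, sink->len + (size_t)needed + 1);
    if (grown == NULL) {
        va_end(args_copy);
        return -1;
    }
    sink->data = grown;

    /* Second pass: format directly onto the end of the buffer. */
    vsnprintf(sink->data + sink->len, (size_t)needed + 1, fmt, args_copy);
    va_end(args_copy);

    sink->len += (size_t)needed;
    return needed;
}

/* Drop the accumulated text so the sink can be reused for the next instruction. */
static void sink_reset(text_sink *sink) {
    assert(sink != NULL);
    free(sink->data);
    sink->data = NULL;
    sink->len = 0;
}
```

Assuming the callback type lines up with the one in your dis-asm.h, a pointer to a zeroed text_sink would go where stdout is passed above and sink_printf where fprintf is, with a sink_reset call between instructions.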
Confusingly, init_disassemble_info is not enough to fully initialize our disassemble_info
structure. We also need to fill in some fields manually, and call a separate initialization
function:
/* Specify our disassembly target. These constants are also required when
 * we create the actual disassembly function later, so I'm not 100% sure
 * if/why they're necessary here.
 */
disasm_info.arch = bfd_arch_i386;
disasm_info.mach = bfd_mach_x86_64;

/* Optionally change the output format to Intel,
 * over the default of AT&T.
 */
disasm_info.disassembler_options = "intel-mnemonic";

/* Tell our disassembler how and where to get its raw bytes.
 * libopcodes provides the buffer_read_memory function;
 * the buffer and its length are our input.
 */
disasm_info.read_memory_func = buffer_read_memory;
disasm_info.buffer = input_buffer;
disasm_info.buffer_vma = 0;
disasm_info.buffer_length = input_buffer_length;
Observe that we set disasm_info.buffer_vma to 0; you can change that to whatever you want
your starting VMA to be. Just make sure that the addresses you hand to the disassembler
later start from that VMA rather than from 0.
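
To make the VMA bookkeeping concrete, here is a small standalone model of how an address gets mapped back to an offset in the buffer. It is a sketch of the idea, not libopcodes' actual buffer_read_memory: the code_window struct and the window_read function are hypothetical stand-ins for the buffer, buffer_vma and buffer_length fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the three disassemble_info fields involved. */
typedef struct {
    const uint8_t *buffer;  /* raw machine code */
    uint64_t buffer_vma;    /* address assigned to buffer[0] */
    size_t buffer_length;   /* number of bytes in buffer */
} code_window;

/* Copy `length` bytes starting at virtual address `addr` into `out`.
 * Returns 0 on success, or -1 if any requested byte lies outside the window. */
static int window_read(const code_window *w, uint64_t addr,
                       uint8_t *out, size_t length) {
    assert(w != NULL && out != NULL);

    /* Addresses below the window's base can't be mapped at all. */
    if (addr < w->buffer_vma)
        return -1;

    /* The offset into the buffer is the distance from the base address. */
    uint64_t offset = addr - w->buffer_vma;
    if (offset > w->buffer_length || length > w->buffer_length - offset)
        return -1;

    memcpy(out, w->buffer + offset, length);
    return 0;
}
```

With buffer_vma set to 0x401000, for instance, the first instruction lives at address 0x401000, so that is the first address to ask for; asking for address 0 would fall outside the window.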
Finally, we call one last function:
disassemble_init_for_target(&disasm_info);
Our disassemble_info is now ready for use.
disassembler_ftype
As mentioned above, disassembler_ftype is a typedef’d function pointer, one that we will
call after creation to disassemble our buffer instruction-by-instruction.
libopcodes provides a disassembler function that returns a suitable function:
disassembler_ftype disasm;

/* disassembler takes 4 arguments:
 * 1. The target architecture, same as disasm_info.arch
 * 2. The endianness (true = big, false = little)
 * 3. The target machine, same as disasm_info.mach
 * 4. An optional pointer to a BFD handle
 *
 * Returns NULL if libopcodes can't find a suitable disassembly function.
 */
disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);
disasm can now be called.
To disassemble a single instruction, we pass a program counter and our disassemble_info
to our disasm function. Internally, this (presumably) causes libopcodes to call its
read_memory_func with that program counter as the offset.
/* Our start pc. This should be adjusted per disasm_info.buffer_vma.
 */
bfd_vma pc = 0;

/* disasm() returns the number of bytes consumed during instruction decoding,
 * or a negative value if decoding failed.
 */
int insn_size = disasm(pc, &disasm_info);
After a successful call, the stream passed to init_disassemble_info will have received
a string representation of the disassembled instruction, possibly spread over several calls
to the fprintf-like function. Note that no newline is appended; it’s up to the programmer
to ensure that the output is human-formatted between calls.
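
The calling protocol can be sketched without libopcodes at all. The toy below assumes a decoder with the same general shape as disasm (take a pc, emit text, return the number of bytes consumed, with a non-positive value meaning it could not decode) and shows a driver loop that adds the newline itself. toy_ctx, toy_decode and decode_all are invented for this illustration, and the decoder only knows four one-byte opcodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context: the input bytes plus a fixed-size output buffer. */
typedef struct {
    const uint8_t *code;
    size_t code_len;
    char text[256];
    size_t text_len;
} toy_ctx;

/* Same general shape as a disassembler_ftype: pc in, bytes consumed out. */
typedef int (*toy_decoder)(uint64_t pc, toy_ctx *ctx);

/* Append a string to the output, truncating if the buffer is full. */
static void emit(toy_ctx *ctx, const char *s) {
    size_t room = sizeof ctx->text - ctx->text_len - 1;
    size_t n = strlen(s);
    if (n > room)
        n = room;
    memcpy(ctx->text + ctx->text_len, s, n);
    ctx->text_len += n;
    ctx->text[ctx->text_len] = '\0';
}

/* Toy decoder: understands four one-byte opcodes and nothing else. */
static int toy_decode(uint64_t pc, toy_ctx *ctx) {
    if (pc >= ctx->code_len)
        return -1;
    switch (ctx->code[pc]) {
    case 0x55: emit(ctx, "push %rbp"); return 1;
    case 0x5d: emit(ctx, "pop %rbp");  return 1;
    case 0x90: emit(ctx, "nop");       return 1;
    case 0xc3: emit(ctx, "retq");      return 1;
    default:   return -1;
    }
}

/* Decode until the input ends or the decoder gives up.
 * Returns the number of instructions decoded. */
static size_t decode_all(toy_decoder decode, toy_ctx *ctx) {
    uint64_t pc = 0;
    size_t count = 0;

    assert(decode != NULL && ctx != NULL);
    while (pc < ctx->code_len) {
        int consumed = decode(pc, ctx);
        if (consumed <= 0)
            break; /* never advance by a zero or negative size */
        emit(ctx, "\n"); /* the decoder does not terminate lines for us */
        pc += (uint64_t)consumed;
        count++;
    }
    return count;
}
```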
Here’s how we can disassemble a raw buffer into a string of assembly:
#define _GNU_SOURCE /* asprintf, vasprintf */

#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <dis-asm.h>

typedef struct {
  char *insn_buffer;
  bool reenter;
} stream_state;

/* This approach isn't very memory efficient or clear,
 * but it avoids external size/buffer tracking in this
 * example.
 */
static int dis_fprintf(void *stream, const char *fmt, ...) {
  stream_state *ss = (stream_state *)stream;

  va_list arg;
  va_start(arg, fmt);
  if (!ss->reenter) {
    /* First fragment of this instruction: start a fresh buffer. */
    vasprintf(&ss->insn_buffer, fmt, arg);
    ss->reenter = true;
  } else {
    /* Later fragments: append to what we already have. */
    char *tmp;
    vasprintf(&tmp, fmt, arg);

    char *tmp2;
    asprintf(&tmp2, "%s%s", ss->insn_buffer, tmp);
    free(ss->insn_buffer);
    free(tmp);
    ss->insn_buffer = tmp2;
  }
  va_end(arg);

  return 0;
}

char *disassemble_raw(uint8_t *input_buffer, size_t input_buffer_size) {
  char *disassembled = NULL;
  stream_state ss = {0};

  disassemble_info disasm_info = {0};
  init_disassemble_info(&disasm_info, &ss, dis_fprintf);
  disasm_info.arch = bfd_arch_i386;
  disasm_info.mach = bfd_mach_x86_64;
  disasm_info.read_memory_func = buffer_read_memory;
  disasm_info.buffer = input_buffer;
  disasm_info.buffer_vma = 0;
  disasm_info.buffer_length = input_buffer_size;
  disassemble_init_for_target(&disasm_info);

  disassembler_ftype disasm;
  disasm = disassembler(bfd_arch_i386, false, bfd_mach_x86_64, NULL);
  if (disasm == NULL) {
    return NULL;
  }

  bfd_vma pc = 0;
  while (pc < input_buffer_size) {
    /* disasm() returns an int, which is negative on failure. */
    int insn_size = disasm(pc, &disasm_info);
    if (insn_size <= 0 || ss.insn_buffer == NULL) {
      /* Decoding failed or produced no text; stop instead of looping forever. */
      break;
    }
    pc += insn_size;

    if (disassembled == NULL) {
      asprintf(&disassembled, "%s", ss.insn_buffer);
    } else {
      char *tmp;
      asprintf(&tmp, "%s\n%s", disassembled, ss.insn_buffer);
      free(disassembled);
      disassembled = tmp;
    }

    /* Reset the stream state after each instruction decode.
     */
    free(ss.insn_buffer);
    ss.insn_buffer = NULL;
    ss.reenter = false;
  }

  /* Release any partial text left behind by an early exit from the loop. */
  free(ss.insn_buffer);

  return disassembled;
}

int main(void) {
  uint8_t input_buffer[] = {
      0x55,             /* push rbp */
      0x48, 0x89, 0xe5, /* mov rbp, rsp */
      0x89, 0x7d, 0xfc, /* mov DWORD PTR [rbp-0x4], edi */
      0x8b, 0x45, 0xfc, /* mov eax, DWORD PTR [rbp-0x4] */
      0x0f, 0xaf, 0xc0, /* imul eax, eax */
      0x5d,             /* pop rbp */
      0xc3,             /* ret */
  };
  size_t input_buffer_size = sizeof(input_buffer);

  char *disassembled = disassemble_raw(input_buffer, input_buffer_size);
  if (disassembled == NULL) {
    return 1;
  }
  puts(disassembled);
  free(disassembled);

  return 0;
}
Which, when compiled and run:
clang test.c -lopcodes -o test
./test
Should produce:
push %rbp
mov %rsp,%rbp
mov %edi,-0x4(%rbp)
mov -0x4(%rbp),%eax
imul %eax,%eax
pop %rbp
retq
The above covers the very basics of using libopcodes, but there’s a lot of other stuff
you can do via disassemble_info:
For some targets (not x86, unfortunately), the decoder will set insn_info_valid. If set, the branch_delay_insns, data_size, insn_type, target, and target2 fields can all be accessed. See the header for more information about each.
Memory error reporting can be controlled via memory_error_func. libopcodes provides perror_memory as a default choice.
Address printing can be controlled via print_address_func. libopcodes provides generic_print_address as a default choice.
Symbol resolution can be controlled via symbol_at_address_func and symbol_is_valid. libopcodes provides generic_symbol_at_address and generic_symbol_is_valid as default choices.
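
As an illustration of the kind of work a custom address printer might do, the following standalone sketch formats an address relative to the nearest preceding symbol. The sym_entry table and the format_address helper are hypothetical; hooking the result into print_address_func is left out, since that depends on the callback prototype in your copy of dis-asm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical symbol table entry: a start address and a name. */
typedef struct {
    uint64_t addr;
    const char *name;
} sym_entry;

/* Format `addr` into `out` as "0xADDR <name+0xOFF>" using the closest symbol
 * at or below it, or as a bare hex address when no symbol qualifies.
 * The table does not need to be sorted. Returns snprintf's result. */
static int format_address(const sym_entry *syms, size_t nsyms, uint64_t addr,
                          char *out, size_t out_size) {
    const sym_entry *best = NULL;

    assert(out != NULL && out_size > 0);

    /* Linear scan for the highest symbol address that is still <= addr. */
    for (size_t i = 0; i < nsyms; i++) {
        if (syms[i].addr <= addr && (best == NULL || syms[i].addr > best->addr))
            best = &syms[i];
    }

    if (best == NULL)
        return snprintf(out, out_size, "0x%llx", (unsigned long long)addr);

    if (best->addr == addr)
        return snprintf(out, out_size, "0x%llx <%s>",
                        (unsigned long long)addr, best->name);

    return snprintf(out, out_size, "0x%llx <%s+0x%llx>",
                    (unsigned long long)addr, best->name,
                    (unsigned long long)(addr - best->addr));
}
```

A linear scan keeps the sketch short; a real symbol table would more likely be sorted once and searched with a binary search.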