ENOSUCHBLOG

Programming, philosophy, pedaling.


Playing with Apple's weird compression formats

Jun 1, 2021     Tags: c, devblog, programming, ruby, ruby-macho    

This post is at least a year old.

Instead of using my Memorial Day to relax, I ended up going down a little bit of a rabbit hole while trying to add some features to ruby-macho.

The end result is twofold: I know more about Apple’s custom compression formats than I ever wanted to, and I’ve written a new Ruby library (my first in a while!).

I’ll go through the journey below.

TL;DR for those who don’t want to read the whole thing: the new library is a Ruby binding for Apple’s reference implementation of the LZFSE and LZVN compression schemes.

Apple uses these schemes throughout their software, including within a variant of the Mach-O format; with these bindings, I’ll be able to progressively enhance ruby-macho’s support for prelinked kernels. I’ll also be able to explore some other bits of Apple’s ecosystem that make use of LZVN, like HFS+ and APFS.

Background

Mach-O is an object format[1], best known for being the format used by macOS and iOS for individual object files, executables, and various forms of shared objects[2].

As far as object formats go, it’s a relatively reasonable one: it uses TLV-encoded “load commands” for most of its metadata, and keeps unnecessary indirection to a minimum.

The vast majority of my familiarity with Mach-O dates back to 2015 and 2016, when I was a GSoC student for the Homebrew project. I used that time to write ruby-macho, the Mach-O parser that Homebrew uses to perform various binary relocations and fixups when installing a package into a user’s Homebrew prefix.

Mach-O is a relatively stable format, at least in terms of documented changes: the past few years have seen a handful of new load commands, additional CPU types and subtypes[3], and further integration of Apple’s codesigning scheme into the format itself. These changes have in turn required routine, mostly small changes to ruby-macho.

In terms of undocumented changes, Mach-O is much less stable: Apple selectively re-uses and modifies the format in internal contexts[4], like the prelinked kernel (or “kernelcache”) that Apple builds out of the kernel image and all loadable kernel extensions to accelerate the boot process. Being able to parse these unusual-looking Mach-Os would be (1) cool, and (2) useful for ascertaining whether Apple is sneaking any additional features into the Mach-O format without updating their public sources first.

Compressed Mach-Os

As mentioned above, the “Mach-O” that macOS and iOS use for their prelinked kernel is a little funky. One particular source of funkiness is its layout: the prelinked kernel appears as a universal (“fat”) Mach-O with a single architectural slice. That slice, in turn, is compressed.

Now, the “normal” thing to do would be to include an extra flag in the Mach-O header, indicating that the subsequent contents are compressed. But Apple is special: instead of the architectural slice containing a discernible Mach-O header (as required by Apple’s own spec!), it contains its own little special header that indicates the kind of compression and provides some other metadata.
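To make the layout concrete, here is a minimal sketch (not ruby-macho's actual code) that walks a fat header and checks whether each slice starts with a recognizable Mach-O magic. It relies on the standard fat layout, where the `fat_header` and each 20-byte `fat_arch` entry are stored big-endian. The names `fat_slices` and `macho_header?` are hypothetical helpers invented for this illustration, and the sample bytes are synthetic.

```ruby
FAT_MAGIC = 0xcafebabe

# 32- and 64-bit Mach-O magics, in both byte orders, as seen when the
# first four bytes of a slice are read as a big-endian integer.
MACHO_MAGICS = [0xfeedface, 0xfeedfacf, 0xcefaedfe, 0xcffaedfe].freeze

FatArch = Struct.new(:cputype, :cpusubtype, :offset, :size, :align)

# Hypothetical helper: returns one FatArch per slice in a fat Mach-O.
def fat_slices(bytes)
  magic, nfat_arch = bytes.unpack("N2")
  raise ArgumentError, "not a fat Mach-O" unless magic == FAT_MAGIC

  # fat_arch entries follow the 8-byte fat_header; each is five uint32s.
  Array.new(nfat_arch) do |i|
    FatArch.new(*bytes.byteslice(8 + (i * 20), 20).unpack("N5"))
  end
end

# Hypothetical helper: does the slice begin with a normal Mach-O header?
def macho_header?(bytes, arch)
  MACHO_MAGICS.include?(bytes.byteslice(arch.offset, 4).unpack1("N"))
end

# Synthetic example: one x86_64 slice at offset 28 whose first four bytes
# are not a Mach-O magic, as with the prelinked kernel.
fat = [FAT_MAGIC, 1, 0x01000007, 3, 28, 4, 0].pack("N7") + "comp"
arch = fat_slices(fat).first
puts macho_header?(fat, arch)
```

On a real prelinked kernel, the same check would be the cue to stop treating the slice as an ordinary Mach-O and look for the special header instead.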

From kext_tools/kernelcache.h:

#define COMP_TYPE_LZSS      'lzss'
#define COMP_TYPE_FASTLIB   'lzvn'

// prelinkVersion value >= 1 means KASLR supported
typedef struct prelinked_kernel_header {
    uint32_t  signature;
    uint32_t  compressType;
    uint32_t  adler32;
    uint32_t  uncompressedSize;
    uint32_t  compressedSize;
    uint32_t  prelinkVersion;
    uint32_t  reserved[10];
    char      platformName[PLATFORM_NAME_LEN]; // unused
    char      rootPath[ROOT_PATH_LEN];         // unused
    char      data[0];
} PrelinkedKernelHeader;

signature here is our magic in lieu of the normal Mach-O magic; it’s always 0x636f6d70 (i.e., "comp"), while compressType is either 0x6c7a7373 ("lzss"; COMP_TYPE_LZSS) or 0x6c7a766e ("lzvn"; COMP_TYPE_FASTLIB) depending on the kind of compression used. The rest is bookkeeping and/or documented as unused; the actual compressed contents presumably follow as the compressedSize bytes within the flexible data member.
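As a rough sketch of how that header could be read from Ruby: the snippet below decodes the six leading `uint32_t` fields and slices out the compressed payload. Several things here are assumptions rather than facts from the header file quoted above: that the fields are stored big-endian, and that `PLATFORM_NAME_LEN` and `ROOT_PATH_LEN` are 64 and 256 (which would put `data` at offset 384). The helper names are hypothetical, and the sample payload is a placeholder, not real LZVN data.

```ruby
PrelinkedHeader = Struct.new(:signature, :compress_type, :adler32,
                             :uncompressed_size, :compressed_size,
                             :prelink_version)

COMP_SIGNATURE = 0x636f6d70 # "comp"
COMPRESS_TYPES = { 0x6c7a7373 => :lzss, 0x6c7a766e => :lzvn }.freeze

# ASSUMPTION: 16 uint32 fields (6 named + 10 reserved), followed by
# platformName (assumed 64 bytes) and rootPath (assumed 256 bytes).
ASSUMED_DATA_OFFSET = (16 * 4) + 64 + 256

# Hypothetical helper: decode the leading fields (assumed big-endian).
def parse_prelinked_header(slice)
  raise ArgumentError, "slice too short" if slice.bytesize < 24

  header = PrelinkedHeader.new(*slice.unpack("N6"))
  unless header.signature == COMP_SIGNATURE
    raise ArgumentError, "missing 'comp' signature"
  end

  header
end

# Hypothetical helper: the compressedSize bytes starting at `data`.
def compressed_payload(slice, header, data_offset: ASSUMED_DATA_OFFSET)
  slice.byteslice(data_offset, header.compressed_size)
end

# Synthetic slice: header fields, zero padding, then a 5-byte placeholder.
fields = ["comp", "lzvn", 0, 4096, 5, 1].pack("a4a4N4")
sample = fields.ljust(ASSUMED_DATA_OFFSET, "\0") + "hello"

header = parse_prelinked_header(sample)
p COMPRESS_TYPES[header.compress_type]
p compressed_payload(sample, header)
```

Keeping the data offset as a keyword argument makes it easy to adjust if the assumed lengths turn out to be wrong for a particular kernelcache.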

So, what can we do about this? If it’s LZSS, then it’s no problem: there are plenty of LZSS implementations on the web. But what the hell is LZVN?

“LZVN”

As it turns out, Apple went and made their own data compression algorithm, way back in 2015: LZFSE.

LZFSE, in turn, contains LZVN, a simpler algorithm that LZFSE defaults to when the size of its input is below a threshold (LZFSE_ENCODE_LZVN_THRESHOLD). However, to make things confusing, Apple appears to use LZVN unconditionally for Mach-O (replacing the older LZSS scheme), even when the input is far above the normal LZFSE threshold.

LZVN is also conspicuously absent from Apple’s own list of supported compression_algorithms 🤔

Fortunately for us, we know that LZFSE requires LZVN, so it has to be implemented somewhere. Even more fortunately, Apple has released a cross-platform reference implementation. Now we just need to pull LZVN out of it.

lzfse.rb

Here’s where my wasted Monday comes in: I spent a couple of hours[5] writing Ruby bindings for the reference LZFSE implementation, including the ~spicy~ internal APIs for LZVN-only compression and decompression. You can find the bulk of that code (mixed in with Ruby’s C API) here. It probably has bugs.

lzfse.rb is available on RubyGems, so you can yoink it with gem:

$ gem install lzfse

Once you have it, there are only 4 public APIs, each of which takes a String and returns a new String:

require "lzfse"

# LZFSE (de)compression
LZFSE.lzfse_compress
LZFSE.lzfse_decompress

# LZVN (de)compression
LZFSE.lzvn_compress
LZFSE.lzvn_decompress

There’s also LZFSE::EncodeError and LZFSE::DecodeError, which the compression and decompression APIs will raise, respectively, on errors.
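A small usage sketch, assuming the LZVN methods are spelled `lzvn_compress` and `lzvn_decompress` and that each takes a String and returns a new one, as described above. The helpers `lzvn_round_trip` and `try_lzvn_decompress` are hypothetical wrappers written for this example; they are not part of the gem. The `require` is guarded so the snippet still loads where the gem isn't installed.

```ruby
begin
  require "lzfse"
rescue LoadError
  warn "lzfse gem not installed; skipping the example calls"
end

# Hypothetical wrapper: compress with LZVN, then decompress the result.
def lzvn_round_trip(data)
  LZFSE.lzvn_decompress(LZFSE.lzvn_compress(data))
end

# Hypothetical wrapper: return nil instead of raising on undecodable input.
def try_lzvn_decompress(blob)
  LZFSE.lzvn_decompress(blob)
rescue LZFSE::DecodeError
  nil
end

if defined?(LZFSE)
  # A long, repetitive input, so the compressor has something to work with.
  input = "an easily compressible string " * 64
  puts lzvn_round_trip(input) == input
end
```

Rescuing `LZFSE::DecodeError` at the call site is one way to treat a malformed blob as "no data" rather than letting the exception propagate; whether that's appropriate depends on the caller.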

Now, the real task is to actually use the LZVN decompressor. You can follow me doing that work in ruby-macho#370. But, for the time being, it seems to be working!

require "macho"

file = MachO::FatFile.new(
  "/Library/Apple/System/Library/PrelinkedKernels/prelinkedkernel",
  decompress: true
)

  1. i.e., a somewhat generic container format for holding machine code and other program information in various states. Mach-O is to macOS as ELF is to Linux/BSD, as PE(32+) is to Windows. 

  2. Mach-O is unusual in that it distinguishes “shared objects” from “bundles,” where other object formats treat them the same. This distinction has virtually no bearing on the rest of this post, except for one curiosity: if you choose to build lzfse.rb for yourself, you’ll notice that it gets built as a bundle (lzfse.bundle) instead of a shared library (lzfse.dylib). 

  3. Sometimes even before they publicly announce the hardware details of a particular product! 

  4. i.e., primarily not userspace contexts where ruby-macho might conceivably be used by someone. 

  5. It shouldn’t have taken me this long, but I haven’t written a native Ruby extension in a while. 

