ENOSUCHBLOG

Programming, philosophy, pedaling.


Finding treacherous encoding permutations with baseperm

Feb 3, 2020     Tags: devblog, programming, rust    

This post is at least a year old.

My first tool/library announcement for 2020 (and a rough 5 year blog anniversary, yay!): baseperm.

baseperm is a small Rust tool: all it does is take an input in some baseN alphabet as input (base64 and base32 are currently supported) and emit all permutations of that input that decode to the same thing as the original input.

In other words:

Let’s go into how it works and why it exists.

Theory of operation

baseN encoding schemes take their inputs and map them onto an alternate alphabet, usually one designed to operate in contexts where use of the full byte (or even just ASCII-printable) range would be ambiguous or interfere with the surrounding protocol.

A good example of this is SMTP, the protocol that handles email transfer between hosts. SMTP expects the session to end with a period surrounded by a pair of CRLFs, like so:

1
\r\n.\r\n

or, in hexadecimal:

1
\x13\x10\x46\x13\x10

This byte sequence could plausibly occur in a binary email attachment (or even as a typo in the main email body if using CRLF line endings), causing SMTP to prematurely hang up the connection and truncate the letter.

That would really suck, so we encode our attachments with an alphabet that can’t contain those bytes. Because that alphabet is necessarily smaller than the 8-bit (or 7-bit, if we’re just talking about ASCII) input space, encoding each byte requires more than one symbol:

symbols bits bytes per symbol
base32 32 5 0.625
base64 64 6 0.75
base58 58 approx. 5.86 approx. 0.73

As a result, not all encoded inputs are 8-bit aligned. To align things for representation as printable characters, padding bits are used: 4 bytes (32 bits) of input require 6 base64 symbols (36 bits), leaving 4 bits unused and meaningless from the perspective of the decoder.

Consequently, there are actually 16 base64 representations of a 4 byte input. For “abcd”:

  YWJjZA YWJjZB YWJjZC YWJjZD
YWJjZE YWJjZF YWJjZG YWJjZH
YWJjZI YWJjZJ YWJjZK YWJjZL
YWJjZM YWJjZN YWJjZO YWJjZP
  

(A small challenge: convince yourself that a zero-padded base64 encoding of any 4 bytes will always end in one of A, Q, g, or w.)

All baseperm does is automate the discovery and output of these permutations.

But why?

Most programmers are only vaguely aware of alignment and padding considerations when encoding to an alphabet smaller than their input. Even fewer translate that awareness into the following:

In other words: the inequality of arbitrary baseN strings does not imply the inequality of their underlying inputs — we can contrive baseN strings that are different (due to padding) that decode to the same input.

This has all sorts of interesting consequences:

Usage

You can get baseperm from cargo:

1
$ cargo install baseperm

Or by building it directly in the repo:

1
2
$ git clone https://github.com/woodruffw/baseperm && cd baseperm
$ cargo build

baseperm’s interface is minimal: it takes a single encoded input on stdin and emits every permuted output on stdout, separated by newlines:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
$ echo -n "some input" | base64 | baseperm
c29tZSBpbnB1dA==
c29tZSBpbnB1dB==
c29tZSBpbnB1dC==
c29tZSBpbnB1dD==
c29tZSBpbnB1dE==
c29tZSBpbnB1dF==
c29tZSBpbnB1dG==
c29tZSBpbnB1dH==
c29tZSBpbnB1dI==
c29tZSBpbnB1dJ==
c29tZSBpbnB1dK==
c29tZSBpbnB1dL==
c29tZSBpbnB1dM==
c29tZSBpbnB1dN==
c29tZSBpbnB1dO==
c29tZSBpbnB1dP==

You can use -e, --encoding to set the encoding to something other than base64 (the default):

1
2
3
$ echo -n "another input" | base32 | baseperm -e base32
MFXG65DIMVZCA2LOOB2XI===
MFXG65DIMVZCA2LOOB2XJ===

See baseperm --help for the full list of supported encodings.