E_NO_SUCH_BLOG

Programming, philosophy, pedaling.


Mini-post: Digraphs and Trigraphs

Apr 2, 2015

Tags: programming

C is an old language.

It’s arguably one of the oldest languages still in widespread (non-specialty) use. Because C was created at a time when keyboard layouts, character sets, and architectures all varied heavily from vendor to vendor, its standards reflect many of the compromises early developers made to get C working on systems not designed for it.

One of those compromises was the inclusion of trigraphs and, later, digraphs.

What are digraphs and trigraphs?

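Simply put, digraphs and trigraphs are multi-character sequences that a C implementation treats as stand-ins for other (normally single-character) symbols.

Trigraphs are the earlier variant, and are defined as three-character sequences (hence tri-graph). Trigraphs are replaced in the very first phase of translation, before the source is even split into tokens, which means that the rest of the preprocessor and the compiler proper never actually see them. It also means they are replaced everywhere, including inside string literals and comments.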

There are 9 standard trigraphs in C:

 Trigraph  |  Converts To
----------------------------
    ??=    |  # (hash)
    ??/    |  \ (backslash)
    ??'    |  ^ (caret)
    ??(    |  [ (l. bracket)
    ??)    |  ] (r. bracket)
    ??!    |  | (bar)
    ??<    |  { (l. brace)
    ??>    |  } (r. brace)
    ??-    |  ~ (tilde)

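Digraphs were introduced later, in the 1995 amendment to the C standard (often called C95), and carried forward into C99. As their name suggests, they consist of two-character sequences each. Unlike their older counterparts, digraphs are not textual replacements: they are alternative spellings of tokens, recognized when the source is tokenized. As a result, a digraph inside a string literal or character constant is left untouched.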

There are 6 standard digraphs:

Digraph  |   Converts To
--------------------------
   <:    |   [ (l. bracket)
   :>    |   ] (r. bracket)
   <%    |   { (l. brace)
   %>    |   } (r. brace)
   %:    |   # (hash)
  %:%:   |   ## (token paste)

Why do they exist?

Back when keyboards were unstandardized and character encodings varied wildly, many programmers found that they (literally) lacked the symbols they needed to write programs in languages that were designed elsewhere and ported to their systems. Digraphs and trigraphs let them spell those characters whenever the system or hardware was incapable of producing the “correct” one.

What can I do with them?

Coupled with other bad things, you can make your programs a nightmare to read:

??=include <stdio.h>
%:define __(s) printf("\x25\x73", s);

int main(void) ??<
	char _x9??(:> = "Please never actually do this??/n";
	// what does this line do ??/
	__(_x9);
%>

Compiled with gcc -std=c99 and no extra warning flags, this raises no warnings. It also doesn’t produce any output, despite __ being defined as a macro around printf. Why? Because that single-line comment actually spans two lines: the ??/ trigraph becomes a backslash, and a backslash at the end of a line splices the next line onto it. The call to __ ends up inside the comment. Remove the trigraph, and the output appears.

Nowadays, there aren’t that many legitimate uses for digraphs and trigraphs. Programmers unfortunate enough to be working on legacy or extremely niche systems may find themselves using them when working with limited character sets, but the rest of us are just fine with our ASCII-derived charsets and standard 101- to 105-key keyboards.

Summary

Digraphs and trigraphs are an interesting vestige of earlier times, but they aren’t very useful to the average programmer these days*.

If you are the kind of person who likes self-flagellation and/or obfuscation golf, however, digraphs and trigraphs can be valuable tools to give your programs that extra ounce of inscrutability. They’re a staple in many obfuscation competitions, and you’ll probably even find them tucked away in older cross-platform programs.

Happy hacking!

- William

Postnotes:

* C is not the only language with {di,tri}graphs. Other languages, like Pascal, actually use them fairly extensively in their “normal” syntax. C++ also has its own pseudo n-graphs: alternative tokens such as and, or, not, and other English words for logical and bitwise operators. These can also be used in C, where they are provided as macros by the iso646.h header.