Feb 11, 2023 Tags: c, programming, rust, security
Moxie Marlinspike’s Cryptographic Doom Principle is well-known in cryptography circles, and reads as follows:
if you have to perform any cryptographic operation before verifying the MAC on a message you’ve received, it will somehow inevitably lead to doom.
I’ve decided to be a little inflammatory today1 and offer a parallel principle: developing software in an “unsafe” programming language2 will somehow inevitably lead to doom.
Well-known (and good!) analyses of safe and unsafe languages focus on aspects like memory corruption, about which a great deal of ink has been spilled.
The responses to these analyses are also somewhat well known.
Rather than re-litigating these points (others have done an excellent job already), I’ll focus on other, though related, ways in which development in unsafe languages inevitably leads to doom.
Whether their practitioners will admit it or not3, there’s a certain amount of cool machismo associated4 with low-level and systems programming, especially when that programming is in unsafe languages.
The effect is comparable to motorcycles, cigarettes, and bald French continental philosophers: insufferable when anybody else uses (or quotes) them, but irresistible for our own purposes (and to our own egos).
This machismo encourages a kind of false confidence, the same kind that causes people to ride motorcycles without helmets, chain smoke cigarettes, and quote Foucault to unimpressed first dates.
In the context of unsafe languages this false confidence manifests in myriad ways, each of which leads to doom:
Confident insistence that C is a “high level assembler,” meaning that the programmer can treat it primarily as a clever, platform-independent macro expander for the host’s assembly language.
This is both true and not true: C exposes the semantics of the underlying architecture, but through the constraining lens of the C abstract machine.
The C abstract machine is in turn defined in precise terms by the C standard5, and imposes all kinds of non-native restrictions on the behavior of programs.
For example: correct C programs must widen and narrow expressions even when their naive machine representation does not require widening or narrowing. This means that the following6:
```c
char c1, c2;
/* ... */
c1 = c1 + c2;
```
…cannot be naively lowered into an addition of `char`-sized operands; the implementation must behave as if `c1` and `c2` are first widened to `int`, then added, then narrowed back to `char`.

Assuming otherwise leads to doom: it’s easy to miss that `c1 + c2` cannot observably overflow under C’s abstract machine semantics, and write an incorrect (and exploitable) program based on that incorrect assumption.
Confident insistence that a C program “says what it does,” with no significant deviations from what a hand-written assembly equivalent would do (modulo constraints imposed by the abstract machine).
As implied above, this too is wrong: the C standard doesn’t say “follow the machine’s semantics closely, deviating only when required.” It says “behave as if the program is executed on the abstract machine.”
This means, in particular, that C compilers are free to aggressively remove variables, expressions, and even entire functions so long as the compiler can prove that the program’s behavior is no different than it would be on the abstract machine.
Assuming otherwise leads to doom: forgetting to mark memory that’s been mapped to I/O or a device as `volatile` means that the compiler is free to observe that the values at that memory never change, and thus not to load from them beyond whatever initial values can be inferred.
Confident insistence that the compiler will perform all “obvious” optimizations, such as deduplicating identical functions across (or even within) translation units, or fully taking advantage of alias analysis during optimization.
Once again this is wrong: the C standard prevents many reasonable optimizations, largely because they cannot be proven sound under the abstract machine.
Function deduplication and generalized alias analysis fall under this category:
C requires that all functions have unique addresses, meaning that inadvertently taking the address of a function (e.g. in a dispatch table) can pessimize code elimination.
C cannot generally deduce that `foo_t *x` and `foo_t *y` in the same scope must or must not alias, meaning that it cannot optimize loads and stores between the two under either assumption: it must instead pessimistically assume that `x` and `y` may alias.
Assuming otherwise leads to both performance doom and security doom: performance doom because more code means slower code, and security doom because C compilers do perform aggressive alias analysis and optimize expressions when confident C programmers least expect it.
In some cases these confidently incorrect beliefs can be automatically caught. But the culture of false confidence cannot itself be caught: C is a motorcycle, and turning it into a tricycle removes the “cool” factor that attracts so many people to it.
Instead of being dangerous and fun like cigarettes (or French philosophy), C needs to be dangerous and unfun like sewage systems: something that experts handle humbly, carefully, and out of necessity, with designs that reflect the societal repercussions of any mistakes.
Previously on this blog: Weird architectures weren’t supported to begin with.
Let’s say, for the sake of argument, that you are an excellent C programmer: you both employ best practices for C development and you’re aware of C’s undefined, unspecified, and implementation-defined behavior. You don’t get caught by the things in the above section because you’re intimately familiar with the C standard and simply know better than to do things the wrong way.
Because you’re an expert, you also know your platform, your compiler, and other sources of variability that the standard explicitly allows: you make valid and sound design decisions based on implementation-defined behavior, because you know that those decisions are correct for the platform being targeted.
Nothing about this is wrong, but it still leads to doom: someone other than you will compile your code for a platform and with a compiler that you don’t expect. Your behavior is still implementation-defined, and that implementation is wrong from your program’s perspective:

- the `sizeof(...)` expression that you (correctly!) proved would never calculate an OOB pointer now does exactly that;
- `char` has gone from unsigned to signed, meaning that previously valid calls to `isdigit`, etc. are now undefined behavior.
Even when you meticulously and scrupulously avoid implementation-defined behavior, doom still happens: compilers and optimizations change, and C has no stable ABI to fall back on7 to ensure that linkages between different objects occur correctly. Every linkage against a system or vendor-provided object is a gamble and you will eventually lose that gamble, whether you play or not.
These are not the only sources of doom in unsafe programming languages; they’re not even the primary sources (as mentioned, unsafety itself is quantifiable and identifiable with specific language features).
The fact that unsafe languages lead to doom is of course not evidence that safe programming languages don’t lead to doom. Rewriting mature components in new languages tends to reveal undocumented assumptions and invariants, which then become potential security concerns when the rewrite no longer maintains or respects them.
Take your pick; the most commonly cited ones are C and C++. The rest of the post assumes a language roughly resembling C in terms of unsafety and abstract semantics. ↩
Not admitting it is part of the machismo. ↩
By other programmers. This is all stupid, dorky stuff in any other context. ↩
§5.1.2.3 “Program execution” in C17. ↩
Borrowed directly from the C17 standard, §5.1.2.3, Example 2. ↩
Nor does Rust, for that matter, but Rust sidesteps the problem entirely by making you recompile everything at once. ↩