ENOSUCHBLOG

Programming, philosophy, pedaling.


Python wheel filenames have no canonical form

Jun 12, 2024     Tags: curiosity, programming, python, security    


This short(-ish) post is a successor to 2022’s a most vexing parse, but for Python packaging. I discovered it the other day while doing what I normally do: mucking through the guts of Python packaging.

The TL;DR: despite extensive canonicalization and normalization rules for Python package names and Python package versions, there is no canonicalization rule for wheel filenames.

This results in a funny situation: to know whether two wheel filenames are “equivalent” (i.e. accurately represent the same wheel), one must parse both filenames, rather than canonicalize either (or both) for a direct string comparison:

>>> name1 = "foo-1.2.3-py3-none-any.whl"
>>> name2 = "bar-1.2.3-py3-none-any.whl"
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
False

This distinction between parsing and canonicalizing for comparison rarely matters in practice[1], but it’s a curiosity that I couldn’t find any other resources on. Hence this post.

Quick recap

Read the 2022 post for a full overview of Python packaging terminology and standards.

The very brief version: Python has two package formats: sdists for source distributions, and wheels for “built” distributions. The difference between the two is subtle since wheels are themselves sometimes distributions of source code, but essentially boils down to “a wheel is a partially precomputed source distribution that doesn’t need to run arbitrary code to be installed.”

sdists and wheels both have filename formats, defined in their respective living specifications. For historical and practical reasons, the wheel filename format is much more complicated than the sdist format, and includes metadata that the sdist doesn’t require in its filename: things like the Python version that the wheel is compatible with, the architecture it’s targeting, and so forth.

wheel filenames look like this:

# {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
foo-1.2.3-py3-none-any.whl
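To make the format concrete, here is a minimal standard-library sketch that splits a wheel filename into its raw fields. The names `WheelFields` and `split_wheel_filename` are hypothetical helpers invented for this illustration, and the sketch deliberately does no validation or normalization of the individual fields.

```python
from typing import NamedTuple, Optional


class WheelFields(NamedTuple):
    # Hypothetical container for the raw (un-normalized) filename fields.
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def split_wheel_filename(filename: str) -> WheelFields:
    """Split a wheel filename into raw fields, without normalizing anything."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag present.
        dist, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"expected 5 or 6 '-'-separated fields: {filename!r}")
    return WheelFields(dist, version, build, python, abi, platform)


fields = split_wheel_filename("foo-1.2.3-py3-none-any.whl")
print(fields.python, fields.abi, fields.platform)  # py3 none any
```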

Canonicalization

At various points during the standardization of Python packaging, people have realized that being able to canonicalize various packaging identifiers (package names, versions) was useful to the expanding ecosystem. In particular, canonical forms:

The wheel specification references two other living specifications for how to canonicalize its distribution and version components: name normalization (derived from PEP 503) and version specifiers (derived from PEP 440).

This leads to our first (but not last or most interesting!) example of a lack of a canonical form: in practice, wheel filenames are strict in requiring valid package names and versions, but not[3] in requiring normalized variants of each:

>>> # these represent the same wheel, since package names are not case sensitive!
>>> name1 = "foo-1.2.3-py3-none-any.whl"
>>> name2 = "Foo-1.2.3-py3-none-any.whl"
>>> name1 == name2
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
True

…and similarly for versions:

>>> # `.alpha1` is a pre-normalized form of `a1`
>>> name1 = "foo-1.2.3.alpha1-py3-none-any.whl"
>>> name2 = "Foo-1.2.3a1-py3-none-any.whl"
>>> name1 == name2
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
True
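For reference, the name half of this is small enough to sketch with the standard library alone. The rule below is the PEP 503 one (lowercase, with runs of `-`, `_` and `.` collapsed to a single `-`). The function name `normalize_name` is my own for this illustration. Version normalization under PEP 440 is considerably more involved, so this sketch leaves it out.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a package name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Different spellings of the same package collapse to one canonical name.
assert normalize_name("Foo") == normalize_name("foo") == "foo"
assert normalize_name("Foo_Bar.baz") == "foo-bar-baz"
```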

You might note, however, that this is a minor hiccup for an application that needs a canonical form: just ensure that the package and version are in fact normalized and the wheel filename will also be normalized as a constructive property.

Right?

Tags and compressed tag sets

Not so right, sadly.

Taking another look at the wheel filename format, there are four more fields that we haven’t yet accounted for: the optional build tag, the Python tag, the ABI tag, and the platform tag.

Together, this makes it sound like these fields pose no challenge: if they can’t be reordered or reduced, then they are already effectively canonical!

This is where the wheel specification leaves out a critical detail: a wheel can be tagged multiple times via compressed tag sets.

Compressed tag sets are not mentioned anywhere in the living wheel specification, but they’re part of it. They’re documented in their own living standard, for platform compatibility tags, which defines them as follows:

To allow for compact filenames of bdists that work with more than one compatibility tag triple, each tag in a filename can instead be a ‘.’-separated, sorted, set of tags. For example, pip, a pure-Python package that is written to run under Python 2 and 3 with the same source code, could distribute a bdist with the tag py2.py3-none-any. The full list of simple tags is:

for x in pytag.split('.'):
    for y in abitag.split('.'):
        for z in archtag.split('.'):
            yield '-'.join((x, y, z))

A bdist format that implements this scheme should include the expanded tags in bdist-specific metadata. This compression scheme can generate large numbers of unsupported tags and “impossible” tags that are supported by no Python implementation e.g. “cp33-cp31u-win64”, so use it sparingly.

This is an eminently reasonable feature: wheels are frequently compatible with more than one set of tags at a time, and the packaging tooling needs to be able to infer this without having to download, decompress, and extract the wheel’s interior metadata.
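The loop quoted from the specification is only a fragment, so here is a runnable sketch of the same expansion. The function name `expand_tags` is hypothetical. Returning a set rather than yielding strings is my choice, made so that order and duplicates inside each compressed set stop mattering.

```python
from itertools import product


def expand_tags(pytag: str, abitag: str, archtag: str) -> set:
    """Expand compressed tag sets into the set of simple tag triples."""
    return {
        "-".join(triple)
        for triple in product(
            pytag.split("."), abitag.split("."), archtag.split(".")
        )
    }


print(sorted(expand_tags("py2.py3", "none", "any")))
# ['py2-none-any', 'py3-none-any']
```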

Unfortunately, this reasonable feature leads to our second (and more interesting!) source of non-canonicalization:

>>> # each name and version is the same, but the compressed tags vary (equivalently!)
>>> name1 = "foo-1.2.3-py3.py2-none-any.whl"
>>> name2 = "foo-1.2.3-py2.py3-none-any.whl"
>>> name3 = "foo-1.2.3-py2.py3-none.none-any.whl"
>>> name1 == name2 == name3
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2) == parse_wheel_filename(name3)
True

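Here, the lack of canonicalization takes two forms: the tags within a compressed set can appear in any order (py3.py2 versus py2.py3), and a tag can be repeated within a set (none.none versus none).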

This lack of canonicalization is more profound than the former: because there’s no standard canonical form for compressed tag sets, any wheel that makes use of them has no standard canonical filename.

Does this matter?

In an ordinary setting, absolutely not. Most users don’t need to know anything about sdists, wheels, or anything interior to pip install example, which itself happily parses both sdist and wheel filenames regardless of their canonicalization.

In principle, however, the absence of a canonical form for wheel filenames makes them poorly suited as a domain key. For example, in a cryptographic setting, I might sign over a contrived payload of the form:

{dist-filename}:{hash}

…thereby binding the distribution’s contents (as a strong cryptographic hash) to its filename, which in turn ensures a binding to the package and version identifiers.

To correctly verify the resulting signature I would need to reconstruct or collect the exact same dist-filename. But there’s no guarantee that I can, since there’s no canonical form[4]!
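A small sketch of the failure mode, under assumptions: SHA-256 stands in for "a strong cryptographic hash", `build_payload` is a hypothetical helper, and the actual signing step is omitted, since mismatched payload bytes already guarantee that verification fails.

```python
import hashlib


def build_payload(dist_filename: str, contents: bytes) -> bytes:
    """Build the contrived '{dist-filename}:{hash}' payload to sign over."""
    digest = hashlib.sha256(contents).hexdigest()
    return f"{dist_filename}:{digest}".encode()


contents = b"pretend these are the wheel's bytes"

# The signer and the verifier see equivalent, but differently spelled, names.
signed = build_payload("foo-1.2.3-py2.py3-none-any.whl", contents)
rebuilt = build_payload("foo-1.2.3-py3.py2-none-any.whl", contents)

# Same wheel, same hash, different payload: the signature would not verify.
assert signed != rebuilt
```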

Fixing it

As established above, this doesn’t matter very much, and probably shouldn’t be a particularly high priority in terms of things that need fixing in Python packaging.

With that being said, it’s also not very difficult to fix:

  1. The wheel specification should mention that the Python, ABI, and platform tags can be in an alternate compressed form, and link to the platform compatibility tags specification.
  2. The platform compatibility tags specification could mention an optional normalization for each compressed tag set, e.g. sorted(set(tags)) with the (currently correct, and trivially enforceable) assumption that tags are all UTF-8.

This would allow a (motivated) user to fully canonicalize a wheel filename, such that any canonicalized form would be reproducible from any other equivalent form.
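As a rough sketch of what that could look like (not a proposal for the actual specification text): the hypothetical `canonicalize_wheel_filename` below applies `sorted(set(tags))` to each compressed tag set and lowercases the distribution name, collapsing separator runs to `_` as wheel filenames expect. It leaves the version and build tag untouched, which is an explicit simplification, since PEP 440 normalization needs a real version parser.

```python
import re


def canonicalize_wheel_filename(filename: str) -> str:
    """Return a canonical spelling of a wheel filename (version left as-is)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 '-'-separated fields: {filename!r}")
    # `build` is an empty list when the optional build tag is absent.
    dist, version, *build, python, abi, platform = parts
    # Assumption: names are escaped with '_' in filenames, then lowercased.
    dist = re.sub(r"[-_.]+", "_", dist).lower()
    # The suggested normalization: sorted(set(tags)) for each compressed set.
    tags = [".".join(sorted(set(t.split(".")))) for t in (python, abi, platform)]
    return "-".join([dist, version, *build, *tags]) + ".whl"


names = [
    "foo-1.2.3-py3.py2-none-any.whl",
    "foo-1.2.3-py2.py3-none-any.whl",
    "Foo-1.2.3-py2.py3-none.none-any.whl",
]
# Every equivalent spelling collapses to a single string.
assert {canonicalize_wheel_filename(n) for n in names} == {
    "foo-1.2.3-py2.py3-none-any.whl"
}
```

With something like this in place, a direct string comparison of canonicalized filenames would suffice for the tag and name fields, and only the version would still need parsing.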


  1. Although it can matter, read on. 

  2. Python packaging is fond of making these assumptions in all kinds of little bespoke syntaxes, such as the #egg= fragment. So it’s good that they’re standardized! 

  3. The binary distribution format says “should” for both name and version normalization, not “must.” 

  4. There’s a strong argument that this is just a plain bad signing design, and that newer signing designs should avoid unnecessary canonicalization. Which is frequently true.