Jun 12, 2024 Tags: curiosity, programming, python, security
This short(-ish) post is a successor to 2022’s a most vexing parse, but for Python packaging. I discovered it the other day while doing what I normally do: mucking through the guts of Python packaging.
The TL;DR: despite extensive canonicalization and normalization rules for Python package names and Python package versions, there is no canonicalization rule for wheel filenames.
This results in a funny situation: to know whether two wheel filenames are “equivalent” (i.e. accurately represent the same wheel), one must parse both filenames, rather than canonicalize either (or both) for a direct string comparison:
>>> from packaging.utils import parse_wheel_filename
>>> name1 = "foo-1.2.3-py3-none-any.whl"
>>> name2 = "bar-1.2.3-py3-none-any.whl"
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
False
This distinction between parsing and canonicalizing for comparison rarely matters in practice[1], but it’s a curiosity that I couldn’t find any other resources on. Hence this post.
Read the 2022 post for a full overview of Python packaging terminology and standards.
The very brief version: Python has two package formats: sdists for source distributions, and wheels for “built” distributions. The difference between the two is subtle since wheels are themselves sometimes distributions of source code, but essentially boils down to “a wheel is a partially precomputed source distribution that doesn’t need to run arbitrary code to be installed.”
sdists and wheels both have filename formats, defined in their respective living specifications. For historical and practical reasons, the wheel filename format is much more complicated than the sdist format, and includes metadata that the sdist doesn’t require in its filename: things like the Python version that the wheel is compatible with, the architecture it’s targeting, and so forth.
wheel filenames look like this:
# {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
foo-1.2.3-py3-none-any.whl
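To make the format concrete, here is a minimal sketch that splits a wheel filename into its raw fields using only the standard library. The names `WheelParts` and `split_wheel_filename` are hypothetical helpers of mine, not part of any packaging library, and the sketch does no validation or normalization of the individual fields.

```python
from typing import NamedTuple, Optional


class WheelParts(NamedTuple):
    # Hypothetical container for the raw (un-normalized) filename fields.
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def split_wheel_filename(filename: str) -> WheelParts:
    """Split a wheel filename into its raw components, without normalizing them."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # The optional build tag is absent.
        name, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return WheelParts(name, version, build, python, abi, platform)


print(split_wheel_filename("foo-1.2.3-py3-none-any.whl"))
```

Real tooling such as `packaging.utils.parse_wheel_filename` goes further and parses each field; this sketch only shows where the field boundaries are.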
At various points during the standardization of Python packaging, people have realized that being able to canonicalize various packaging identifiers (package names, versions) was useful to the expanding ecosystem. In particular, canonical forms:
The wheel specification references two other living specifications for how to canonicalize its distribution and version components: name normalization (derived from PEP 503) and version specifiers (derived from PEP 440).
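For reference, the name half of that is small enough to sketch with the standard library alone. The function name `normalize_name` is my own; the regular expression follows the PEP 503 rule (runs of `-`, `_` and `.` collapse to a single `-`, then lowercase). Version normalization is considerably more involved and is left out here.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a package name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_name("Foo"))          # foo
print(normalize_name("Foo.Bar_baz"))  # foo-bar-baz
```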
This leads to our first (but not last or most interesting!) example of a lack of a canonical form: in practice, wheel filenames are strict in requiring valid package names and versions, but not[3] in requiring normalized variants of each:
>>> # these represent the same wheel, since package names are not case sensitive!
>>> name1 = "foo-1.2.3-py3-none-any.whl"
>>> name2 = "Foo-1.2.3-py3-none-any.whl"
>>> name1 == name2
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
True
…and similarly for versions:
>>> # `.alpha1` is a non-normalized spelling of `a1`
>>> name1 = "foo-1.2.3.alpha1-py3-none-any.whl"
>>> name2 = "Foo-1.2.3a1-py3-none-any.whl"
>>> name1 == name2
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2)
True
You might note, however, that this is a minor hiccup for an application that needs a canonical form: just ensure that the package and version are in fact normalized and the wheel filename will also be normalized as a constructive property.
Right?
Not so right, sadly.
Taking another look at the wheel filename format, there are four more fields that we haven’t yet accounted for:
- build tag. In practice this is a mostly free-form string with no interior structure besides a leading digit (such as 1ubuntu), meaning that it’s already in an irreducible normal form. Whoopee!
- python tag, abi tag, and platform tag. Each of these is disjoint, irreducible and can’t be swapped with each other (meaning that none-py3-any is not a valid substitution for the correct py3-none-any).

Together, this makes it sound like these fields pose no challenge: if they can’t be reordered or reduced, then they are already effectively canonical!
This is where the wheel specification leaves out a critical detail: a wheel can be tagged multiple times via compressed tag sets.
Compressed tag sets are not mentioned anywhere in the living wheel specification, but they’re part of it. They’re documented in their own living standard, for platform compatibility tags, which defines them as follows:
To allow for compact filenames of bdists that work with more than one compatibility tag triple, each tag in a filename can instead be a ‘.’-separated, sorted, set of tags. For example, pip, a pure-Python package that is written to run under Python 2 and 3 with the same source code, could distribute a bdist with the tag py2.py3-none-any. The full list of simple tags is:

    for x in pytag.split('.'):
        for y in abitag.split('.'):
            for z in archtag.split('.'):
                yield '-'.join((x, y, z))

A bdist format that implements this scheme should include the expanded tags in bdist-specific metadata. This compression scheme can generate large numbers of unsupported tags and “impossible” tags that are supported by no Python implementation e.g. “cp33-cp31u-win64”, so use it sparingly.
This is an eminently reasonable feature: wheels are frequently compatible with more than one set of tags at a time, and the packaging tooling needs to be able to infer this without having to download, decompress, and extract the wheel’s interior metadata.
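The expansion loop in the quoted specification can be wrapped into a runnable sketch like the one below. `expand_tags` is a hypothetical helper name of mine; it returns a set because the spec describes the result as a set of simple tags, and the second example's tags are only illustrative.

```python
from itertools import product


def expand_tags(pytag: str, abitag: str, archtag: str) -> frozenset:
    """Expand three compressed tag sets into the set of simple tag triples."""
    return frozenset(
        "-".join(triple)
        for triple in product(pytag.split("."), abitag.split("."), archtag.split("."))
    )


print(sorted(expand_tags("py2.py3", "none", "any")))
# ['py2-none-any', 'py3-none-any']

# Two python tags times two platform tags gives four simple tags.
print(len(expand_tags("cp38.cp39", "abi3", "manylinux1_x86_64.win_amd64")))
# 4
```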
Unfortunately, this reasonable feature leads to our second (and more interesting!) source of non-canonicalization:
>>> # each name and version is the same, but the compressed tags vary (equivalently!)
>>> name1 = "foo-1.2.3-py3.py2-none-any.whl"
>>> name2 = "foo-1.2.3-py2.py3-none-any.whl"
>>> name3 = "foo-1.2.3-py2.py3-none.none-any.whl"
>>> name1 == name2 == name3
False
>>> parse_wheel_filename(name1) == parse_wheel_filename(name2) == parse_wheel_filename(name3)
True
Here, the lack of canonicalization takes two forms:

- py2.py3 and py3.py2 are both equally normal forms with the same meaning.
- py2.py2.py2 or none.none.none or linux_x86_64.linux_x86_64 are all correct (but wasteful) compressed sets.

This lack of canonicalization is more profound than the former: because there’s no standard canonical form for compressed tag sets, any wheel that makes use of them has no standard canonical filename.
Does any of this matter? In an ordinary setting, absolutely not. Most users don’t need to know anything about sdists, wheels, or anything interior to pip install example, which itself happily parses both sdist and wheel filenames regardless of their canonicalization.
In principle, however, the absence of a canonical form for wheel filenames makes them poorly suited as a domain key. For example, in a cryptographic setting, I might sign over a contrived payload of the form:
{dist-filename}:{hash}
…thereby binding the distribution’s contents (as a strong cryptographic hash) to its filename, which in turn ensures a binding to the package and version identifiers.
To correctly verify the resulting signature I would need to reconstruct or collect the exact same dist-filename. But there’s no guarantee that I can, since there’s no canonical form[4]!
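A small sketch of the problem, with the signature itself left out: the payload bytes alone already diverge. `signing_payload` is a hypothetical helper, SHA-256 is just my stand-in for “a strong cryptographic hash”, and the contents are dummy bytes rather than a real wheel.

```python
import hashlib


def signing_payload(dist_filename: str, contents: bytes) -> bytes:
    """Build the contrived `{dist-filename}:{hash}` payload described above."""
    digest = hashlib.sha256(contents).hexdigest()
    return f"{dist_filename}:{digest}".encode("utf-8")


contents = b"stand-in for the wheel's bytes"

# The signer and the verifier see equivalent, but differently spelled, filenames.
signed = signing_payload("foo-1.2.3-py2.py3-none-any.whl", contents)
rebuilt = signing_payload("foo-1.2.3-py3.py2-none-any.whl", contents)

print(signed == rebuilt)  # False
```

Same contents, same wheel by any parsed comparison, yet the verifier would be checking a signature over different bytes.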
As established above, this doesn’t matter very much, and probably shouldn’t be a particularly high priority in terms of things that need fixing in Python packaging.
With that being said, it’s also not very difficult to fix: names and versions could be required to be normalized, and compressed tag sets could be required to be sorted and deduplicated, i.e. sorted(set(tags)) with the (currently correct, and trivially enforceable) assumption that tags are all ASCII.

This would allow a (motivated) user to fully canonicalize a wheel filename, such that any canonicalized form would be reproducible from any other equivalent form.
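Here is roughly what that could look like, as a sketch rather than a proposal for any real tool. Both function names are hypothetical. I’m assuming the name is rewritten with underscores (since `-` separates fields in the filename), and I leave the version and build tag untouched, because normalizing versions properly needs a real PEP 440 parser.

```python
import re


def canonicalize_tag_set(tag_set: str) -> str:
    """Sort and deduplicate a '.'-separated compressed tag set."""
    return ".".join(sorted(set(tag_set.split("."))))


def canonicalize_wheel_filename(filename: str) -> str:
    """Canonicalize the name and tag sets of a wheel filename (version left as-is)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components: {filename!r}")
    # Assumption: lowercase the name and collapse separator runs to '_'.
    name = re.sub(r"[-_.]+", "_", parts[0]).lower()
    # parts[1:-3] is the version, plus the build tag when present.
    middle = parts[1:-3]
    tags = [canonicalize_tag_set(t) for t in parts[-3:]]
    return "-".join([name, *middle, *tags]) + ".whl"


for name in (
    "foo-1.2.3-py3.py2-none-any.whl",
    "foo-1.2.3-py2.py3-none-any.whl",
    "foo-1.2.3-py2.py3-none.none-any.whl",
):
    print(canonicalize_wheel_filename(name))
# foo-1.2.3-py2.py3-none-any.whl (three times)
```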
[1] Although it can matter, read on.
[2] Python packaging is fond of making these assumptions in all kinds of little bespoke syntaxes, such as the #egg= fragment. So it’s good that they’re standardized!
[3] The binary distribution format says “should” for both name and version normalization, not “must.”
[4] There’s a strong argument that this is just a plain bad signing design, and that newer signing designs should avoid unnecessary canonicalization. Which is frequently true.