May 9, 2022 Tags: devblog, programming, python
This post is about a minor parsing ambiguity in the Python packaging ecosystem, one that cannot be resolved without additional context that isn’t captured by the packaging standards.
A great deal of emphasis is on minor: this ambiguity appears in relatively old package distributions, and is addressed by both package name normalization (PEP 426) and version normalization (PEP 440), each of which builds off of earlier “canonical” specifications of distribution names and versions.
In other words: it’s not really a problem for much of the ecosystem! But if you’re like me and you love an ambiguously decidable parse, you might find it interesting.
If you’re a C++ programmer, you might have heard of the most vexing parse.
For those who haven’t: it’s a syntactic ambiguity in C and C++ grammars, one that can’t be solved (in a standards-compliant manner) with parser context. It stems from C and C++’s declaration syntax:
// declares a function `foo`, taking no arguments and returning an integer.
int foo();
// declares a function `bar`, taking no arguments and returning a std::string.
std::string bar();
C and C++’s declaration semantics are uniform[1]: there’s no difference between declaring variables and declaring functions, except in the type bound to the identifier.
This has a notable consequence: both variables and functions can be declared anywhere that either is valid, including within the bodies of other functions.
The simplest version of the most vexing parse derives directly from this rule. Consider the following C++:
int main(void) {
    int foo();
    foo = 1;
    return foo;
}
What is foo here? It could be either:
- A variable foo of type int, with default (zero) initialization (0 for int, but maybe something nontrivial for a non-primitive type)
- A function foo of type int (void)
In context, a programmer who wrote this almost certainly intended the first interpretation: it’s unusual to declare a function within the body of another function, whereas value initialization is common. But C++ is unforgiving: the second interpretation is required to keep uniform declaration truly uniform, and so we get:
<source>: In function 'int main()':
<source>:5:10: warning: empty parentheses were disambiguated as a function declaration [-Wvexing-parse]
5 | int foo();
| ^~
<source>:5:10: note: remove parentheses to default-initialize a variable
5 | int foo();
| ^~
| --
<source>:5:10: note: or replace parentheses with braces to value-initialize a variable
<source>:6:7: error: assignment of function 'int foo()'
6 | foo = 1;
| ~~~~^~~
<source>:7:10: error: invalid conversion from 'int (*)()' to 'int' [-fpermissive]
7 | return foo;
| ^~~
| |
| int (*)()
Compiler returned: 1
Notably, this doesn’t happen when explicit value initialization is performed, since C++ understands that a function declaration can’t take a prvalue as a parameter type. So this is perfectly fine, and produces no warnings whatsoever:
int main(void) {
    int foo(1);
    foo = 2;
    return foo;
}
This ambiguity is annoying, but has a limited blast radius in this form: int foo() is not an idiomatic way to default-initialize a variable, and so novice C++ programmers can simply be taught to avoid it (and prefer either int foo{} or simply int foo = 0).
Unfortunately, it doesn’t end there (if it did, it wouldn’t be all that vexing!). Consider the following:
struct Widget;

struct WidgetFrobulator {
    WidgetFrobulator(Widget w);
    Widget *get();
};

int main(void) {
    WidgetFrobulator frob(Widget());
    frob.get();
    return 0;
}
Here again, there are two possible semantics:
- frob is a WidgetFrobulator, value-initialized with an anonymous instance of Widget
- frob is a function that takes a single anonymous parameter[2] and returns a WidgetFrobulator
And once again, C++ chooses to preserve uniform declaration:
<source>: In function 'int main()':
<source>:9:24: warning: parentheses were disambiguated as a function declaration [-Wvexing-parse]
9 | WidgetFrobulator frob(Widget());
| ^~~~~~~~~~
<source>:9:24: note: replace parentheses with braces to declare a variable
9 | WidgetFrobulator frob(Widget());
| ^~~~~~~~~~
| -
| { -
| }
<source>:10:8: error: request for member 'get' in 'frob', which is of non-class type 'WidgetFrobulator(Widget (*)())'
10 | frob.get();
| ^~~
Compiler returned: 1
This is a much more annoying case than the first: a moderately experienced C++ programmer could easily write the code above, expecting a normal initialization and method call but getting an inscrutable "request for member ... of non-class type" error instead.
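This parse is vexing for two reasons:
- It can’t be resolved in a standards-compliant way: uniform declaration requires the function-declaration reading, whatever the surrounding context suggests[3].
- It’s almost never what the programmer meant: in nearly every case they intended foo or frob as a definition (plus initialization) instead.

At a high level, the data model for Python packages looks something like this: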
Project -> <Release>* -> <Distribution>*
In English: a project can have zero or more releases, each of which can, in turn, have zero or more distributions.
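As a rough illustration, that hierarchy can be written down as three nested record types. This is only a sketch: the class and field names below are made up for this post and don’t correspond to any real packaging library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One installable file for a release (hypothetical model)."""
    filename: str


@dataclass
class Release:
    """One version of a project, with zero or more distributions."""
    version: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Project:
    """One name in the packaging namespace, with zero or more releases."""
    name: str
    releases: List[Release] = field(default_factory=list)


project = Project(
    name="pip-audit",
    releases=[
        Release(
            version="2.2.1",
            distributions=[
                Distribution("pip-audit-2.2.1.tar.gz"),
                Distribution("pip_audit-2.2.1-py3-none-any.whl"),
            ],
        )
    ],
)

for release in project.releases:
    for dist in release.distributions:
        print(project.name, release.version, dist.filename)
```

Both lists default to empty, which mirrors the “zero or more” in the diagram: a project with no releases, or a release with no distributions, is still a valid value.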
Projects correspond to human-readable identifiers in the global Python packaging namespace: requests, cryptography, &c.
Releases correspond to the human notion of “versions”: the specific set of features, changes, bugfixes, &c that make up a discrete copy of the project at a point in time.
Finally, distributions correspond to the installable assets for a release. These contain the actual code for the release, and come in a surprisingly large number of formats[4]. For the purposes of this post, we’ll only discuss the two that PyPI currently allows and encourages[5]:
Source distributions (“sdists”) are archives that contain the source code for the release, plus a setup.py or other file that setuptools[6] knows how to interpret. Installing a source distribution effectively involves executing arbitrary code via setup.py to perform the build and installation steps, including all native extension compilation (if necessary) and dependency resolution. These sources of dynamism are why PyPI doesn’t know your project’s dependencies.
Built distributions (“bdists”) are also archives, but without a setup.py or other builder script. Instead, they describe their contents and dependencies via metadata, and are installed by copying their contents directly into the appropriate package directory[7]. There are many different kinds of bdists, but we only care about one for this post’s purpose: wheels. For the rest of the post, I’ll use bdist and wheel interchangeably.
As their name implies, bdists exist to solve the dual problems of client-side compilation and dependency resolution: client-side compilation is addressed by embedding the compiled extensions (shared objects, dylibs, whatever) directly into the bdist, and dependency resolution is addressed by precomputing the dependencies and embedding them as metadata.
Because bdists can contain platform-specific compiled extensions and compiled extensions are (usually) compiled against a specific version of Python’s C API[8], they can’t be installed arbitrarily. Instead, the package installer (e.g. pip) must check whether the host is compatible with each wheel distribution and, if none match, fall back to the source distribution for the release.
Package installers generally prefer wheels over source distributions, for all of the reasons stated above (and as an added plus, they’re generally much faster to install).
However, to actually choose a wheel over its corresponding source distribution for a release, a package installer must go through a search process. Roughly[9]:
1. Enumerate all compatible releases for the requested project. For example, pip install fakepackage>=8 will search each of 8.1.1, 8.1.0, 8.0.1, &c, as required, in order by semantic version.
2. In order, for each compatible release: enumerate all wheels for the release, filtering by:
   - Python version, as given by Requires-Python; wheels that aren’t compatible with the current Python version are filtered out.
   - Python tag, e.g. pp for PyPy or cp for CPython; wheels that aren’t compatible with the current implementation (or aren’t py, i.e. implementation-agnostic) are filtered out.
   - ABI tag, e.g. cp37 to indicate compatibility with the CPython 3.7 ABI; wheels that aren’t ABI compatible with the current implementation (or aren’t abi3, i.e. compatible with the CPython 3 universal ABI) are filtered out.
3. If any wheels remain after filtering, all are considered compatible with the current host and can be chosen for installation[10]. If no wheels remain, the source distribution is attempted, if present. If no sdist is present and each release fails all steps above (or any sdist installation fails), resolution is impossible.
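The tag-filtering part of that search can be sketched in a few lines of standard-library Python. This is a simplified illustration, not how pip does it: the host’s supported tags are hard-coded here as an assumption (a real installer derives them from the running interpreter, e.g. via packaging.tags.sys_tags()), the fakepackage filenames are hypothetical, and the Requires-Python check is left out entirely.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Tag = Tuple[str, str, str]  # (python tag, ABI tag, platform tag)


def wheel_tags(filename: str) -> Set[Tag]:
    """Expand the tag triple(s) at the end of a wheel filename."""
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are always the tags, even
    # when an optional build tag appears earlier in the name.
    python, abi, platform = stem.split("-")[-3:]
    # Each field may be a dot-separated "compressed" set, e.g. py2.py3.
    return set(product(python.split("."), abi.split("."), platform.split(".")))


def compatible_wheels(filenames: Iterable[str], supported: Set[Tag]) -> List[str]:
    """Keep only the wheels with at least one tag the host supports."""
    return [
        name
        for name in filenames
        if name.endswith(".whl") and wheel_tags(name) & supported
    ]


# Assumed host for illustration: CPython 3.7 on 64-bit Windows.
host_tags = {
    ("cp37", "cp37m", "win_amd64"),
    ("cp37", "abi3", "win_amd64"),
    ("py3", "none", "any"),
}

# Hypothetical distributions for a single release.
candidates = [
    "fakepackage-8.1.1-cp27-none-win_amd64.whl",
    "fakepackage-8.1.1-py3-none-any.whl",
    "fakepackage-8.1.1.tar.gz",
]

remaining = compatible_wheels(candidates, host_tags)
print(remaining)  # ['fakepackage-8.1.1-py3-none-any.whl']
```

If remaining came back empty, this is the point where an installer would fall back to fakepackage-8.1.1.tar.gz.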
This is all pretty complicated! But wait, there’s more: we don’t want to have to download and unpack each candidate distribution in order to filter (or select) it; that would require a lot of time and produce a lot of unnecessary traffic (and therefore hosting burden).
Consequently: PEP 503 specifies the “Simple Repository API”, which provides a basic index that Python package installers can depend on. That index is essentially a big HTML page, with smaller HTML pages for each individual package. Each page contains a list of links (just <a href=...></a>), one per distribution.
Here’s an excerpt from pip-audit’s simple index:
<a href="https://files.pythonhosted.org/packages/c3/1f/LONG-HASH/pip-audit-2.1.0.tar.gz#sha256=LONG-HASH" data-requires-python=">=3.7">pip-audit-2.1.0.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/f6/fe/LONG-HASH/pip_audit-2.1.0-py3-none-any.whl#sha256=LONG-HASH" data-requires-python=">=3.7">pip_audit-2.1.0-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/75/1c/LONG-HASH/pip-audit-2.1.1.tar.gz#sha256=LONG-HASH" data-requires-python=">=3.7">pip-audit-2.1.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/5e/40/LONG-HASH/pip_audit-2.1.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python=">=3.7">pip_audit-2.1.1-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/1b/07/LONG-HASH/pip-audit-2.2.0.tar.gz#sha256=LONG-HASH" data-requires-python=">=3.7">pip-audit-2.2.0.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/7a/94/LONG-HASH/pip_audit-2.2.0-py3-none-any.whl#sha256=LONG-HASH" data-requires-python=">=3.7">pip_audit-2.2.0-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/98/d9/LONG-HASH/pip-audit-2.2.1.tar.gz#sha256=LONG-HASH" data-requires-python=">=3.7">pip-audit-2.2.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/cc/43/LONG-HASH/pip_audit-2.2.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python=">=3.7">pip_audit-2.2.1-py3-none-any.whl</a><br/>
This is the final missing piece: Python package installers fetch these index pages for each package requested. All distributions across all versions are included in the same list, and are disambiguated by their filenames, which are rendered as the text of each link in the index.
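To make that concrete, here is a minimal sketch of pulling the filenames (and the data-requires-python attribute) out of such a page using only the standard library. The SimpleIndexParser class is a name invented for this example, and the page is pasted in as a string rather than fetched; real installers are considerably more careful.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class SimpleIndexParser(HTMLParser):
    """Collect (filename, requires-python) pairs from a simple index page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, Optional[str]]] = []
        self._requires_python: Optional[str] = None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._requires_python = dict(attrs).get("data-requires-python")

    def handle_data(self, data):
        # The text of each link is the distribution's filename.
        if self._in_link and data.strip():
            self.links.append((data.strip(), self._requires_python))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


page = """
<a href="https://files.pythonhosted.org/packages/98/d9/LONG-HASH/pip-audit-2.2.1.tar.gz#sha256=LONG-HASH" data-requires-python=">=3.7">pip-audit-2.2.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/cc/43/LONG-HASH/pip_audit-2.2.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python=">=3.7">pip_audit-2.2.1-py3-none-any.whl</a><br/>
"""

parser = SimpleIndexParser()
parser.feed(page)
for filename, requires_python in parser.links:
    print(filename, requires_python)
```

Note what comes back: a filename and a Python version constraint, but no separate name or version field. Everything else has to be recovered from the filename itself.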
Using the excerpts above, we can clearly see how source distributions and wheels have separate filename conventions:
pip-audit-2.0.0.tar.gz
pip_audit-2.0.0-py3-none-any.whl
pip-audit-2.1.0.tar.gz
pip_audit-2.1.0-py3-none-any.whl
pip-audit-2.1.1.tar.gz
pip_audit-2.1.1-py3-none-any.whl
pip-audit-2.2.0.tar.gz
pip_audit-2.2.0-py3-none-any.whl
pip-audit-2.2.1.tar.gz
pip_audit-2.2.1-py3-none-any.whl
You’ll observe that there’s a decent amount of structure in the wheel filenames (the ones ending with .whl), while the source distribution names look a little…sparse. This is where our troubles begin.
To recap: Python package installers go through a resolution process to determine whether (and which) wheel to install. If they can’t find a wheel, they fall back on a source distribution. All of this is done by parsing filenames embedded in links in a PEP 503-standardized simple index hosted by a provider like PyPI.
For wheels, these filenames are standardized: per PEP 427, every wheel filename is of the form:
{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
…where python tag, &c were described earlier.
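Because the number of dash-separated fields is fixed (five, or six with a build tag), a wheel filename can be split mechanically. The sketch below shows the idea with a hypothetical split_wheel_filename helper; it does no validation of the individual fields, and the build-tagged fakepackage filename is made up.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def split_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag present.
        dist, version, python, abi, platform = parts
        return WheelName(dist, version, None, python, abi, platform)
    if len(parts) == 6:
        dist, version, build, python, abi, platform = parts
        return WheelName(dist, version, build, python, abi, platform)
    raise ValueError(f"unexpected number of components: {filename}")


print(split_wheel_filename("pip_audit-2.2.1-py3-none-any.whl"))
print(split_wheel_filename("fakepackage-8.1.1-1-cp37-abi3-win_amd64.whl"))
```

This only works because wheel filenames escape dashes inside the distribution name (pip_audit, not pip-audit), so every remaining dash is a field separator.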
As we saw earlier, source distributions have a much simpler filename form:
{distribution}-{version}.tar.gz
Source distribution filenames are notably not standardized[11][12]; the closest thing to a standard is this normative guidance in the PyPA guidelines:
The file name of a sdist is not currently standardized, although the de facto form is {name}-{version}.tar.gz, where {name} is the canonicalized form of the project name (see PEP 503 for the canonicalization rules) with - characters replaced with _[13], and {version} is the canonicalized form of the project version (see Version specifiers).
(Their formatting; footnote mine.)
Given the description above, this is how pypa/packaging parses an sdist filename:
def parse_sdist_filename(filename: str) -> Tuple[NormalizedName, Version]:
    if filename.endswith(".tar.gz"):
        file_stem = filename[: -len(".tar.gz")]
    elif filename.endswith(".zip"):
        file_stem = filename[: -len(".zip")]
    else:
        raise InvalidSdistFilename(
            f"Invalid sdist filename (extension must be '.tar.gz' or '.zip'):"
            f" {filename}"
        )

    # We are requiring a PEP 440 version, which cannot contain dashes,
    # so we split on the last dash.
    name_part, sep, version_part = file_stem.rpartition("-")
    if not sep:
        raise InvalidSdistFilename(f"Invalid sdist filename: {filename}")

    name = canonicalize_name(name_part)
    version = Version(version_part)
    return (name, version)
Note the comment: a (normalized) PEP 440 version cannot contain dashes, so splitting on the last dash in order to split the {name} and {version} components is perfectly fine.
Here is our vexing parse: there are lots of source distributions with nonstandard versions. PEP 440 wasn’t accepted until August 2014 and wasn’t enforced on PyPI until some point after that[14].
For example, here are the sdists and wheels for cffi==1.0.2-2:
cffi-1.0.2-2-cp26-none-win32.whl
cffi-1.0.2-2-cp26-none-win_amd64.whl
cffi-1.0.2-2-cp27-none-win32.whl
cffi-1.0.2-2-cp27-none-win_amd64.whl
cffi-1.0.2-2-cp32-none-win32.whl
cffi-1.0.2-2-cp32-none-win_amd64.whl
cffi-1.0.2-2-cp33-none-win32.whl
cffi-1.0.2-2-cp33-none-win_amd64.whl
cffi-1.0.2-2-cp34-none-win32.whl
cffi-1.0.2-2-cp34-none-win_amd64.whl
cffi-1.0.2-2.tar.gz
The wheel filenames parse just fine, since -2 is a valid build tag:
>>> from packaging.utils import parse_wheel_filename
>>> parse_wheel_filename("cffi-1.0.2-2-cp27-none-win_amd64.whl")
('cffi', <Version('1.0.2')>, (2, ''), frozenset({<cp27-none-win_amd64 @ 140706518250944>}))
But the sdist doesn’t do too hot:
>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi-1-0-2', <Version('2')>)
Uh oh! packaging (the library that underpins pip and PyPI) thinks that we’re dealing with a package named cffi-1-0-2 version 2, rather than cffi version 1.0.2-2.
To make things more confusing: cffi-1-0-2 is a perfectly valid (and normalized!) distribution name, meaning that it would be perfectly valid to publish a package named cffi-1-0-2 on PyPI.
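The ambiguity is easy to see by listing every place the filename stem could be split. The sketch below uses only the standard library; normalize_name is a stand-in for the PEP 503 normalization rule, and candidate_splits is a helper invented for this example.

```python
import re
from typing import List, Tuple


def normalize_name(name: str) -> str:
    """Stand-in for PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def candidate_splits(stem: str) -> List[Tuple[str, str]]:
    """Return every (name, version) reading, one per dash in the stem."""
    return [
        (normalize_name(stem[:index]), stem[index + 1 :])
        for index, char in enumerate(stem)
        if char == "-"
    ]


print(candidate_splits("cffi-1.0.2-2"))
# [('cffi', '1.0.2-2'), ('cffi-1-0-2', '2')]

print(candidate_splits("pip-audit-2.2.1"))
# [('pip', 'audit-2.2.1'), ('pip-audit', '2.2.1')]
```

For cffi-1.0.2-2 the intended reading is the first split; for pip-audit-2.2.1 it’s the last. Neither “split on the first dash” nor “split on the last dash” is right for both, and nothing in the filename says which one applies.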
Is this actually a problem? Well, it depends.
On one hand, recent versions of pip (I tested with 22.0.4) will outright refuse to install or even download cffi==1.0.2-2[15], because they can’t make sense of the mismatch between its version and metadata:
$ pip install cffi==1.0.2-2
Collecting cffi==1.0.2-2
Using cached cffi-1.0.2-2.tar.gz (317 kB)
Preparing metadata (setup.py) ... done
WARNING: Requested cffi==1.0.2-2 from https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32, but installing version 1.0.2
Discarding https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32 (from https://pypi.org/simple/cffi/): Requested cffi==1.0.2-2 from https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32 has inconsistent version: filename has '1.0.2.post2', but metadata has '1.0.2'
ERROR: Could not find a version that satisfies the requirement cffi==1.0.2-2 (from versions: 0.1, 0.2, 0.2.1, 0.3, 0.4, 0.4.1, 0.4.2, 0.5, 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.9.0, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2.post2, 1.0.3, 1.1.0, 1.1.1, 1.1.2, 1.2.0.post1, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.7.0, 1.8.2, 1.8.3, 1.9.0, 1.9.1, 1.10.0, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.11.4, 1.11.5, 1.12.0, 1.12.1, 1.12.2, 1.12.3, 1.13.0, 1.13.1, 1.13.2, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0)
ERROR: No matching distribution found for cffi==1.0.2-2
I haven’t fully root-caused this, but I’m guessing it’s because pip makes a round-trip through packaging.version.parse in an attempt to produce a PEP 440-compliant version from a “legacy” version:
>>> from packaging.version import parse
>>> parse("1.0.2-2")
<Version('1.0.2.post2')>
Is it a problem anywhere else?
I only know about this because I ran into it with pip-audit, a tool that I helped develop. We have our own dependency resolver (which uses resolvelib internally), and a user reported an incorrect resolution because we had assumed that parse_sdist_filename would work on all source distributions.
The fix was thankfully simple: we ignore any candidates that don’t match the expected distribution name.
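In spirit, that kind of filter looks something like the sketch below. To be clear, this is not pip-audit’s actual code: last_dash_split is a standard-library stand-in for parse_sdist_filename, matching_sdists is a hypothetical helper, and the filenames for cffi 1.0.1 and 1.0.3 are assumed to follow the usual form.

```python
import re
from typing import Iterable, Iterator, Tuple


def normalize_name(name: str) -> str:
    """Stand-in for PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def last_dash_split(filename: str) -> Tuple[str, str]:
    """Stand-in for parse_sdist_filename: split the stem on its last dash."""
    for ext in (".tar.gz", ".zip"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"not an sdist filename: {filename}")
    name, sep, version = stem.rpartition("-")
    if not sep:
        raise ValueError(f"no version in sdist filename: {filename}")
    return normalize_name(name), version


def matching_sdists(project: str, filenames: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (filename, version) only where the parsed name is the project."""
    expected = normalize_name(project)
    for filename in filenames:
        name, version = last_dash_split(filename)
        if name != expected:
            # The parse produced some other name: skip it rather than trust it.
            continue
        yield filename, version


sdists = ["cffi-1.0.1.tar.gz", "cffi-1.0.2-2.tar.gz", "cffi-1.0.3.tar.gz"]
print(list(matching_sdists("cffi", sdists)))
# [('cffi-1.0.1.tar.gz', '1.0.1'), ('cffi-1.0.3.tar.gz', '1.0.3')]
```

The trade-off is that cffi-1.0.2-2.tar.gz is silently dropped rather than resolved, which is safe but means that release can’t be selected from its sdist.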
To summarize: when wheel candidates have been exhausted, pip and other Python package installers must fall back on source distributions. To do this correctly, they must parse the version out of the source distribution filename.
At the same time, it is incorrect to assume that every source distribution on PyPI (or any other host) has an unambiguously parseable filename. The end result: pip and other tools either fail to install some (theoretically) valid packages, or they must rely on side-channel information to complete the parse. There is no “in-channel” solution; hence “vexing.”
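One obvious piece of side-channel information is the project name that was requested in the first place. Here’s a sketch of how that could settle the parse; version_for_project is a hypothetical helper written for this post, not something pip or packaging provides, and it returns the raw version string without any PEP 440 handling.

```python
import re
from typing import Optional


def normalize_name(name: str) -> str:
    """Stand-in for PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def version_for_project(filename: str, project: str) -> Optional[str]:
    """Recover the version from an sdist filename, given the project name."""
    for ext in (".tar.gz", ".zip"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        return None
    expected = normalize_name(project)
    # Try each dash as the name/version boundary and keep the one whose
    # left-hand side normalizes to the project that was asked for.
    for index, char in enumerate(stem):
        if char == "-" and normalize_name(stem[:index]) == expected:
            return stem[index + 1 :]
    return None


print(version_for_project("cffi-1.0.2-2.tar.gz", "cffi"))         # 1.0.2-2
print(version_for_project("cffi-1.0.2-2.tar.gz", "cffi-1-0-2"))   # 2
print(version_for_project("pip-audit-2.2.1.tar.gz", "pip-audit")) # 2.2.1
```

The same filename yields two different versions depending on which project is being asked about, which is exactly why the filename alone can’t decide.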
As mentioned at the very beginning: this is not a significant problem for ordinary package consumers in the Python ecosystem in 2022. It primarily rears its head on old packages with “legacy” versions or versions that weren’t properly normalized, and pip itself “circumvents” the problem by simply failing to match the version (which is arguably a bug, but not a particularly important one)[16].
What’s more: only source distribution filenames are ambiguous, and wheels are nearly universal for the top packages on PyPI. Modern versions of pip should always prefer wheels over source distributions, and you can completely disable source distributions during resolution with pip install --only-binary=:all:.
It’s also not a bug in any of the current Python packaging code: utilities like packaging.utils.parse_sdist_filename are well-defined with respect to their expected inputs, and only fail because some possible inputs are not well-defined.
Instead, I like to think of it as a little parable on how easy it is to introduce parsing ambiguities (and therefore failure modes) in contexts where parsing might not strictly be necessary. For example: the Simple Index format encourages consumers to parse the filenames, but also supplies some metadata as hidden attributes on each link (for example, Requires-Python becomes a data-requires-python attribute). This pattern could be extended to provide unambiguous fields for each component of the sdist (and wheel!) filenames; maybe I’ll write a PEP to propose that.
And as a closing note: with any luck, none of this will matter soon! If accepted, PEP 691 will standardize PyPI’s JSON API, allowing pip to query it instead. The JSON API is much more structured, and requires no filename parsing!
[1] Not to be confused with uniform initialization, which was introduced with C++11 and “solves” the most vexing parse by offering an unambiguous initialization syntax.
[2] The fact that Widget() gets evaluated to an anonymous function parameter is one of the single most confusing things I’ve ever encountered in C++. It happens because of type decay: Widget() decays to Widget <anonymous>(), i.e. a function (pointer) that takes no parameter and returns a Widget. Terrible!
[3] This is solved in most other languages: Rust, for example, permits only a single declaration of an identifier in the entire program and has non-uniform syntax for declaring different types: fn ... for function definitions (which are always fully defined, not forward declared) and let (or const) for introducing a binding.
[4] Among them: bdist_rpm (builds an RPM-compatible distribution), bdist_dmg (for macOS-style .dmg installers), and so forth. Most of these are deprecated or will be shortly.
[5] I’m not going to bother talking about eggs, since they’re deprecated on the packager side and strongly discouraged.
[6] Or the legacy distutils, although hopefully not for long.
[7] It’s not actually that simple. But we can pretend that it is, particularly compared to the arbitrary behavior permitted by source distributions.
[8] Among many other platform specificities: the kind and version of libc being run, the OS itself, &c.
[9] This is a very rough sketch of how distribution selection works. There are many Python package installers (and many versions of pip), and many tunable knobs for resolution and distribution selection. This is only meant to be the “intuitive” search, not necessarily reflective of how pip install ... will actually resolve a distribution on your machine.
[10] I’m not actually sure which is chosen consistently, if any. I would expect it to be the “most specific” wheel (i.e. closest ABI, closest Python version, &c) but I couldn’t find a guarantee for (or source code indicating) that behavior.
[11] They’re very weakly standardized by PEP 517, which only asserts the current format as a historical format.
[12] That being said, PEP 625 is currently on the standards track and will solidify a filename format that roughly resembles the current one (with .sdist instead of .tar.gz and with wheel-style normalization).
[13] This doesn’t appear to be done in practice. For example, pip-audit==2.2.1 on PyPI has an sdist of pip-audit-2.2.1.tar.gz, not pip_audit-2.2.1.tar.gz. There’s been some discussion on that discrepancy in this thread.
[14] I’m not sure when. Please let me know if you know!
[15] It works with --use-deprecated=legacy-resolver, but that can’t be relied on in the future.
[16] I was also told that pip will disambiguate the distribution name from the version since it already knows the distribution name as part of pip install name, but I wasn’t able to actually reproduce this (since so many of the packages with “legacy” versions don’t pass metadata validation at all). But that doesn’t mean it isn’t correct; I just couldn’t find an example on PyPI.