ENOSUCHBLOG

Programming, philosophy, pedaling.


A most vexing parse, but for Python packaging

May 9, 2022     Tags: devblog, programming, python    

Preword

This post is about a minor parsing ambiguity in the Python packaging ecosystem, one that cannot be resolved without additional context that isn’t captured by the packaging standards.

A great deal of emphasis is on minor: this ambiguity appears in relatively old package distributions, and is addressed by both package name normalization (PEP 426) and version normalization (PEP 440), each of which builds off of earlier “canonical” specifications of distribution names and versions.

In other words: it’s not really a problem for much of the ecosystem! But if you’re like me and you love an ambiguously decidable parse, you might find it interesting.


Background

If you’re a C++ programmer, you might have heard of the most vexing parse.

For those who haven’t: it’s a syntactic ambiguity in C and C++ grammars, one that can’t be solved (in a standards-compliant manner) with parser context. It stems from C and C++’s declaration syntax:

// declares a function `foo`, taking no arguments and returning an integer.
int foo();

// declares a function `bar`, taking no arguments and returning a std::string.
std::string bar();

C and C++’s declaration semantics are uniform[1]: there’s no difference between declaring variables and declaring functions, except in the type bound to the identifier.

This has a notable consequence: both variables and functions can be declared anywhere that either is valid, including within the bodies of other functions.

The simplest version of the most vexing parse derives directly from this rule. Consider the following C++:

int main(void) {
  int foo();
  foo = 1;
  return foo;
}

What is foo here? It could be either:

  1. A variable foo of type int, with value initialization (zero for int, but possibly something nontrivial for a non-primitive type)
  2. A function foo of type int (void)

In context, a programmer who wrote this almost certainly intended the first interpretation: it’s unusual to declare a function within the body of another function, whereas value initialization is common. But C++ is unforgiving: the second interpretation is required to keep uniform declaration truly uniform, and so we get:

<source>: In function 'int main()':
<source>:5:10: warning: empty parentheses were disambiguated as a function declaration [-Wvexing-parse]
    5 |   int foo();
      |          ^~
<source>:5:10: note: remove parentheses to default-initialize a variable
    5 |   int foo();
      |          ^~
      |          --
<source>:5:10: note: or replace parentheses with braces to value-initialize a variable
<source>:6:7: error: assignment of function 'int foo()'
    6 |   foo = 1;
      |   ~~~~^~~
<source>:7:10: error: invalid conversion from 'int (*)()' to 'int' [-fpermissive]
    7 |   return foo;
      |          ^~~
      |          |
      |          int (*)()
Compiler returned: 1

Notably, this doesn’t happen when the variable is explicitly initialized with a value, since C++ understands that a function declaration can’t take a prvalue (like the literal 1) in place of a parameter type. So this is perfectly fine, and produces no warnings whatsoever:

int main(void) {
  int foo(1);
  foo = 2;
  return foo;
}

This ambiguity is annoying, but has a limited blast radius in this form: int foo() is not an idiomatic way to default-initialize a variable, and so novice C++ programmers can simply be taught to avoid it (and prefer either int foo{} or simply int foo = 0).

Unfortunately, it doesn’t end there (if it did, it wouldn’t be all that vexing!). Consider the following:

struct Widget;

struct WidgetFrobulator {
  WidgetFrobulator(Widget w);
  Widget *get();
};

int main(void) {
  WidgetFrobulator frob(Widget());
  frob.get();
  return 0;
}

Here again, there are two possible semantics:

  1. frob is a WidgetFrobulator, value-initialized with an anonymous instance of Widget
  2. frob is a function that takes a single anonymous parameter[2] and returns a WidgetFrobulator

And once again, C++ chooses to preserve uniform declaration:

<source>: In function 'int main()':
<source>:9:24: warning: parentheses were disambiguated as a function declaration [-Wvexing-parse]
    9 |   WidgetFrobulator frob(Widget());
      |                        ^~~~~~~~~~
<source>:9:24: note: replace parentheses with braces to declare a variable
    9 |   WidgetFrobulator frob(Widget());
      |                        ^~~~~~~~~~
      |                        -
      |                        {        -
      |                                 }
<source>:10:8: error: request for member 'get' in 'frob', which is of non-class type 'WidgetFrobulator(Widget (*)())'
   10 |   frob.get();
      |        ^~~
Compiler returned: 1

This is a much more annoying case than the first: a moderately experienced C++ programmer could easily write the code above, expecting a normal initialization and method call but getting an inscrutable request for member ... of non-class type error instead.

This parse is vexing for two reasons:

  1. C++ lacks the semantics needed to perform context-dependent disambiguation here: multiple declarations are perfectly permissible, so C++ cannot use the presence of a previous (potentially global) function declaration to treat foo or frob as a definition (plus initialization) instead.
  2. C++’s uniform declaration syntax means that there’s no disambiguation for function declarations, which almost never make sense inside the scope of another function (versus variable declarations, which are routinely declared within other scopes). Instead, C++ simply declares by fiat that anything that looks like a declaration must be a declaration[3].

Python packaging: a primer

At a high level, the data model for Python packages looks something like this:

Project -> <Release>* -> <Distribution>*

In English: a project can have zero or more releases, each of which can, in turn, have zero or more distributions.
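
As a rough sketch of that hierarchy, the model can be written as a few dataclasses. The class and field names here are my own illustration, not anything defined by a packaging standard or library:

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    # An installable asset, identified here only by its filename.
    filename: str


@dataclass
class Release:
    # A single version of a project, e.g. "2.2.1".
    version: str
    distributions: list[Distribution] = field(default_factory=list)


@dataclass
class Project:
    # A name in the global packaging namespace, e.g. "pip-audit".
    name: str
    releases: list[Release] = field(default_factory=list)


# One project, with one release, with two distributions.
project = Project(
    name="pip-audit",
    releases=[
        Release(
            version="2.2.1",
            distributions=[
                Distribution("pip-audit-2.2.1.tar.gz"),
                Distribution("pip_audit-2.2.1-py3-none-any.whl"),
            ],
        )
    ],
)
```

Both lists default to empty, which matches the "zero or more" in the diagram.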

Projects correspond to human-readable identifiers in the global Python packaging namespace: requests, cryptography, &c.

Releases correspond to the human notion of “versions”: the specific set of features, changes, bugfixes, &c that make up a discrete copy of the project at a point in time.

Finally, distributions correspond to the installable assets for a release. These contain the actual code for the release, and come in a surprisingly large number of formats[4]. For the purposes of this post, we’ll only discuss the two that PyPI currently allows and encourages[5]: source distributions (“sdists”) and wheels.

Package installers generally prefer wheels over source distributions, for all of the reasons stated above (and as an added plus, they’re generally much faster to install).

However, to actually choose a wheel over its corresponding source distribution for a release, a package installer must go through a search process. Roughly[9]:

  1. Enumerate all compatible releases for the requested project. For example, pip install fakepackage>=8 will search each of 8.1.1, 8.1.0, 8.0.1, &c, as required, in order by semantic version.

  2. In order, for each compatible release: enumerate all wheels for the release, filtering by:

    1. Platform: Wheels for incompatible platforms (e.g. Windows when installing on Linux) are filtered out.
    2. Python version: Wheels can specify a Python version (or range) via Requires-Python; wheels that aren’t compatible with the current Python version are filtered out.
    3. Implementation: Wheels can specify a specific Python interpreter, such as pp for PyPy or cp for CPython; wheels that aren’t compatible with the current implementation (or aren’t py, i.e. implementation-agnostic) are filtered out.
    4. ABI: Wheels can specify an ABI compatibility level, e.g. cp37 to indicate compatibility with the CPython 3.7 ABI; wheels that aren’t ABI compatible with the current implementation (or aren’t abi3, i.e. compatible with the CPython 3 universal ABI) are filtered out.
  3. If any wheels remain after filtering, all are considered compatible with the current host and can be chosen for installation[10]. If no wheels remain, the source distribution is attempted, if present. If no sdist is present and each release fails all steps above (or any sdist installation fails), resolution is impossible.
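
The per-release part of that search can be sketched in a few lines. This is a deliberately simplified illustration and not how pip actually does it: the function name and the `supported` set are hypothetical, the host's compatible tags are passed in by hand, and compressed tag sets (like py2.py3) are ignored.

```python
from typing import Optional

# A (python tag, ABI tag, platform tag) triple, e.g. ("py3", "none", "any").
Tag = tuple[str, str, str]


def pick_distribution(filenames: list[str], supported: set[Tag]) -> Optional[str]:
    """Return a compatible wheel if one exists, else the sdist, else None."""
    sdist = None
    for filename in filenames:
        if filename.endswith(".whl"):
            # The last three dash-separated fields of a wheel filename are
            # its python, ABI, and platform tags.
            python, abi, platform = filename[: -len(".whl")].split("-")[-3:]
            if (python, abi, platform) in supported:
                return filename
        elif filename.endswith(".tar.gz"):
            sdist = filename
    # No compatible wheel: fall back on the source distribution, if any.
    return sdist


files = ["pip-audit-2.2.1.tar.gz", "pip_audit-2.2.1-py3-none-any.whl"]
chosen = pick_distribution(files, {("py3", "none", "any")})
```

An installer would run something like this for each compatible release in turn, stopping at the first release that yields an installable distribution.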

This is all pretty complicated! But wait, there’s more: we don’t want to have to download and unpack each candidate distribution in order to filter (or select) it; that would require a lot of time and produce a lot of unnecessary traffic (and therefore hosting burden).

Consequently: PEP 503 specifies the “Simple Repository API”, which provides a basic index that Python package installers can depend on. That index is essentially a big HTML page, with smaller HTML pages for each individual package. Each page contains a list of links (just <a href=...></a>), one per distribution.

Here’s an excerpt from pip-audit’s simple index:

<a href="https://files.pythonhosted.org/packages/c3/1f/LONG-HASH/pip-audit-2.1.0.tar.gz#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip-audit-2.1.0.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/f6/fe/LONG-HASH/pip_audit-2.1.0-py3-none-any.whl#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip_audit-2.1.0-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/75/1c/LONG-HASH/pip-audit-2.1.1.tar.gz#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip-audit-2.1.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/5e/40/LONG-HASH/pip_audit-2.1.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip_audit-2.1.1-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/1b/07/LONG-HASH/pip-audit-2.2.0.tar.gz#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip-audit-2.2.0.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/7a/94/LONG-HASH/pip_audit-2.2.0-py3-none-any.whl#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip_audit-2.2.0-py3-none-any.whl</a><br/>
<a href="https://files.pythonhosted.org/packages/98/d9/LONG-HASH/pip-audit-2.2.1.tar.gz#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip-audit-2.2.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/cc/43/LONG-HASH/pip_audit-2.2.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip_audit-2.2.1-py3-none-any.whl</a><br/>

This is the final missing piece: Python package installers fetch these index pages for each package requested. All distributions across all versions are included in the same list, and are disambiguated by their filenames, which are rendered as the text of each link in the index.
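
To make that concrete, here is a minimal sketch of pulling the filenames (and the data-requires-python attribute) out of such a page using only the standard library. The parser class is my own illustration rather than anything pip uses, and the page contents are two of the links from the excerpt above:

```python
from html.parser import HTMLParser
from typing import Optional


class SimpleIndexParser(HTMLParser):
    """Collects (filename, requires_python) pairs from a simple index page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, Optional[str]]] = []
        self._requires_python: Optional[str] = None
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            # Attribute values arrive unescaped, so "&gt;=3.7" becomes ">=3.7".
            self._requires_python = dict(attrs).get("data-requires-python")

    def handle_data(self, data):
        # The link text is the distribution's filename.
        if self._in_anchor and data.strip():
            self.links.append((data.strip(), self._requires_python))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False


page = """
<a href="https://files.pythonhosted.org/packages/98/d9/LONG-HASH/pip-audit-2.2.1.tar.gz#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip-audit-2.2.1.tar.gz</a><br/>
<a href="https://files.pythonhosted.org/packages/cc/43/LONG-HASH/pip_audit-2.2.1-py3-none-any.whl#sha256=LONG-HASH" data-requires-python="&gt;=3.7">pip_audit-2.2.1-py3-none-any.whl</a><br/>
"""

parser = SimpleIndexParser()
parser.feed(page)
parser.close()
```

After this runs, parser.links holds one entry per distribution. Everything else an installer wants to know (name, version, tags) still has to be parsed out of the filename itself.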

Using the excerpts above, we can clearly see how source distributions and wheels have separate filename conventions:

pip-audit-2.0.0.tar.gz
pip_audit-2.0.0-py3-none-any.whl
pip-audit-2.1.0.tar.gz
pip_audit-2.1.0-py3-none-any.whl
pip-audit-2.1.1.tar.gz
pip_audit-2.1.1-py3-none-any.whl
pip-audit-2.2.0.tar.gz
pip_audit-2.2.0-py3-none-any.whl
pip-audit-2.2.1.tar.gz
pip_audit-2.2.1-py3-none-any.whl

You’ll observe that there’s a decent amount of structure in the wheel filenames (the ones ending with .whl), while the source distribution names look a little…sparse. This is where our troubles begin.

The vexing parse

To recap: Python package installers go through a resolution process to determine whether (and which) wheel to install. If they can’t find a wheel, they fall back on a source distribution. All of this is done by parsing filenames embedded in links in a PEP 503-standardized simple index hosted by a provider like PyPI.

For wheels, these filenames are standardized: per PEP 427, every wheel filename is of the form:

{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl

…where python tag, &c were described earlier.
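
Because dashes inside each wheel filename component are escaped to underscores, the number of dash-separated fields alone tells us whether a build tag is present. A minimal sketch of that split follows. The names are hypothetical, and this skips the validation that a real parser like packaging.utils.parse_wheel_filename performs:

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def split_wheel_filename(filename: str) -> WheelName:
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag.
        distribution, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the python tag.
        distribution, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of fields: {filename}")
    return WheelName(distribution, version, build, python, abi, platform)


wheel = split_wheel_filename("pip_audit-2.2.1-py3-none-any.whl")
```

Here wheel.distribution is "pip_audit" and wheel.build is None. There is never a question about where the name ends and the version begins.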

As we saw earlier, source distributions have a much simpler filename form:

{distribution}-{version}.tar.gz

Source distribution filenames are notably not standardized[11][12]; the closest thing to a standard is this normative guidance in the PyPA guidelines:

The file name of a sdist is not currently standardized, although the de facto form is {name}-{version}.tar.gz, where {name} is the canonicalized form of the project name (see PEP 503 for the canonicalization rules) with - characters replaced with _[13], and {version} is the canonicalized form of the project version (see Version specifiers).

(Their formatting; footnote mine.)

Given the description above, this is how pypa/packaging parses an sdist filename:

def parse_sdist_filename(filename: str) -> Tuple[NormalizedName, Version]:
    if filename.endswith(".tar.gz"):
        file_stem = filename[: -len(".tar.gz")]
    elif filename.endswith(".zip"):
        file_stem = filename[: -len(".zip")]
    else:
        raise InvalidSdistFilename(
            f"Invalid sdist filename (extension must be '.tar.gz' or '.zip'):"
            f" {filename}"
        )

    # We are requiring a PEP 440 version, which cannot contain dashes,
    # so we split on the last dash.
    name_part, sep, version_part = file_stem.rpartition("-")
    if not sep:
        raise InvalidSdistFilename(f"Invalid sdist filename: {filename}")

    name = canonicalize_name(name_part)
    version = Version(version_part)
    return (name, version)

Note the comment: a (normalized) PEP 440 version cannot contain dashes, so splitting on the last dash in order to split the {name} and {version} components is perfectly fine.

Here is our vexing parse: there are lots of source distributions with nonstandard versions. PEP 440 wasn’t accepted until August 2014 and wasn’t enforced on PyPI until some point after that[14].

For example, here are the sdists and wheels for cffi==1.0.2-2:

cffi-1.0.2-2-cp26-none-win32.whl
cffi-1.0.2-2-cp26-none-win_amd64.whl
cffi-1.0.2-2-cp27-none-win32.whl
cffi-1.0.2-2-cp27-none-win_amd64.whl
cffi-1.0.2-2-cp32-none-win32.whl
cffi-1.0.2-2-cp32-none-win_amd64.whl
cffi-1.0.2-2-cp33-none-win32.whl
cffi-1.0.2-2-cp33-none-win_amd64.whl
cffi-1.0.2-2-cp34-none-win32.whl
cffi-1.0.2-2-cp34-none-win_amd64.whl
cffi-1.0.2-2.tar.gz

The wheel filenames parse just fine, since -2 is a valid build tag:

>>> from packaging.utils import parse_wheel_filename
>>> parse_wheel_filename("cffi-1.0.2-2-cp27-none-win_amd64.whl")
('cffi', <Version('1.0.2')>, (2, ''), frozenset({<cp27-none-win_amd64 @ 140706518250944>}))

But the sdist doesn’t do too hot:

>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi-1-0-2', <Version('2')>)

Uh oh! packaging (the library that underpins pip and PyPI) thinks that we’re dealing with a package named cffi-1-0-2 version 2, rather than cffi version 1.0.2-2.

To make things more confusing: cffi-1-0-2 is a perfectly valid (and normalized!) distribution name, meaning that it would be perfectly valid to publish a package named cffi-1-0-2 on PyPI.
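
To see the ambiguity directly, we can enumerate every way of cutting the filename at a dash. This is just an illustrative helper of my own, not part of packaging:

```python
def candidate_splits(filename: str) -> list[tuple[str, str]]:
    """List every (name, version) pair obtainable by splitting at one dash."""
    stem = filename[: -len(".tar.gz")]
    parts = stem.split("-")
    return [
        ("-".join(parts[:i]), "-".join(parts[i:]))
        for i in range(1, len(parts))
    ]


splits = candidate_splits("cffi-1.0.2-2.tar.gz")
# splits == [("cffi", "1.0.2-2"), ("cffi-1.0.2", "2")]
```

Both pairs are plausible on their own (the second name is shown before normalization). Nothing in the filename says which split the uploader meant, and splitting on the last dash always picks the second.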

Is this actually a problem? Well, it depends.

On one hand, recent versions of pip (I tested with 22.0.4) will outright refuse to install or even download cffi==1.0.2-2[15], because they can’t make sense of the mismatch between its version and metadata:

$ pip install cffi==1.0.2-2
Collecting cffi==1.0.2-2
  Using cached cffi-1.0.2-2.tar.gz (317 kB)
  Preparing metadata (setup.py) ... done
  WARNING: Requested cffi==1.0.2-2 from https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32, but installing version 1.0.2
Discarding https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32 (from https://pypi.org/simple/cffi/): Requested cffi==1.0.2-2 from https://files.pythonhosted.org/packages/ef/23/c6f7003ebb7b4b3fe4872f112b18ee181a3ec2b137e964093a8b35d4a5bd/cffi-1.0.2-2.tar.gz#sha256=6dc6ae05816e44c71094049321403fda1013013d68506f30914a59683a47fd32 has inconsistent version: filename has '1.0.2.post2', but metadata has '1.0.2'
ERROR: Could not find a version that satisfies the requirement cffi==1.0.2-2 (from versions: 0.1, 0.2, 0.2.1, 0.3, 0.4, 0.4.1, 0.4.2, 0.5, 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.9.0, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2.post2, 1.0.3, 1.1.0, 1.1.1, 1.1.2, 1.2.0.post1, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.7.0, 1.8.2, 1.8.3, 1.9.0, 1.9.1, 1.10.0, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.11.4, 1.11.5, 1.12.0, 1.12.1, 1.12.2, 1.12.3, 1.13.0, 1.13.1, 1.13.2, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0)
ERROR: No matching distribution found for cffi==1.0.2-2

I haven’t fully root-caused this, but I’m guessing it’s because pip makes a round-trip through packaging.version.parse in an attempt to produce a PEP 440-compliant version from a “legacy” version:

>>> from packaging.version import parse
>>> parse("1.0.2-2")
<Version('1.0.2.post2')>
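
That rewrite comes from PEP 440’s normalization rules, which treat a trailing -N as an implicit post-release. A tiny sketch of just that one rule follows. It is my own simplification, it handles only plain dotted release numbers, and it is not how packaging implements it:

```python
import re


def normalize_implicit_post(version: str) -> str:
    """Rewrite "X.Y.Z-N" as "X.Y.Z.postN"; leave anything else untouched."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)-(\d+)", version)
    if match is None:
        return version
    release, post = match.groups()
    return f"{release}.post{int(post)}"


normalized = normalize_implicit_post("1.0.2-2")
```

This gives "1.0.2.post2", the same version that pip reports for the filename in the error output above.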

Is it a problem anywhere else?

I only know about this because I ran into it with pip-audit, a tool that I helped develop. We have our own dependency resolver (which uses resolvelib internally), and a user reported an incorrect resolution because we had assumed that parse_sdist_filename would work on all source distributions. The fix was thankfully simple: we ignore any candidates that don’t match the expected distribution name.
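
In spirit, that fix looks something like the following. This is a standalone sketch and not pip-audit’s actual code: the function names are hypothetical, and the name normalization is the regex given in PEP 503.

```python
import re
from typing import Optional


def normalize(name: str) -> str:
    # PEP 503 name normalization: runs of "-", "_" and "." become a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def sdist_version_or_none(filename: str, expected_name: str) -> Optional[str]:
    """Split on the last dash, but discard the result if the name is wrong."""
    if not filename.endswith(".tar.gz"):
        return None
    stem = filename[: -len(".tar.gz")]
    name_part, sep, version_part = stem.rpartition("-")
    if not sep or normalize(name_part) != normalize(expected_name):
        # The parse disagrees with the project we asked for: skip it.
        return None
    return version_part


good = sdist_version_or_none("pip-audit-2.2.1.tar.gz", "pip_audit")
bad = sdist_version_or_none("cffi-1.0.2-2.tar.gz", "cffi")
```

Here good is "2.2.1" and bad is None. The ambiguous cffi sdist is dropped rather than misattributed to a project named cffi-1-0-2, at the cost of not being able to resolve that release at all.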

Wrapup

To summarize: when wheel candidates have been exhausted, pip and other Python package installers must fall back on source distributions. To do this correctly, they must parse the version out of the source distribution filename. At the same time, it is incorrect to assume that every source distribution on PyPI (or any other host) has an unambiguously parseable filename. The end result: pip and other tools either fail to install some (theoretically) valid packages, or they must rely on side-channel information to complete the parse. There is no “in-channel” solution; hence “vexing.”

As mentioned at the very beginning: this is not a significant problem for ordinary package consumers in the Python ecosystem in 2022. It primarily rears its head on old packages with “legacy” versions or versions that weren’t properly normalized, and pip itself “circumvents” the problem by simply failing to match the version (which is arguably a bug, but not a particularly important one)[16].

What’s more: only source distribution filenames are ambiguous, and wheels are nearly universal for the top packages on PyPI. Modern versions of pip should always prefer wheels over source distributions, and you can completely disable source distributions during resolution with pip install --only-binary=:all:.

It’s also not a bug in any of the current Python packaging code: utilities like packaging.utils.parse_sdist_filename are well-defined with respect to their expected inputs, and only fail because some possible inputs are not well-defined.

Instead, I like to think of it as a little parable on how easy it is to introduce parsing ambiguities (and therefore failure modes) in contexts where parsing might not strictly be necessary. For example: the Simple Index format encourages consumers to parse the filenames, but also supplies some metadata as hidden attributes on each link (for example, Requires-Python becomes a data-requires-python attribute). This pattern could be extended to provide unambiguous fields for each component of the sdist (and wheel!) filenames; maybe I’ll write a PEP to propose that.

And as a closing note: with any luck, none of this will matter soon! If accepted, PEP 691 will standardize PyPI’s JSON API, allowing pip to query it instead. The JSON API is much more structured, and requires no filename parsing!


  1. Not to be confused with uniform initialization, which was introduced with C++11 and “solves” the most vexing parse by offering an unambiguous initialization syntax. 

  2. The fact that Widget() gets evaluated to an anonymous function parameter is one of the single most confusing things I’ve ever encountered in C++. It happens because of type decay: Widget() decays to Widget <anonymous>(), i.e. a function (pointer) that takes no parameter and returns a Widget. Terrible! 

  3. This is solved in most other languages: Rust, for example, permits only a single declaration of an identifier in the entire program and has non-uniform syntax for declaring different types: fn ... for function definitions (which are always fully defined, not forward declared) and let (or const) for introducing a binding. 

  4. Among them: bdist_rpm (builds an RPM-compatible distribution), bdist_dmg (for macOS-style .dmg installers), and so forth. Most of these are deprecated or will be shortly.

  5. I’m not going to bother talking about eggs, since they’re deprecated on the packager side and strongly discouraged. 

  6. Or the legacy distutils, although hopefully not for long.

  7. It’s not actually that simple. But we can pretend that it is, particularly compared to the arbitrary behavior permitted by source distributions. 

  8. Among many other platform specificities: the kind and version of libc being run, the OS itself, &c. 

  9. This is a very rough sketch of how distribution selection works. There are many Python package installers (and many versions of pip), and many tunable knobs for resolution and distribution selection. This is only meant to be the “intuitive” search, not necessarily reflective of how pip install ... will actually resolve a distribution on your machine. 

  10. I’m not actually sure which is chosen consistently, if any. I would expect it to be the “most specific” wheel (i.e. closest ABI, closest Python Version, &c) but I couldn’t find a guarantee for (or source code indicating) that behavior. 

  11. They’re very weakly standardized by PEP 517, which only asserts the current format as a historical format. 

  12. That being said, PEP 625 is currently on the standards track and will solidify a filename format that roughly resembles the current one (with .sdist instead of .tar.gz and with wheel-style normalization). 

  13. This doesn’t appear to be done in practice. For example, pip-audit==2.2.1 on PyPI has an sdist of pip-audit-2.2.1.tar.gz, not pip_audit-2.2.1.tar.gz. There’s been some discussion on that discrepancy in this thread.

  14. I’m not sure when. Please let me know if you know! 

  15. It works with --use-deprecated=legacy-resolver, but that can’t be relied on in the future. 

  16. I was also told that pip will disambiguate the distribution name from the version since it already knows the distribution name as part of pip install name, but I wasn’t able to actually reproduce this (since so many of the packages with “legacy” versions don’t pass metadata validation at all). But that doesn’t mean it isn’t correct; I just couldn’t find an example on PyPI. 

