Feb 26, 2024 Tags: devblog, programming, rust
Another announcement-type post, this time for a data-modeling crate for GitHub Actions: github-actions-models. Docs here.
The short version: github-actions-models provides idiomatic serde-compatible structs and enums for inspecting and walking over the contents of GitHub Actions components, including actions, workflows, and Dependabot definitions.
These can be loaded in directly from their YAML representations, e.g. the following action.yml from actions/setup-python:
# https://github.com/actions/setup-python/blob/e9d6f990972a57673cdb72ec29e19d42ba28880f/action.yml
---
name: "Setup Python"
description: "Set up a specific version of Python and add the command-line tools to the PATH."
author: "GitHub"
inputs:
  python-version:
    description: "Version range or exact version of Python or PyPy to use, using SemVer's version range syntax. Reads from .python-version if unset."
  python-version-file:
    description: "File containing the Python version to use. Example: .python-version"
  cache:
    description: "Used to specify a package manager for caching in the default directory. Supported values: pip, pipenv, poetry."
    required: false
  architecture:
    description: "The target architecture (x86, x64) of the Python or PyPy interpreter."
  check-latest:
    description: "Set this option if you want the action to check for the latest available version that satisfies the version spec."
    default: false
  token:
    description: "The token used to authenticate when fetching Python distributions from https://github.com/actions/python-versions. When running this action on github.com, the default value is sufficient. When running on GHES, you can pass a personal access token for github.com if you are experiencing rate limiting."
    default: ${{ github.server_url == 'https://github.com' && github.token || '' }}
  cache-dependency-path:
    description: "Used to specify the path to dependency files. Supports wildcards or a list of file names for caching multiple dependencies."
  update-environment:
    description: "Set this option if you want the action to update environment variables."
    default: true
  allow-prereleases:
    description: "When 'true', a version range passed to 'python-version' input will match prerelease versions if no GA versions are found. Only 'x.y' version range is supported for CPython."
    default: false
outputs:
  python-version:
    description: "The installed Python or PyPy version. Useful when given a version range as input."
  cache-hit:
    description: "A boolean value to indicate a cache entry was found"
  python-path:
    description: "The absolute path to the Python or PyPy executable."
runs:
  using: "node20"
  main: "dist/setup/index.js"
  post: "dist/cache-save/index.js"
  post-if: success()
branding:
  icon: "code"
  color: "yellow"
can be loaded with:
use github_actions_models::action::{Action, Runs};

fn main() {
    // unwraps, panics, etc. for brevity.
    let contents = std::fs::read_to_string("action.yml").unwrap();
    let action = serde_yaml::from_str::<Action>(&contents).unwrap();

    let Runs::JavaScript(runs) = action.runs else { panic!("expected JS action") };

    assert_eq!(runs.using, "node20");
    assert_eq!(runs.main, "dist/setup/index.js");
    assert_eq!(runs.post.unwrap(), "dist/cache-save/index.js");
    assert_eq!(runs.post_if.unwrap(), "success()");
}
(corresponding to models like Action and Runs).
I really, really hate writing code like this (written as Python for brevity):
import yaml

action = yaml.safe_load(content)
runs = action["runs"]    # KeyError if `runs` is missing
using = runs["using"]    # KeyError if `using` is missing;
                         # TypeError if `runs` is not a dict
if using == "node20":    # am i sure `using` is a `str`?
    ...
While arguably compulsive of me and unnecessary in many settings (e.g. scripting ones), I hate code that allows errors at the data/parsing layer (e.g., an unexpected or missing value in the loaded YAML) to percolate into the “business logic” layer. Conversely, I love code that is correct with respect to its constructed types, e.g.:
# `action` is constructed if and only if the loaded YAML is a well-formed Action
action = Action.from_yaml(content)
if action.runs.using == "node20":
    ...
Haskell-y people call this pattern “parse, don’t validate”; I generally think of it as an extension of the “full recognition before processing” slogan from LANGSEC.
Generally speaking, this kind of data modeling is an important first step towards building reliable, high-confidence tools: high quality data models mean that data quality issues stop at the “edge” of the system, rather than manifesting somewhere deep in the system at the least convenient time possible.
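To make the "stop at the edge" idea concrete in Rust, here is a minimal, standard-library-only sketch. It is not the crate's actual API: Using, UnknownUsing, and is_javascript are hypothetical names invented for illustration, and the list of runtimes is deliberately incomplete. The point is only that the string is recognized once, at the boundary, and everything downstream works with a type that cannot hold a malformed value.

```rust
use std::str::FromStr;

/// Hypothetical, simplified stand-in for a typed `runs.using` value.
/// (An assumption for illustration; not the type used by github-actions-models.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Using {
    Node16,
    Node20,
    Composite,
    Docker,
}

/// Error produced at the parsing "edge" when the runtime is not recognized.
#[derive(Debug, PartialEq, Eq)]
struct UnknownUsing(String);

impl FromStr for Using {
    type Err = UnknownUsing;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "node16" => Ok(Using::Node16),
            "node20" => Ok(Using::Node20),
            "composite" => Ok(Using::Composite),
            "docker" => Ok(Using::Docker),
            other => Err(UnknownUsing(other.to_string())),
        }
    }
}

/// "Business logic": it only ever receives a well-formed `Using`,
/// so it has no error cases of its own to handle.
fn is_javascript(using: Using) -> bool {
    matches!(using, Using::Node16 | Using::Node20)
}

fn main() {
    // Recognition happens exactly once, here.
    let using: Using = "node20".parse().expect("unrecognized runtime");
    assert!(is_javascript(using));

    // A typo is rejected at the edge instead of silently failing a later comparison.
    assert!("node2O".parse::<Using>().is_err());
}
```

Compared to comparing raw strings deep inside the program, a misspelled or unsupported runtime here becomes a parse error with a clear origin, and the match in from_str is the only place that needs updating when a new runtime appears.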
In this particular case: I’ve been looking into building a general framework for statically analyzing GitHub Actions configurations. There are pre-existing tools like actionlint and Scorecard, but I want something that can be like clippy but for CI/CD configurations (e.g., allowing me to rapidly develop rules for specific actions and their configurations).
The first step towards doing that is being able to confidently load an invariant-preserving model of the configuration files being analyzed, hence this work.
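As a rough sketch of what a clippy-style rule over typed models could look like, the following uses only the standard library. Everything in it is an assumption made for illustration: Step, Finding, Rule, UnpinnedUses, and run_rules are hypothetical names, Step is far simpler than any real workflow model, and the rule itself (flagging `uses:` references that are not pinned to a 40-character commit hash) is just one plausible example of a check.

```rust
/// Hypothetical, heavily simplified model of a single workflow step.
/// (Not the crate's actual type; the fields are assumptions.)
#[derive(Debug)]
struct Step {
    name: Option<String>,
    uses: Option<String>,
}

/// A single diagnostic produced by a rule.
#[derive(Debug, PartialEq, Eq)]
struct Finding {
    rule: &'static str,
    message: String,
}

/// Hypothetical rule interface: rules inspect typed steps, never raw YAML.
trait Rule {
    fn name(&self) -> &'static str;
    fn check(&self, step: &Step) -> Option<Finding>;
}

/// Example rule: flag `uses:` references that are not pinned to a full commit SHA.
/// For simplicity it does not special-case local (`./...`) or `docker://` references.
struct UnpinnedUses;

impl Rule for UnpinnedUses {
    fn name(&self) -> &'static str {
        "unpinned-uses"
    }

    fn check(&self, step: &Step) -> Option<Finding> {
        // Steps without `uses:` (e.g. `run:` steps) are out of scope for this rule.
        let uses = step.uses.as_deref()?;
        let pinned = match uses.rsplit_once('@') {
            Some((_, git_ref)) => {
                git_ref.len() == 40 && git_ref.chars().all(|c| c.is_ascii_hexdigit())
            }
            None => false,
        };
        if pinned {
            return None;
        }
        let step_name = step.name.as_deref().unwrap_or("<unnamed>");
        Some(Finding {
            rule: self.name(),
            message: format!("step {step_name}: `{uses}` is not pinned to a commit SHA"),
        })
    }
}

/// Run every rule against every step and collect the findings.
fn run_rules(rules: &[Box<dyn Rule>], steps: &[Step]) -> Vec<Finding> {
    steps
        .iter()
        .flat_map(|step| rules.iter().filter_map(move |rule| rule.check(step)))
        .collect()
}

fn main() {
    let rules: Vec<Box<dyn Rule>> = vec![Box::new(UnpinnedUses)];
    let steps = vec![Step {
        name: Some("checkout".to_string()),
        uses: Some("actions/checkout@v4".to_string()),
    }];
    for finding in run_rules(&rules, &steps) {
        println!("[{}] {}", finding.rule, finding.message);
    }
}
```

Because the rule receives an already-parsed Step, it contains no YAML handling or type checking of its own, which is what makes adding new rules cheap.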
“Why not use JSON Schema? GitHub Actions has definitions on Schema Store,” I hear you say.
Dear reader: I tried. I would love to not write data models by hand, and instead automatically generate them from machine-readable schemata. In practice, doing that (in Rust, at the minimum) has been a significant pain in the ass:
There are JSON Schema to Rust generators, like schemafy and typify. These are fantastic, when they support the version of JSON Schema you have on hand and the schema’s author hasn’t taken advantage of every bizarre corner of the JSON Schema standard.
In practice, these constraints (along with JSON Schema’s impressive ability to represent equivalent type shapes in an infinite number of ways) mean that schema-to-code generators tend to overindex on specific schema “shapes” for idiomatic code generation, or only end up supporting whatever subset of the specification appears in their immediate downstream use case. These properties aren’t the generators’ fault; they indicate a failure to produce a tractable schema format [1].
Even at its best, most code generated from a reference schema (or IDL, or any other source) has that distinct air of machine generation: things like auto-generated field names (Variant0001), insufficient deduplication of types across modules, &c. These are stupidly hard problems to solve while trying to preserve a reasonably small schema design, but they also mean that most schema-generated code just isn’t very pleasant to interact with.
Going forwards, I’ll probably continue to hand-roll these (and deal with the consequences). In terms of things on the horizon:
I’d like to add models for the various GitHub event payloads, i.e. the JSON blob that gets delivered with the pull_request (or whatever) webhook event.
I intend to actually use these models in another, hitherto unreleased, tool.
[1] For more evidence of this, read the JSON Schema specification page and try to determine which subsets of which draft versions of the specification your implementation supports. I’ve used everything from Draft 4 to Draft 7 to Draft “2020-12,” with no intuition for the ordering or feature changes between those versions.