ENOSUCHBLOG

Programming, philosophy, pedaling.


High-quality Rust data models for GitHub Actions

Feb 26, 2024     Tags: devblog, programming, rust    


Another announcement-type post, this time for a data-modeling crate for GitHub Actions: github-actions-models. Docs here.

The short version: github-actions-models provides idiomatic serde-compatible structs and enums for inspecting and walking over the contents of GitHub Actions components, including actions, workflows, and Dependabot definitions.

These can be loaded directly from their YAML representations. For example, the following action.yml from actions/setup-python:

# https://github.com/actions/setup-python/blob/e9d6f990972a57673cdb72ec29e19d42ba28880f/action.yml
---
name: "Setup Python"
description: "Set up a specific version of Python and add the command-line tools to the PATH."
author: "GitHub"
inputs:
  python-version:
    description: "Version range or exact version of Python or PyPy to use, using SemVer's version range syntax. Reads from .python-version if unset."
  python-version-file:
    description: "File containing the Python version to use. Example: .python-version"
  cache:
    description: "Used to specify a package manager for caching in the default directory. Supported values: pip, pipenv, poetry."
    required: false
  architecture:
    description: "The target architecture (x86, x64) of the Python or PyPy interpreter."
  check-latest:
    description: "Set this option if you want the action to check for the latest available version that satisfies the version spec."
    default: false
  token:
    description: "The token used to authenticate when fetching Python distributions from https://github.com/actions/python-versions. When running this action on github.com, the default value is sufficient. When running on GHES, you can pass a personal access token for github.com if you are experiencing rate limiting."
    default: ${{ github.server_url == 'https://github.com' && github.token || '' }}
  cache-dependency-path:
    description: "Used to specify the path to dependency files. Supports wildcards or a list of file names for caching multiple dependencies."
  update-environment:
    description: "Set this option if you want the action to update environment variables."
    default: true
  allow-prereleases:
    description: "When 'true', a version range passed to 'python-version' input will match prerelease versions if no GA versions are found. Only 'x.y' version range is supported for CPython."
    default: false
outputs:
  python-version:
    description: "The installed Python or PyPy version. Useful when given a version range as input."
  cache-hit:
    description: "A boolean value to indicate a cache entry was found"
  python-path:
    description: "The absolute path to the Python or PyPy executable."
runs:
  using: "node20"
  main: "dist/setup/index.js"
  post: "dist/cache-save/index.js"
  post-if: success()
branding:
  icon: "code"
  color: "yellow"

can be loaded with:

use github_actions_models::action::{Action, Runs};

fn main() {
    // unwraps, panics, etc. for brevity.

    let contents = std::fs::read_to_string("action.yml").unwrap();
    let action = serde_yaml::from_str::<Action>(&contents).unwrap();

    let Runs::JavaScript(runs) = action.runs else { panic!("expected JS action") };
    assert_eq!(runs.using, "node20");
    assert_eq!(runs.main, "dist/setup/index.js");
    assert_eq!(runs.post.unwrap(), "dist/cache-save/index.js");
    assert_eq!(runs.post_if.unwrap(), "success()");
}

(corresponding to models like Action and Runs).

Rationale

I really, really hate writing code like this (written as Python for brevity):

import yaml

action = yaml.safe_load(content)

runs = action["runs"] # KeyError if `runs` is missing
using = runs["using"] # KeyError if `using` is missing;
                      # TypeError if `runs` is not a dict

if using == "node20": # am i sure `using` is a `str`?
    ...

While arguably compulsive of me and unnecessary in many settings (e.g. scripting ones), I hate code that allows errors at the data/parsing layer (e.g., an unexpected or missing value in the loaded YAML) to percolate into the “business logic” layer. Conversely, I love code that is correct with respect to its constructed types, e.g.:

# `action` is constructed if and only if the loaded YAML is a well-formed Action
action = Action.from_yaml(content)

if action.runs.using == "node20":
    ...

Haskell-y people call this pattern “parse, don’t validate”; I generally think of it as an extension of the “full recognition before processing” slogan from LANGSEC.
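To make the pattern concrete, here is a minimal Rust sketch that uses only the standard library. Everything in it is a hypothetical stand-in rather than the crate's actual API: `RawRuns` plays the role of an untyped YAML mapping, and `Using`, `Runs`, `ParseError`, and `from_raw` are illustrative names. The real crate gets the same effect through serde's derive machinery instead of hand-written parsing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for untyped data from a generic YAML loader.
type RawRuns = HashMap<String, String>;

/// Typed model: only recognized `using` values can be represented.
#[derive(Debug, PartialEq)]
enum Using {
    Node20,
    Composite,
    Docker,
}

#[derive(Debug, PartialEq)]
struct Runs {
    using: Using,
    main: Option<String>,
}

#[derive(Debug, PartialEq)]
enum ParseError {
    MissingKey(&'static str),
    UnknownUsing(String),
}

impl Runs {
    /// Parse once, at the edge. A `Runs` exists only if the input was well-formed.
    fn from_raw(raw: &RawRuns) -> Result<Runs, ParseError> {
        let using = match raw.get("using").map(String::as_str) {
            Some("node20") => Using::Node20,
            Some("composite") => Using::Composite,
            Some("docker") => Using::Docker,
            Some(other) => return Err(ParseError::UnknownUsing(other.to_string())),
            None => return Err(ParseError::MissingKey("using")),
        };
        Ok(Runs { using, main: raw.get("main").cloned() })
    }
}

fn main() {
    let mut raw = RawRuns::new();
    raw.insert("using".to_string(), "node20".to_string());
    raw.insert("main".to_string(), "dist/setup/index.js".to_string());

    // All checking happens here; failures never reach the code below.
    let runs = Runs::from_raw(&raw).expect("malformed `runs`");

    // The "business logic" matches on a closed enum, not on arbitrary strings.
    match runs.using {
        Using::Node20 => println!("JavaScript action, entry point: {:?}", runs.main),
        Using::Composite | Using::Docker => println!("not a JavaScript action"),
    }
}
```

Because `Using` is a closed enum, a typo like `"node2O"` is rejected in `from_raw` instead of silently failing a string comparison later on.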

Generally speaking, this kind of data modeling is an important first step towards building reliable, high-confidence tools: high-quality data models mean that data quality issues stop at the “edge” of the system, rather than manifesting somewhere deep in the system at the least convenient time possible.

In this particular case: I’ve been looking into building a general framework for statically analyzing GitHub Actions configurations. There are pre-existing tools like actionlint and Scorecard, but I want something that can be like clippy but for CI/CD configurations (e.g., allowing me to rapidly develop rules for specific actions and their configurations).

The first step towards doing that is being able to confidently load an invariant-preserving model of the configuration files being analyzed, hence this work.
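As a rough illustration of what a clippy-style rule over a typed model could look like, here is a small standard-library-only sketch. It is not the design of any existing tool: `Step`, `Rule`, `UnpinnedAction`, and `run_rules` are all hypothetical names, and the pinning check is deliberately naive (it would also flag local `./` actions, for instance).

```rust
/// Hypothetical, heavily simplified model of a single workflow step.
#[derive(Debug)]
struct Step {
    /// The `uses:` clause, e.g. "actions/setup-python@v5", if present.
    uses: Option<String>,
}

/// A lint rule: inspects a typed step and optionally reports a finding.
trait Rule {
    fn name(&self) -> &'static str;
    fn check(&self, step: &Step) -> Option<String>;
}

/// Illustrative rule: flag `uses:` references not pinned to a 40-hex-digit commit SHA.
struct UnpinnedAction;

impl Rule for UnpinnedAction {
    fn name(&self) -> &'static str {
        "unpinned-action"
    }

    fn check(&self, step: &Step) -> Option<String> {
        // Steps without a `uses:` clause are out of scope for this rule.
        let uses = step.uses.as_ref()?;
        let pinned = match uses.rsplit_once('@') {
            Some((_, git_ref)) => {
                git_ref.len() == 40 && git_ref.chars().all(|c| c.is_ascii_hexdigit())
            }
            None => false,
        };
        if pinned {
            None
        } else {
            Some(format!("`{}` is not pinned to a commit SHA", uses))
        }
    }
}

/// Run every rule against every step and collect the findings.
fn run_rules(rules: &[Box<dyn Rule>], steps: &[Step]) -> Vec<String> {
    let mut findings = Vec::new();
    for step in steps {
        for rule in rules {
            if let Some(msg) = rule.check(step) {
                findings.push(format!("{}: {}", rule.name(), msg));
            }
        }
    }
    findings
}

fn main() {
    let rules: Vec<Box<dyn Rule>> = vec![Box::new(UnpinnedAction)];
    let steps = vec![
        Step { uses: Some("actions/setup-python@v5".to_string()) },
        Step { uses: None },
    ];
    for finding in run_rules(&rules, &steps) {
        println!("{}", finding);
    }
}
```

The rule never has to ask whether `uses` is a string or whether the step is well-formed: those questions were settled when the model was loaded.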

Alternatives & next steps

“Why not use JSON Schema? GitHub Actions has definitions on Schema Store,” I hear you say.

Dear reader: I tried. I would love to not write data models by hand, and instead automatically generate them from machine-readable schemata. In practice, doing that (in Rust, at the minimum) has been a significant pain in the ass:

Going forwards, I’ll probably continue to hand-roll these (and deal with the consequences). In terms of things on the horizon:


  1. For more evidence of this, read the JSON Schema specification page and try to determine which subsets of which draft versions of the specification your implementation supports. I’ve used everything from Draft 4 to Draft 7 to Draft “2020-12,” with no intuition for the ordering or feature changes between those versions.