A small DOCUMERICA Twitter bot

Oct 25, 2021     Tags: art, data, devblog, python    

TL;DR: I’ve written a Twitter bot that posts pictures from the DOCUMERICA program. The code for getting the DOCUMERICA photos, building a DB, and the Twitter bot itself is all here.


I’m taking a break this month from toiling in the LLVM mines to do something a little bit lighter.

From 1972 to 1977, the U.S. Environmental Protection Agency ran the DOCUMERICA program: dozens of freelance photographers across the country were paid to “photographically document subjects of environmental concern.”

I’ve loved the DOCUMERICA photos ever since I first saw them. Most of the photographers paid through the program interpreted the task broadly, leaving us an incredible archive of 1970s American society and its environs.

Over 80,000 photos were taken through DOCUMERICA, of which approximately 22,000 were curated for storage at the National Archives. From those 22,000 or so, 15,992 have been digitized and are available as an online collection.

The National Archives has done a fantastic job of keeping the DOCUMERICA records, but I’ve always wanted to (try to) expose them to a wider audience, one that might understandably be less interested in crawling through metadata to find pieces of Americana. I’ve already written a few Twitter bots, and the DOCUMERICA photos seemed like a good target for yet another, so why not?

Getting the data

To the best of my knowledge, the National Archives is the only (official) repository for the photographs taken through the DOCUMERICA program. They also appear to have uploaded a more curated collection to Flickr, but it’s not nearly as complete.

Fortunately, the National Archives has a Catalog API that includes searching and bulk exporting. Even more fortunately [1] (and unusually for library/archive APIs), it actually supports JSON!

The Catalog API allows us to retrieve up to 10,000 results per request, and there are only around 16,000 “top-level” records in the DOCUMERICA volume, so we only need two requests to get all of the metadata for the entire program:

base="https://catalog.archives.gov/api/v1"
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&sort=naId%20asc" > 1.sort.json
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&offset=10000&sort=naId%20asc" > 2.sort.json

A few funny things with this:

From here, we can confirm that we got as many records as we expected:

$ jq -c '.opaResponse.results.result | .[]' 1.sort.json 2.sort.json | wc -l
15992

Normalization

Structuring archival data is a complicated, unenviable task. For reasons that escape my hobbyist understanding, the layout returned by the National Archives API has some…unusual features:

I took this as a challenge to practice my jq skills, and came up with this mess:

jq -c \
  '.opaResponse.results.result |
   .[] |
   {
      naid: .naId,
      title: .description.item.title,
      author: .description.item.personalContributorArray.personalContributor.contributor.termName,
      date: .description.item.productionDateArray.proposableQualifiableDate.logicalDate,
      files: [
        .objects.object | if type == "array" then . else [.] end |
        .[] |
        (.file | if type == "array" then . else [.] end)
      ] | flatten | map(select(.))
   }' 1.sort.json 2.sort.json > documerica.jsonl

The files filter is the only really confusing one here. To break it down:

  1. We take the value of objects.object, and normalize it into an array if it isn’t already one
  2. For each object, we get its file, normalizing that into an array if it isn’t already one
  3. We flatten our newly minted array-of-arrays, and filter down to non-null values
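If jq isn’t your cup of tea, here’s a rough Python equivalent of the same normalization (purely illustrative, not the pipeline I actually ran; the as_list and dig helpers are made-up names):

import json

def as_list(value):
    # Normalize a value that may be a single object, a list, or missing into a list.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def dig(record, *keys):
    # Walk a chain of dictionary keys, returning None as soon as any step is missing.
    for key in keys:
        if not isinstance(record, dict):
            return None
        record = record.get(key)
    return record

with open("documerica.jsonl", "w") as out:
    for path in ("1.sort.json", "2.sort.json"):
        with open(path) as f:
            results = json.load(f)["opaResponse"]["results"]["result"]
        for result in results:
            item = dig(result, "description", "item")
            files = [
                file
                for obj in as_list(dig(result, "objects", "object"))
                for file in as_list(dig(obj, "file"))
            ]
            record = {
                "naid": result["naId"],
                "title": dig(item, "title"),
                "author": dig(item, "personalContributorArray",
                              "personalContributor", "contributor", "termName"),
                "date": dig(item, "productionDateArray",
                            "proposableQualifiableDate", "logicalDate"),
                "files": files,
            }
            out.write(json.dumps(record) + "\n")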

The end result is a JSONL stream, each record of which looks like:

{
  "naid": "542494",
  "title": "DISCARDED PESTICIDE CANS",
  "author": "Daniels, Gene, photographer",
  "date": "1972-05-01T00:00:00",
  "files": [
    {
      "@mime": "image/gif",
      "@name": "01-0236a.gif",
      "@path": "content/arcmedia/media/images/1/3/01-0236a.gif",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/OpaAPI/media/542494/content/arcmedia/media/images/1/3/01-0236a.gif"
    },
    {
      "@mime": "image/jpeg",
      "@name": "412-DA-00001_01-0236M.jpg",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
    },
    {
      "@mime": "image/tiff",
      "@name": "412-DA-00001_01-0236M.TIF",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
      "@type": "archival",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
    },
    {
      "@mime": "image/jpeg",
      "@name": "412-DA-00001_01-0236M.jpg",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
    },
    {
      "@mime": "image/tiff",
      "@name": "412-DA-00001_01-0236M.TIF",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
      "@type": "archival",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
    }
  ]
}

In the process, I discovered that 69 of the 15,992 records made available online don’t have any photographic scans associated with them. I figure that this is probably a human error that happened during archiving/digitization, so I sent an email to NARA asking them to check. Hopefully they respond!
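If you’d like to double-check that count, a quick pass over documerica.jsonl turns them up (a minimal sketch, assuming the normalized records from above):

import json

# Collect the NAIDs of records that have no digitized files attached.
missing = []
with open("documerica.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if not record["files"]:
            missing.append(record["naid"])

print(len(missing))        # 69
print(" ".join(missing))   # the NAIDs listed below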

For the curious, these are the NAIDs for the records that are missing photographic scans:

  545393 551930 552926
552939 552954 553031
553033 553864 556371
558412 558413 558415
558416 558417 558419
558420 558421 558422
558423 558424 558425
558426 558427 558428
558429 558430 558431
558432 558433 558434
558435 558436 558437
558438 558439 558440
558441 558442 558443
558444 558445 558446
558447 558448 558449
558450 558451 558452
558453 558454 558455
558456 558457 558458
558459 558460 558461
558462 558463 558464
558465 558466 558467
558468 558469 558470
558472 558473 558474
  

Update: The Still Pictures group at NARA responded with this explanation:

The photograph described under National Archives Identifier 556371 (412-DA-13919) contains the subject’s social security number, and therefore access to the image is fully restricted under the FOIA (b)(6) exemption for Personal Information. The other 68 items have a Use Restriction(s) status (located in the Details section of each Catalog description) of “Undetermined”. It is unclear why this status was chosen; nowadays we typically include a note with information about any known or suspected use restrictions (like copyright) in the item description. That said, this series was digitized in the late 1990s, and as far as I can tell these 68 items were not scanned with the rest of the series at the time.

Database-ification

I love JSONL as a format, but it isn’t ideal for keeping track of a Twitter bot’s state (e.g., making sure we don’t tweet the same picture twice).

Despite doing hobby and open-source programming for over a decade at this point, I haven’t ever used a (relational) database in a personal project [2]. So now seemed like a good time to do so.

The schema ended up being nice and simple:

CREATE TABLE documerica (
    naid INTEGER UNIQUE NOT NULL, -- the National Archives ID for this photo
    title TEXT,                   -- the photo's title/description
    author TEXT NOT NULL,         -- the photo's author
    created TEXT,                 -- the ISO8601 date for the photo's creation
    url TEXT NOT NULL,            -- a URL to the photo, as a JPEG
    tweeted INTEGER NOT NULL      -- whether or not the photo has already been tweeted
)

(It’s SQLite, so there are no proper date/time types to use for created).

The full script that builds the DB from documerica.jsonl is here.
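If you’d rather not click through, the core of it looks roughly like this (a sketch only: the documerica.db filename and the “grab the first JPEG scan” heuristic are my simplifications here, not necessarily what the real script does):

import json
import sqlite3

conn = sqlite3.connect("documerica.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS documerica (
        naid INTEGER UNIQUE NOT NULL,
        title TEXT,
        author TEXT NOT NULL,
        created TEXT,
        url TEXT NOT NULL,
        tweeted INTEGER NOT NULL
    )
    """
)

with open("documerica.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Prefer a JPEG scan; records with no scans at all get skipped.
        jpegs = [obj["@url"] for obj in record["files"] if obj["@mime"] == "image/jpeg"]
        if not jpegs:
            continue
        # INSERT OR IGNORE quietly drops duplicates and rows that violate the
        # NOT NULL constraints, rather than aborting the whole run.
        conn.execute(
            "INSERT OR IGNORE INTO documerica VALUES (?, ?, ?, ?, ?, 0)",
            (int(record["naid"]), record["title"], record["author"],
             record["date"], jpegs[0]),
        )

conn.commit()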

The bot

Technology-wise, there’s nothing too interesting about the bot: it uses python-twitter for Twitter API access.

I ran into only one hiccup, with the default media upload technique: it seems as though python-twitter’s PostUpdate API runs afoul of Twitter’s media chunking expectations somewhere. My hack fix is to use UploadMediaSimple to upload the picture first, and then add it to the tweet. As a bonus, this makes it much easier to add alt text!

# upload the image (a file-like object) on its own, instead of passing it to PostUpdate
media_id = api.UploadMediaSimple(io)
# attach the photo's title as alt text
api.PostMediaMetadata(media_id, photo["title"])
# then tweet it, with a link back to the National Archives record
api.PostUpdate(
    f"{tweet}\n\n{archives_url(photo['naid'])}",
    media=media_id,
)

That’s just about the only thing interesting about the bot’s source. You can, of course, read it for yourself.
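For a sense of how the tweeted column from the schema ties everything together, the overall flow looks roughly like this (a simplified sketch rather than the bot’s actual source; the random-selection query, the tweet text, and the archives_url implementation are my guesses):

import io
import sqlite3

import requests
import twitter  # python-twitter

def archives_url(naid):
    # The catalog's public record URL format (guessing at what the real helper does).
    return f"https://catalog.archives.gov/id/{naid}"

conn = sqlite3.connect("documerica.db")
conn.row_factory = sqlite3.Row

# Pick a photo we haven't tweeted yet, at random.
photo = conn.execute(
    "SELECT * FROM documerica WHERE tweeted = 0 ORDER BY RANDOM() LIMIT 1"
).fetchone()

# Fetch the JPEG and wrap it in a file-like object for python-twitter.
image = io.BytesIO(requests.get(photo["url"]).content)

api = twitter.Api(
    consumer_key="...",
    consumer_secret="...",
    access_token_key="...",
    access_token_secret="...",
)

tweet = f"{photo['title']} ({photo['created']})"  # guessing at the exact tweet format
media_id = api.UploadMediaSimple(image)
api.PostMediaMetadata(media_id, photo["title"])
api.PostUpdate(f"{tweet}\n\n{archives_url(photo['naid'])}", media=media_id)

# Mark the photo as tweeted so it never gets picked again.
conn.execute("UPDATE documerica SET tweeted = 1 WHERE naid = ?", (photo["naid"],))
conn.commit()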

Wrapup

At the end of the day, here’s what it looks like:

A screenshot from Twitter

This was a nice short project that exercised a few skills that I’ve used less often in the last few years. Some key takeaways for me:


  1. Or perhaps not, as the subsequent jq wrangling in this post will demonstrate. 

  2. In case this surprises you: I use them regularly at work. 

  3. I acknowledge that it’s hard to prevent abuse if you encourage more people to write bots. But there are lots of technical solutions to this, like having the bot signup flow require connection to a human’s Twitter account. 


Discussions: Reddit