Oct 25, 2021 Tags: art, data, devblog, python
TL;DR: I’ve written a Twitter bot that posts pictures from the DOCUMERICA program. The code for getting the DOCUMERICA photos, building a DB, and the Twitter bot itself is all here.
I’m taking a break this month from toiling in the LLVM mines to do something a little bit lighter.
From 1972 to 1977, the U.S.’s Environmental Protection Agency ran the DOCUMERICA program: dozens of freelance photographers across the country were paid to “photographically document subjects of environmental concern.”
I’ve loved the DOCUMERICA photos ever since I first saw them. Most of the photographers paid through the program interpreted the task broadly, leaving us an incredible archive of 1970s American society and its environs:
Over 80,000 photos were taken through DOCUMERICA, of which approximately 22,000 were curated for storage at the National Archives. From those 22,000 or so, 15,992 have been digitized and are available as an online collection.
The National Archives has done a fantastic job of keeping the DOCUMERICA records, but I’ve always wanted to (try to) expose them to a wider audience, one that might understandably be less interested in crawling through metadata to find pieces of Americana. I’ve already written a few Twitter bots, and the DOCUMERICA photos seemed like a good target for yet another, so why not?
To the best of my knowledge, the National Archives is the only (official) repository for the photographs taken through the DOCUMERICA program. They also appear to have uploaded a more curated collection to Flickr, but it’s not nearly as complete.
Fortunately, the National Archives have a Catalog API that includes searching and bulk exporting. Even more fortunately1 (and unusually for library/archive APIs), it actually supports JSON!
The Catalog API allows us to retrieve 10,000 results per request and there are only around 16,000 “top-level” records in the DOCUMERICA volume, so we only need to do two requests to get all of the metadata for the entire program:
base="https://catalog.archives.gov/api/v1"
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&sort=naId%20asc" > 1.sort.json
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&offset=10000&sort=naId%20asc" > 2.sort.json
A few funny things with this:

- It isn’t obvious what that long query parameter (description.item.parentSeries.naId) is doing here: it’s telling the API to limit results to only those entities whose parent series is 542493, i.e. the top-level identifier for the DOCUMERICA program.
- The sort matters: the query matches not just the first-level children of 542493 but also their children (i.e., individual little blobs of JSON for every variant of every file). Those children fortunately (and coincidentally?) don’t have sequential NAIDs while the first-level child DOCUMERICA records do, so adding the sort gets us exactly what we want.

From here, we can confirm that we got as many records as we expected:
$ jq -c '.opaResponse.results.result | .[]' 1.sort.json 2.sort.json | wc -l
15992
Structuring archival data is a complicated, unenviable task. For reasons that escape my hobbyist understanding, the layout returned by the National Archives API has some…unusual features:

- The results key is not an array, but a dictionary. The actual array of results is in results.result.
- Keys suffixed with Array (e.g. geographicReferenceArray) are actually dictionaries. I don’t know if this is an Archives-specific terminology choice, or whether it implies that these values can be arrays if more than one datapoint is available.
- Similarly to the above: the object and file keys use the same mixed dictionary-array pattern. By way of example, here’s the dictionary form of object from just a single record:
"objects":{
"@created":"2015-01-01T00:00:00Z",
"@version":"OPA-OBJECTS-1.0",
"object":[
{
"@id":"14676552",
"@objectSortNum":"1",
"technicalMetadata":{
"size":"137930",
"mime":"image\/gif",
"Chroma_BlackIsZero":"true",
"Chroma_ColorSpaceType":"RGB",
"Chroma_NumChannels":"3",
"Compression_CompressionTypeName":"lzw",
"Compression_Lossless":"true",
"Compression_NumProgressiveScans":"4",
"height":"600",
"width":"405"
},
"file":{
"@mime":"image\/gif",
"@name":"01-0237a.gif",
"@path":"content\/arcmedia\/media\/images\/1\/3\/01-0237a.gif",
"@type":"primary",
"@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/content\/arcmedia\/media\/images\/1\/3\/01-0237a.gif"
},
"thumbnail":{
"@mime":"image\/jpeg",
"@path":"opa-renditions\/thumbnails\/01-0237a.gif-thumb.jpg",
"@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/opa-renditions\/thumbnails\/01-0237a.gif-thumb.jpg"
},
"imageTiles":{
"@path":"opa-renditions\/image-tiles\/01-0237a.gif.dzi",
"@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/opa-renditions\/image-tiles\/01-0237a.gif.dzi"
}
},
{
"@id":"209221188",
"@objectSortNum":"2",
"file":[
{
"@mime":"image\/jpeg",
"@name":"412-DA-00002_01-0237M.jpg",
"@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg",
"@type":"primary",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg"
},
{
"@mime":"image\/tiff",
"@name":"412-DA-00002_01-0237M.TIF",
"@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF",
"@type":"archival",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF"
}
],
"thumbnail":{
"@mime":"image\/jpeg",
"@path":"opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg"
},
"imageTiles":{
"@path":"opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi"
}
},
{
"@id":"209439452",
"@objectSortNum":"3",
"file":[
{
"@mime":"image\/jpeg",
"@name":"412-DA-00002_01-0237M.jpg",
"@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg",
"@type":"primary",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg"
},
{
"@mime":"image\/tiff",
"@name":"412-DA-00002_01-0237M.TIF",
"@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF",
"@type":"archival",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF"
}
],
"thumbnail":{
"@mime":"image\/jpeg",
"@path":"opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg"
},
"imageTiles":{
"@path":"opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi",
"@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi"
}
}
]
}
I took this as a challenge to practice my jq
skills, and came up with this mess:
jq -c \
'.opaResponse.results.result |
.[] |
{
naid: .naId,
title: .description.item.title,
author: .description.item.personalContributorArray.personalContributor.contributor.termName,
date: .description.item.productionDateArray.proposableQualifiableDate.logicalDate,
files: [
.objects.object | if type == "array" then . else [.] end |
.[] |
(.file | if type == "array" then . else [.] end)
] | flatten | map(select(.))
}' 1.sort.json 2.sort.json > documerica.jsonl
The files filter is the only really confusing one here. To break it down:

- We take objects.object, and normalize it into an array if it isn’t already one
- For each object, we get its file, normalizing that into an array if it isn’t already one
- We flatten our newly minted array-of-arrays, and filter down to non-null values (a Python sketch of the same idiom follows the example record below)

The end result is a JSONL stream, each record of which looks like:
{
"naid": "542494",
"title": "DISCARDED PESTICIDE CANS",
"author": "Daniels, Gene, photographer",
"date": "1972-05-01T00:00:00",
"files": [
{
"@mime": "image/gif",
"@name": "01-0236a.gif",
"@path": "content/arcmedia/media/images/1/3/01-0236a.gif",
"@type": "primary",
"@url": "https://catalog.archives.gov/OpaAPI/media/542494/content/arcmedia/media/images/1/3/01-0236a.gif"
},
{
"@mime": "image/jpeg",
"@name": "412-DA-00001_01-0236M.jpg",
"@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
"@type": "primary",
"@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
},
{
"@mime": "image/tiff",
"@name": "412-DA-00001_01-0236M.TIF",
"@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
"@type": "archival",
"@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
},
{
"@mime": "image/jpeg",
"@name": "412-DA-00001_01-0236M.jpg",
"@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
"@type": "primary",
"@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
},
{
"@mime": "image/tiff",
"@name": "412-DA-00001_01-0236M.TIF",
"@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
"@type": "archival",
"@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
}
]
}
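For readers who’d rather not squint at jq, here’s the same dict-or-array normalization expressed in Python. This is just an illustrative sketch of the idiom (the helper names as_list and files_for are mine; the bot doesn’t actually use this code):

def as_list(value):
    # The Catalog API sometimes returns a single dict where a list of dicts
    # might otherwise appear; normalize both shapes (and null) into a list.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def files_for(record):
    # Rough equivalent of the jq "files" filter above: collect every "file"
    # entry under objects.object, flattening and dropping missing values.
    files = []
    for obj in as_list(record.get("objects", {}).get("object")):
        files.extend(as_list(obj.get("file")))
    return files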
In the process, I discovered that 69 of the 15,992 records made available online don’t have any photographic scans associated with them. I figure that this is probably a human error that happened during archiving/digitization, so I sent an email to NARA asking them to check. Hopefully they respond!
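If you’d like to reproduce that count, it falls straight out of the JSONL stream from earlier. A minimal sketch (not the exact check I ran):

import json

missing = []
with open("documerica.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if not record["files"]:  # no scans attached to this record
            missing.append(record["naid"])

print(len(missing))  # 69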
For the curious, here are the NAIDs for the records that are missing photographic scans:
545393 551930 552926 552939 552954 553031 553033 553864 556371 558412 558413 558415 558416 558417 558419 558420 558421 558422 558423 558424 558425 558426 558427 558428 558429 558430 558431 558432 558433 558434 558435 558436 558437 558438 558439 558440 558441 558442 558443 558444 558445 558446 558447 558448 558449 558450 558451 558452 558453 558454 558455 558456 558457 558458 558459 558460 558461 558462 558463 558464 558465 558466 558467 558468 558469 558470 558472 558473 558474
Update: The Still Pictures group at NARA responded with this explanation:
The photograph described under National Archives Identifier 556371 (412-DA-13919) contains the subject’s social security number, and therefore access to the image is fully restricted under the FOIA (b)(6) exemption for Personal Information. The other 68 items have a Use Restriction(s) status (located in the Details section of each Catalog description) of “Undetermined”. It is unclear why this status was chosen; nowadays we typically include a note with information about any known or suspected use restrictions (like copyright) in the item description. That said, this series was digitized in the late 1990s, and as far as I can tell these 68 items were not scanned with the rest of the series at the time.
I love JSONL as a format, but it isn’t ideal for keeping track of a Twitter bot’s state (e.g., making sure we don’t tweet the same picture twice).
Despite doing hobby and open-source programming for over a decade at this point, I haven’t used a (relational) database in a personal project ever2. So now seemed like a good time to do so.
The schema ended up being nice and simple:
CREATE TABLE documerica (
naid INTEGER UNIQUE NOT NULL, -- the National Archives ID for this photo
title TEXT, -- the photo's title/description
author TEXT NOT NULL, -- the photo's author
created TEXT, -- the ISO8601 date for the photo's creation
url TEXT NOT NULL, -- a URL to the photo, as a JPEG
tweeted INTEGER NOT NULL -- whether or not the photo has already been tweeted
)
(It’s SQLite, so there are no proper date/time types to use for created.)
The full script that builds the DB from documerica.jsonl
is
here.
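The core of that script is small enough to sketch here. This is an approximation rather than a verbatim excerpt; in particular, the choice of which files entry becomes url is simplified to “first JPEG wins”:

import json
import sqlite3

# Assumes documerica.db already has the table from the schema above.
db = sqlite3.connect("documerica.db")

with open("documerica.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Prefer a JPEG rendition for the url column; skip records (like the
        # 69 mentioned earlier) that have no usable scan at all.
        jpegs = [fl for fl in record["files"] if fl["@mime"] == "image/jpeg"]
        if not jpegs:
            continue
        db.execute(
            "INSERT OR IGNORE INTO documerica VALUES (?, ?, ?, ?, ?, 0)",
            (
                record["naid"],
                record["title"],
                record["author"],
                record["date"],  # ISO8601 text, stored as-is in created
                jpegs[0]["@url"],
            ),
        )

db.commit()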
Technology-wise, there’s nothing too interesting about the bot: it uses
python-twitter
for Twitter API access.
I ran into only one hiccup, with the default media upload technique: it seems as though
python-twitter
’s PostUpdate
API runs afoul of Twitter’s media chunking expectations
somewhere. My hack fix is to use UploadMediaSimple
to upload the picture first,
and then add it to the tweet. As a bonus, this makes it much easier to add alt text!
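# upload the photo first, attach its title as alt text, then post the tweet with the media attached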
media_id = api.UploadMediaSimple(io)
api.PostMediaMetadata(media_id, photo["title"])
api.PostUpdate(
f"{tweet}\n\n{archives_url(photo['naid'])}",
media=media_id,
)
That’s just about the only thing interesting about the bot’s source. You can, of course, read it for yourself.
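The state tracking that the tweeted column exists for is similarly tiny. Roughly (again, a sketch rather than the bot’s literal code):

import sqlite3

db = sqlite3.connect("documerica.db")
db.row_factory = sqlite3.Row

# Pick a photo that hasn't been posted yet...
photo = db.execute(
    "SELECT * FROM documerica WHERE tweeted = 0 ORDER BY RANDOM() LIMIT 1"
).fetchone()

if photo is not None:
    # ...tweet it as above, then flip the flag so it never repeats.
    db.execute("UPDATE documerica SET tweeted = 1 WHERE naid = ?", (photo["naid"],))
    db.commit()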
At the end of the day, here’s what it looks like:
This was a nice short project that exercised a few skills that I’ve used less often in the last few years. Some key takeaways for me:
- I can write basic jq filters from memory, but my understanding of the filter language breaks down at anything more complicated than map and select. It’s a fantastic tool that saved me from writing a much messier transformation script in a more general-purpose language, so I should make an effort to become even more familiar with it.

1. Or perhaps not, as the subsequent jq wrangling in this post will demonstrate. ↩
2. In case this surprises you: I use them regularly at work. ↩
3. I acknowledge that it’s hard to prevent abuse if you encourage more people to write bots. But there are lots of technical solutions to this, like having the bot signup flow require connection to a human’s Twitter account. ↩