Programming, philosophy, pedaling.

Analyzing UMD Arrest Data

Dec 19, 2015     Tags: data, programming, umd    


There isn’t much context to this post.

I go to UMD, and I thought it might be interesting to analyze arrest statistics for the campus and surrounding area (which is policed by both the school’s force and county police).

Retrieving the Data

The Good News

Unlike many police departments, UMPD actually publishes detailed and timely information on incidents and arrests. The reports are even organized chronologically and categorized by type (notices, incidents, arrests, etc.) to boot.

The Bad News

Unfortunately for me, the reports look like this:

This format is produced by an amusingly long (single-line!) <table> with entries of the following form (indentation added):

	<td rowspan="2" valign="middle">19959</td>
	<td>01/01/15 02:46</td>
	<td align="center">2015-00000013</td>
	<td align="center" />
	<td align="center" />
	<td align="center" />
	<td colspan="5">(Driving, Attempting to drive) veh. while impaired by alcohol; (Driving, Attempting to drive) veh. while under the influence of alcohol</td>

After about 15 minutes of experimenting, I came up with umd-arrest-ledger.rb, which takes a single year as an argument and dumps all UMPD arrests for that year as a (big) JSON file:

$ umd-arrest-ledger.rb 2014 # leaves 2014.json in $CWD

A sample:

From here, a simple loop:

$ for i in $(seq 2010 2015); do umd-arrest-ledger.rb $i; done

And we have all the data we need in the form of six well-formatted JSON files (one per year, 2010 through 2015).
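For a sense of what the scraping step involves, here is a minimal Ruby sketch. It is not the actual umd-arrest-ledger.rb: the regex-based cell extraction and the hash keys (arrest_number, arrested_at, case_number, charges) are assumptions made for illustration, and it only handles the seven cells shown in the fragment above.

```ruby
require "json"

# The table fragment quoted earlier in the post.
FRAGMENT = <<~HTML
  <td rowspan="2" valign="middle">19959</td>
  <td>01/01/15 02:46</td>
  <td align="center">2015-00000013</td>
  <td align="center" />
  <td align="center" />
  <td align="center" />
  <td colspan="5">(Driving, Attempting to drive) veh. while impaired by alcohol; (Driving, Attempting to drive) veh. while under the influence of alcohol</td>
HTML

# Matches either a self-closing <td ... /> (empty cell) or <td ...>text</td>.
CELL = %r{<td\b[^>]*?/>|<td\b[^>]*>(.*?)</td>}m

# Returns the text of every cell, with "" for self-closing cells.
def cells(html)
  html.scan(CELL).map { |match| match.first.to_s.strip }
end

# Hypothetical record layout; the key names are not taken from the real script.
def parse_arrest(html)
  number, timestamp, case_number, *_blank, charges = cells(html)
  {
    "arrest_number" => number,
    "arrested_at"   => timestamp,
    "case_number"   => case_number,
    "charges"       => charges.split(";").map(&:strip)
  }
end

puts JSON.pretty_generate(parse_arrest(FRAGMENT))
```

A regex is enough for a fragment this regular, but a real HTML parser would be the safer choice against the full page.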


All graphs come from the 2010-2015 dataset (as of 12/19/15).

Some raw numbers:

Age, Sex, and Race


A slim majority (2080, 50.4%) of cases provided no age.

2047, or 49.6% of all arrests, had documented ages:

Unsurprisingly, the vast majority of arrests in a college town were found in the adolescent to young adult range. Relatively few teenage and middle-aged arrests were made, suggesting that the UMPD generally leaves town policing to the PGPD.

Broken down further, 1458 (82.9%) of the arrests in the 17-33 range were in the average undergraduate/graduate age range (17-25).
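The documented/undocumented split above is simple to reproduce. The following is a rough Ruby sketch; the "age" key is an assumption about the JSON layout, and the sample records are made up purely to exercise the function.

```ruby
# Percentage of `count` out of `total`, rounded to one decimal place.
def pct(count, total)
  (100.0 * count / total).round(1)
end

# Split records by whether `field` is filled in (nil and blank count as missing).
# The "age" key used below is an assumed field name, not the script's real one.
def presence_split(records, field)
  documented, missing = records.partition { |r| !r[field].to_s.strip.empty? }
  {
    documented: documented.size,
    missing: missing.size,
    documented_pct: pct(documented.size, records.size),
    missing_pct: pct(missing.size, records.size)
  }
end

# Made-up records, only here to show the call.
records = [{ "age" => "19" }, { "age" => "" }, { "age" => nil }, { "age" => "22" }]
p presence_split(records, "age")

# The counts reported above, against the 4127-arrest total (2080 + 2047).
puts pct(2080, 4127) # share with no age
puts pct(2047, 4127) # share with a documented age
```

The same function would cover the sex and race breakdowns by passing a different field name.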


A substantial minority (644, 15.6%) of cases provided no sex.

The other 3483 cases (84.4%) had documented sexes:

The ratio of male arrests to female arrests was roughly 5:1.

National arrest rates from 1990 to 2010 show a male-to-female arrest ratio of roughly 3:1, making the UMPD ratio noticeably higher than the national one.

Explanations for this discrepancy vary (read: I don’t have any good ones). My best guess is that the UMPD tends to be lenient towards petty crimes committed by women, but further analysis is required to confirm that.


A substantial minority (653, 15.8%) of cases provided no race.

The other 3474 cases (84.2%) had documented races:

45.5% of documented-race arrests were Black, while 49.4% were White. Another 4.5% of documented arrests were Asian or Pacific Islander.

According to the Office of Institutional Research, Planning & Assessment’s Undergraduate Student Profile, 12.8% of undergraduates were Black, compared to 51.7% White and 16.7% Asian.

The 2010 Census for College Park, MD showed comparable percentages for the town population (14.3% Black, 63% White, 12.7% Asian).

Based on these numbers, Black students and town citizens were significantly overrepresented in arrests from 2010 to 2015, at over 3 times their percentage of the population. This number is similar, unfortunately, to nationwide racial arrest statistics (see national arrest rates above).
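The "over 3 times" figure is just the arrest share divided by the population share. As a quick check in Ruby, using only the percentages quoted above (the function name is made up for this example):

```ruby
# Ratio of a group's share of arrests to its share of the population.
# Both arguments are percentages; a result above 1.0 means overrepresentation.
def representation_ratio(arrest_pct, population_pct)
  (arrest_pct / population_pct).round(2)
end

puts representation_ratio(45.5, 12.8) # vs. undergraduate share, prints 3.55
puts representation_ratio(45.5, 14.3) # vs. College Park share, prints 3.18
```

Note that the arrest share is a percentage of documented-race arrests only, so this treats the 15.8% of cases with no recorded race as if they followed the same distribution.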

Whites were slightly overrepresented, possibly because the UMPD’s arrest records do not distinguish between Caucasians and Hispanic Americans/Latinos (who make up 9.3% of the campus population and 11.9% of the town population).

Arrest Distribution

Determining the actual distribution of arrests was slightly trickier.

Although the records followed a standard arrest reason format, the reasons often overlapped or had similar language. For example, both DUI/DWI and unlicensed driving arrests make use of the phrase “(Driving, Attempting to drive)”.

As a result, I resorted to some ugly regular expressions to get an approximation of common arrests. Since multiple charges are recorded in a single arrest, there is a certain level of overlap in these figures:
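To illustrate the approach (and the overlap), here is a small Ruby sketch. The patterns are hypothetical stand-ins, not the ones actually used for the figures here, and only the first sample string comes from the ledger; the other two are invented.

```ruby
# Hypothetical category patterns; the real ones are not shown in this post.
CATEGORIES = {
  "DUI/DWI"                     => /impaired by alcohol|under the influence/i,
  "Possession of Marijuana"     => /marijuana/i,
  "Possession of Paraphernalia" => /paraphernalia/i
}.freeze

# Counts how many arrests match each category.
# One arrest can match several categories, so the counts may sum to more
# than the number of arrests.
def tally(charge_strings)
  counts = Hash.new(0)
  charge_strings.each do |charges|
    CATEGORIES.each do |name, pattern|
      counts[name] += 1 if charges.match?(pattern)
    end
  end
  counts
end

sample = [
  # Taken from the ledger fragment shown earlier.
  "(Driving, Attempting to drive) veh. while impaired by alcohol; " \
  "(Driving, Attempting to drive) veh. while under the influence of alcohol",
  # Invented charge strings, for illustration only.
  "Possession of marijuana; Possession of paraphernalia",
  "Possession of marijuana"
]

tally(sample).each { |name, count| puts "#{name}: #{count}" }
```

Three arrests produce four category hits here, which is the kind of overlap to keep in mind when reading the figures.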

DUI/DWI, Possession of Paraphernalia, and Possession of Marijuana dominate, the first possibly being an unfortunate consequence of College Park being a very pedestrian-unfriendly town.

The latter two don’t particularly surprise me either (just about everybody at UMD has a story about a friend or hallmate being busted). Similarly, the figures for Underage Possession (of Alcohol) come with the territory; it’s college.

The 34 records of “No Charges” were slightly surprising to me, as it means that those 34 people were arrested on suspicion alone (which isn’t a very strong standard).

Concluding Notes and Future Analysis

First of all, it’s important to note that this was not an analysis of confirmed crimes or convicted criminals. The arrest data says nothing about the legal status of the arrested or the proceedings of their trials. It’s often easy to forget that arrest and conviction are not the same.

This was a fairly cursory analysis of the data I scraped, and I’m certain that I left plenty of stones unturned.

In particular, it would be interesting to analyze arrest types as a function of date/time of year.

The relationship between arrest types, age, and time of year (particularly the school semesters) could also be interesting.

I leave these to someone whose interest in data science exceeds mine and who can operate analysis tools more complex than Ruby and Gnuplot.
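As a possible starting point for the time-of-year question, here is a short Ruby sketch that buckets arrests by month. It assumes timestamps in the "MM/DD/YY HH:MM" shape seen in the ledger's table cells; the function name and the second and third sample timestamps are made up.

```ruby
require "date"

# Counts arrests per calendar month (1-12), ignoring the year.
# Expects timestamps like "01/01/15 02:46".
def arrests_per_month(timestamps)
  counts = Hash.new(0)
  timestamps.each do |stamp|
    month = DateTime.strptime(stamp, "%m/%d/%y %H:%M").month
    counts[month] += 1
  end
  counts.sort.to_h
end

# The first timestamp is from the ledger fragment; the rest are invented.
p arrests_per_month(["01/01/15 02:46", "01/17/15 23:10", "09/05/14 01:30"])
```

Swapping `.month` for `.hour` or `.wday` would give time-of-day or day-of-week buckets instead.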

If you’d like to do your own analysis of the data (and I highly encourage you to!), I’ve bundled my script and scrapes (as well as Gnuplot scripts and data files) for download here.

I also wrote a companion script, umd-incident-logs.rb, which performs the same kind of JSON dump for incident reports (not arrests). I didn’t use its data in this blog post, but it could definitely be used to add detail and/or context to the arrests.

- William