Thank you to everyone who attended today’s informational session about the Stanford Computational Journalism Lab. If you weren’t able to come by, feel free to sign up for our mailing list, and/or get in contact with us via email and social media.
During the presentation, I had a slide listing all the datasets I could think of in 5 minutes that were public, important, “big” (100,000 records or more), and “easy” to get – at least compared to the days when you had to send a government agency a shrink-wrapped hard drive:
Not all the datasets are literally just one-click-to-download – you may have to re-read the wget manual or write a quick scraper – but they are freely accessible to anyone with an Internet connection. Here’s the list of datasets with URLs:
- Daily pollution readings for every EPA sensor
- Florida inmate population
- Texas inmate population
- The time of each U.S. major air carrier flight since 1985 and how late it was
- Every fatality from a traffic accident since 1975
- Every reported vehicle defect, investigation, and recall in the U.S.
- Every campaign donation to a U.S. federal candidate
- Every NYC taxi ride in the past 5 years including how much it cost + tip
- SFPD police reported incidents
- Dallas police reported incidents
- Los Angeles police reported incidents
- Chicago police reported incidents
- Medicare payments for every doctor for every procedure they performed and billed for
- Total number of prescriptions dispensed by every doctor and drug in Medicare’s Part D program
- All registered Congressional lobbyists and what issues they lobbied on
- Everyone who has visited the White House since December 2009
- The race, age, and sex of everyone the NYPD stopped on the street and frisked
- Every workplace fatality [that resulted in an investigation] and workplace inspection
- Every report of an injury or fatality believed to be related to an FDA-regulated drug
- California schools SAT, ACT, AP scores
- California schools standardized test scores
- California schools demographics
- California schools vaccination rates
- Everything and everyone paid for by Congressional office funds
Those are just what I could fit into the slide. There are many more, of course:
- All U.S. registered clinical trials
- California state payroll
- Florida state payroll
- New York state payroll
- New Jersey state payroll: centralized system and non-centralized system
- New York City restaurant inspections
- Nationwide crime and stop-and-search reports for the UK
- Congressional bills and roll call votes
- Earthquake events detected by the U.S. Geological Survey network
- Surplus military equipment distributed from the Pentagon to civilian law enforcement
- 311 requests for San Francisco, Dallas, Miami-Dade County, Los Angeles, Chicago, and New York
- Recalls and civil/criminal penalties via the U.S. Consumer Product Safety Commission
- Payments made by health care companies to doctors
- New York subway turnstile data
- A variety of climate data from the National Oceanic and Atmospheric Administration, in API and archive form.
- A survey containing 10% of airfare tickets for major U.S. airlines.
- Student completion and earnings for all undergraduate degree-granting U.S. colleges from 1996 to 2015
- Baby names by year, sex, and state, according to the Social Security Administration
- All U.S. federal contracts, grants, and loans
- NYPD motor vehicle collisions
- Seattle Real Time Fire 911 Calls
- NIST’s national database of software security flaws
- Reported incidents of civilian aircraft hitting birds and other animals
- Frequently Occurring Surnames from the Census 2000
Many cities and states are (with, in my opinion, surprising enthusiasm) uploading their machine-readable data to Socrata, which provides easy CSV and JSON data exports; here are the Socrata portals for San Francisco and New York City.
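As a rough sketch of what those exports look like in practice, the snippet below builds a download URL for a Socrata-hosted dataset. The `/resource/<id>.<format>` path and the `$limit` parameter follow my understanding of Socrata's SODA API; the dataset ID `abcd-1234` and the function name `socrata_export_url` are made up for illustration, so substitute a real ID from the portal you're using and check the portal's own API docs before relying on this.

```python
from urllib.parse import urlencode


def socrata_export_url(domain, dataset_id, fmt="csv", limit=1000):
    """Build an export URL for a Socrata dataset (assumed SODA URL pattern).

    domain      -- portal hostname, e.g. "data.sfgov.org"
    dataset_id  -- the dataset's ID on that portal (hypothetical below)
    fmt         -- "csv" or "json"
    limit       -- maximum number of rows to request
    """
    if fmt not in ("csv", "json"):
        raise ValueError("fmt must be 'csv' or 'json'")
    # Keep the "$" in "$limit" unescaped so the URL stays readable.
    query = urlencode({"$limit": limit}, safe="$")
    return f"https://{domain}/resource/{dataset_id}.{fmt}?{query}"


# "abcd-1234" is a placeholder, not a real dataset ID.
url = socrata_export_url("data.sfgov.org", "abcd-1234", fmt="csv", limit=5)
print(url)
```

The function only builds the URL and makes no network request, so you can paste the result into a browser or hand it to wget once you've swapped in a real dataset ID.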
The examples above include only the machine-readable datasets that come directly from purported official sources. This leaves out independently curated data that pertains to public affairs, such as the Supreme Court Database. For federal lawmaking and lawmaker data, the sheer size of Sunlight Foundation’s projects page and GovTrack’s bulk data hint at the breadth of their data-parsing efforts, which have created comprehensive databases from text intended for dead trees.
It’s also worth knowing about ProPublica’s Data Store, not just because ProPublica is my former employer, but because its catalog contains free downloads of different years and versions of various datasets mentioned above. For example, ProPublica FOIAed the Medicare Part D Prescribing Data for 2011 and 2012, which it used in its Prescriber Checkup project. The Centers for Medicare & Medicaid Services has only posted the 2013 data.
Natural language and data mining
This is a little off-topic, but people usually also want datasets that can be used for machine learning and natural language processing. That’s not my particular expertise, but I’ll list a few examples that I know about.
There are plenty of private datasets for data mining: Yelp’s Academic Dataset is probably one of the easiest one-click datasets for interesting text tied to categories and sentiment (i.e. star reviews). And the Internet Movie Database has plaintext dumps of some of their data (though not reviews). And I’ve always wanted to more closely examine the types of questions asked (and incorrectly answered) on Jeopardy; here’s a scrape of j-archive.com. Visit r/datasets for a variety of independently collected datasets, including the corpus of 1.7 billion comments.
The Stanford Network Analysis Project has a large number of datasets geared towards network analysis, including the Enron email dump. If you’re interested in a more recent, albeit one-man, network, check out Jeb Bush’s emails, which he released in response to requests as he began his presidential run.
There are plenty of public text sources besides disclosed emails. But many require an intermediate level of programming ability and a non-trivial amount of patience to collect. If you can muster that, transcripts and rulings (now with diffs!) aren’t terribly difficult to scrape from the U.S. Supreme Court website. Press releases from lawmakers and agencies are another plentiful source. My colleague Justin Grimmer in the Political Science department makes it a habit to practice reproducible research: here’s a Github repo of the 72,000+ U.S. Senate press releases he collected, and here’s his resulting paper: A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. (2010).
Sunlight has an API for Congressional speeches. And the Federal Register has an API for the text in its rule-making and notices. And Regulations.gov has an API to track not only all of the pending federal regulations but all of the public comments submitted – not sure why I don’t see much (proclaimed) use of it given all the very timely information it potentially contains.
This is barely scratching the surface of freely available (and mostly public) data – I didn’t even mention the federal data.gov portal or the U.S. Census data. I’ve highlighted these datasets because they have plenty of rows while being relatively self-contained and (usually) having decent, if not fun-to-read, documentation.
(And one more example, just because it’s too big to not mention: Amazon’s AWS Public Data Sets page is an overwhelming collection of massive and free data sets.)
Prepare to research
I don’t claim that these datasets are easy to analyze, or that they are as complete as they purport to be. In fact, prepare to be frequently frustrated when trying to conduct even the most straightforward of analyses.
For example, a simple aggregate count of the White House visitor records on the visitee_namefirst fields (i.e. the name of the White House official being visited) would seem to yield clues about who is “important” at the White House. That breaks down if visitors are listed as visiting an official’s scheduling assistant, or if the most powerful White House officials hold their meetings outside of the White House, as the New York Times reported:
“They’re here all the time — all day,” Andre Williams, a manager at Caribou Coffee, said of his White House customers. (He can spot White House officials by the security badges around their necks, or the Secret Service agents lurking nearby.)
“A lot of them like lattes — that or a ‘depth charge,’ a coffee with a shot of espresso,” Mr. Williams said. “The caffeine rush — they need it.”
Some administration officials and lobbyists say that meeting away from the White House allows officials to get some air without making visitors go through the cumbersome White House security process. Others, however, acknowledge that one motivation is the desire to avoid lobbyists’ names showing up too often on the White House logs.
Despite these inherent problems of the dataset, enterprising news organizations can find interesting bits when they know how to focus their search, such as Politico’s report on secret visits by The Daily Show’s Jon Stewart.
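To make the “simple aggregate count” mentioned above concrete, here is a minimal sketch in Python. It assumes the visitor records come as a CSV with a `visitee_namefirst` column (named in the text above) and a `visitee_namelast` column (my assumption about the file layout). The rows are made up for illustration and are not real visitor records.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; not real visitor records.
SAMPLE_CSV = """namelast,namefirst,visitee_namelast,visitee_namefirst
Doe,Jane,Smith,Alex
Roe,Richard,Smith,Alex
Poe,Edgar,Jones,Sam
Doe,John,,
Moe,Mary,smith,alex
"""


def count_visitees(csv_file):
    """Tally visits per visitee, normalizing case and skipping blank names."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        first = (row.get("visitee_namefirst") or "").strip().title()
        last = (row.get("visitee_namelast") or "").strip().title()
        if not first and not last:
            continue  # no visitee recorded for this visit
        counts[(last, first)] += 1
    return counts


counts = count_visitees(io.StringIO(SAMPLE_CSV))
for (last, first), n in counts.most_common():
    print(f"{first} {last}: {n}")
```

Even in this toy version, the count depends on cleanup choices: blank visitee names are dropped and inconsistent capitalization is merged. With the real data you'd swap the `io.StringIO` sample for an open file handle, and the caveats above (scheduling assistants, off-site meetings) still apply to whatever numbers come out.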
As we mentioned in our presentation, computational journalism requires not just technical skill, but the ability to conduct in-depth research and investigation of the institutions and the processes that produce these datasets. This domain knowledge is needed for effective and sane data-wrangling. However, don’t let that intimidate you – even just knowing the existence of these datasets should be enough to inspire fun explorations and project ideas.
If you find something interesting, we’d love to hear from you. And if studying and practicing computational journalism at Stanford interests you, take a look at the courses we offer in the winter and spring.