This is a list of large datasets that I find interesting enough to use for ongoing projects and investigations, and granular enough to slice up for teaching examples.

For reasons that will become obvious if you peruse the data sources' homepages, I don't list direct links to download the data. Like everything else about data, the storing and distribution of data is more complicated than it seems. Some of the datasets here are not to be publicly distributed. Others don't come in easily importable files. Still others are time-sensitive to the collection process. You can get a taste of the problem by visiting my separate datajanitor/diaries repo, which includes the collection process for some of these sets.

I may refine the listing by adding topics (e.g. politics, entertainment) and categorizations (e.g. 'big', 'very big') and references to the extract-transform-load process as necessary. For now, this is just a draft list to help me organize data useful to me and my students.

San Francisco 311 Calls

311 calls to San Francisco from July 1, 2008 to present. Contains location information, responsible agency, and opened and closed dates.

(via City of San Francisco)

Census Reporter

The best way to explore the dense, multi-faceted data produced by the U.S. Census.

(via United States Census data, as organized by the CensusReporter project)
Yelp Dataset Challenge

Reviews, users, businesses, check-ins, and tips for the cities of Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh.

(via Yelp)
USGS Earthquake Archive

Create a CSV of earthquakes and customize the timeframe and geography parameters.

(via United States Geological Survey)
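
As a rough illustration of what "customizing the timeframe and geography parameters" can look like, here is a minimal sketch that only builds a query URL and does not download anything. It assumes the USGS FDSN event web service endpoint and its `starttime`, `endtime`, `minmagnitude`, and latitude/longitude bounding-box parameters; the function name `build_earthquake_csv_url` and the example dates are my own, and the service's documentation should be checked before relying on any of this.

```python
from urllib.parse import urlencode

# Assumed endpoint of the USGS FDSN event service; verify against the USGS docs.
USGS_QUERY_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_earthquake_csv_url(start, end, min_magnitude=None, bbox=None):
    """Return a query URL for a CSV of earthquakes (hypothetical helper).

    start, end: ISO dates such as "2014-01-01"
    bbox: optional (min_lat, max_lat, min_lon, max_lon) tuple
    """
    params = {"format": "csv", "starttime": start, "endtime": end}
    if min_magnitude is not None:
        params["minmagnitude"] = min_magnitude
    if bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        params.update({
            "minlatitude": min_lat,
            "maxlatitude": max_lat,
            "minlongitude": min_lon,
            "maxlongitude": max_lon,
        })
    return USGS_QUERY_URL + "?" + urlencode(params)


# Example: one month of magnitude 4.5+ earthquakes, worldwide.
print(build_earthquake_csv_url("2014-01-01", "2014-02-01", min_magnitude=4.5))
```

Keeping the URL construction separate from the download makes it easy to log exactly which timeframe and region a given CSV came from.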
U.S. Congress social media activity

Social media accounts of sitting U.S. legislators. Only contains account IDs, not actual data.

(via Crowdsourced)
IMDB actor and movie data

Massive plain text data dumps for various fields of info collected by IMDB, including actor biographies, movie box-office performance, and cast lists.

(via IMDB)
FBI Uniform Crime Reports

A collection of various categories of crime statistics from data received by participating law enforcement agencies. The FBI’s data tool is a little unwieldy; raw data sets can be obtained through the National Archive of Criminal Justice Data.

(via Department of Justice)
Las Vegas restaurant inspection data

Results of health inspections and violations found

(via Southern Nevada Health District)
U.S. Baby Names

Counts of each name given to U.S. babies, by year and state

(via Social Security Administration)
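
To give a sense of how granular this data is, here is a minimal sketch that finds the most common name per state, year, and sex. It assumes headerless, comma-separated records in the order state, sex, year, name, count, which is how I understand the state-level files to be laid out; the function name `top_names` is hypothetical and the sample rows are made up for illustration.

```python
import csv


def top_names(lines):
    """Map (state, year, sex) -> (name, count) for the most common name.

    `lines` is any iterable of text lines, such as an open file.
    Assumed record layout: state,sex,year,name,count (no header row).
    """
    best = {}
    for state, sex, year, name, count in csv.reader(lines):
        key = (state, int(year), sex)
        count = int(count)
        # Keep the first name seen with the highest count for each group.
        if key not in best or count > best[key][1]:
            best[key] = (name, count)
    return best


# Made-up sample rows, not real counts.
sample = [
    "CA,F,2013,Sophia,100",
    "CA,F,2013,Emma,90",
    "NY,M,2013,Jacob,80",
]
for key, (name, count) in sorted(top_names(sample).items()):
    print(key, name, count)
```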
Department of Defense Excess Property Program

Line-item listing of military equipment distributed to civilian law enforcement agencies since 2006.

(via Department of Defense, collected by National Public Radio)
NYPD Stop and Frisk Data

Stops made by the NYPD from 2003 to present. Fields include geographical coordinates, race of the stopped person, and cause for and result of the stop.

(via New York Police Department)
New York Times Bestsellers List

Rank history for specific titles

(via New York Times)
New York taxi rides

Fields include time and location of pickups and dropoffs, as well as trip cost and distance. Raw data must be requested via FOIL. Read Chris Whong’s "FOILing NYC’s Taxi Trip Data."

(via NYC Taxi and Limousine Commission)
Census Quickfacts

Summary population data and demographics per state and county.

(via U.S. Census)
Car complaints, defects, investigations, and recalls

Contains records and narratives of accidents in which a car defect is suspected as a cause. Includes any subsequent investigations and recalls.

(via NHTSA, Office of Defects Investigation)
Car safety ratings

Safety ratings, features, and characteristics of cars sold in the U.S.

(via NHTSA)
California payroll data

Salary and compensation information for public entities in California, including 58 counties, more than 450 cities, more than 2,900 special districts, more than 100 higher education providers, and most state employees

(via California State Controller's Office)
New York payroll data

Payroll information for city employees

(via New York City Comptroller)
U.S. airline on-time statistics

Percentage of late flights, by airline, origin, and destination

(via Bureau of Transportation Statistics)
U.S. airline ticket costs

Sample of 10% of airline itineraries

(via Bureau of Transportation Statistics)
U.S. accident and incident data

Preliminary report and investigation data from air carrier safety incidents.

(via Federal Aviation Administration)
Wildlife Strike Database

Voluntary reports of aircraft collisions with wildlife.

(via Federal Aviation Administration)
Massachusetts transportation data

Anonymous vehicle-use data in Massachusetts released as part of a hackathon

(via Massachusetts Department of Transportation)

CrunchBase

This comprehensive, crowdsourced database of company information, including notable employees, investors, and funding rounds, has accompanied TechCrunch’s coverage over the years. It’s not perfect, but it’s as detailed (and as widely used) a business dataset as you’ll find for startups and acquisitions.

(via TechCrunch via crowdsourcing)
NYC 311 calls

Millions of complaints about potholes and loud music in the city that never sleeps.

(via New York City)
Year of Gun Deaths, 2013

After the Newtown school shooting, Slate began a crowdsourced interactive to collect all gun-related deaths in America for the year 2013.

(via Slate and crowdsourcing)
NYC restaurant inspection data
(via New York City health department)
White House Visitors Log

The names of just about everyone who has visited President Obama’s White House, when they visited, and who they met.

(via White House)
Chicago crime incidents
(via Chicago Police Department)
San Francisco crime incidents

Incidents reported via the SFPD CABLE System, from 1/1/2003 to present. Available in KML format (past 90 days), Shapefile format (by year), and CSV format (all records, more than 800,000).

(via San Francisco Police Department)
NYC subway turnstile data

How many people enter and exit New York’s subways

(via Metropolitan Transportation Authority)
Payments made to doctors from medical companies

A federally mandated listing of payments from drug and device companies to doctors and teaching hospitals. ProPublica has a good list of the caveats.

(via Centers for Medicare & Medicaid Services)
Food and drug recalls

A list of press releases of recalls for FDA-regulated products

(via Food and Drug Administration)
Drug adverse effects reports

A database of voluntary reports of adverse side effects and medication errors by healthcare professionals and consumers

(via Food and Drug Administration)
Federal employment opportunities

A JSON feed of federal job openings, including description and salary range

(via USAJobs)
California lobbying
U.S. federal campaign finance

Who is giving and who is receiving campaign contributions

(via Federal Election Commission)
U.S. federal lobbying

Documents relevant to the Lobbying Disclosure Act

(via Senate Office of Public Records)
NOAA Tides and Currents API

Provided as an API allowing you to select all kinds of measurements for different stations and timeframes

(via NOAA's Center for Operational Oceanographic Products and Services)
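
Selecting "different stations and timeframes" usually means issuing one request per station per date window. The sketch below only generates request URLs and fetches nothing. The endpoint, the parameter names (`station`, `begin_date`, `end_date`, `product`, `datum`, `units`, `time_zone`, `format`), and the 31-day window are my assumptions about the CO-OPS data API, and the station IDs are just example values; confirm all of them against the API documentation.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumed CO-OPS data retrieval endpoint; verify before use.
COOPS_URL = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"


def date_windows(start, end, max_days=31):
    """Yield (begin, end) date pairs covering start..end inclusive."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


def tide_request_urls(stations, start, end, product="water_level"):
    """Yield one request URL per station per date window (hypothetical helper)."""
    for station in stations:
        for begin, finish in date_windows(start, end):
            params = {
                "station": station,
                "begin_date": begin.strftime("%Y%m%d"),
                "end_date": finish.strftime("%Y%m%d"),
                "product": product,
                "datum": "MLLW",
                "units": "metric",
                "time_zone": "gmt",
                "format": "json",
            }
            yield COOPS_URL + "?" + urlencode(params)


# Example station IDs (assumed); two stations over two months gives four URLs.
for url in tide_request_urls(["9414290", "8518750"], date(2013, 1, 1), date(2013, 3, 1)):
    print(url)
```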
Medicare Provider Utilization and Payment Data

A 1.7GB file of 9 million records on which doctors were reimbursed for which procedures in 2012.

(via Centers for Medicare & Medicaid Services)
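
A file of this size is awkward to open in a spreadsheet, but it can be summarized by streaming it one row at a time. The following is a minimal sketch of that approach using only the standard library. The column names `provider_id` and `payment`, the tab delimiter, and the helper name `total_by_key` are placeholders of mine, not the file's actual schema, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def total_by_key(lines, key_field, amount_field, delimiter=","):
    """Sum amount_field per key_field while reading one row at a time.

    Memory use grows with the number of distinct keys, not the number of rows.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines, delimiter=delimiter):
        totals[row[key_field]] += float(row[amount_field])
    return dict(totals)


# Made-up, tab-delimited sample standing in for the real multi-gigabyte file.
sample = io.StringIO("provider_id\tpayment\nA\t10.0\nB\t5.5\nA\t20.0\n")
print(total_by_key(sample, "provider_id", "payment", delimiter="\t"))
```

For the real file, the same function could be passed an open file handle, for example `open(path, newline="")`, in place of the `StringIO` object.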
Harvard and MIT MOOC Student Data

Anonymized data containing records of individuals’ activities in edX courses.

(via Harvard-MITx)
Jeopardy! game data

A crowdsourced database of Jeopardy! questions and answers and performances.

(via J! Archive)
Education Common Core

A set of fiscal and non-fiscal data about each public school in the United States

(via U.S. Department of Education, National Center for Education Statistics)
Music and artists data

Includes info and taxonomy about musicians, albums, tracks, and performances

Clinical research trials

A registry and results database of publicly and privately supported clinical studies of human participants conducted around the world.

(via U.S. National Institutes of Health)
Congressional record

Who’s in Congress, what they’ve proposed and what they’ve said

(via U.S. Congress)