COMM 273D | Fall 2014

Tuesday, November 11

Dirty data, cleaned dirt cheap

One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.


  • Dirty data
  • OpenRefine
  • Clustering


  • Clean up a dataset via data clustering Due by next class

Jump to the full details on homework assignments

A special note about data cleaning: This week's lessons will probably drive you crazy. It's more important than ever to think: Why am I doing this? What problem am I trying to solve? What is the kind of thing that I'm trying to understand?

Don't get too involved in seeing things as an arbitrary mix of tools and techniques to memorize, or you're going to want to quit civilized life to live in the mountains.

Data "civilians"

From the New York Times, "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights", Aug. 17, 2014:

The field known as “big data” offers a contemporary case study. The catchphrase stands for the modern abundance of digital data from many sources — the web, sensors, smartphones and corporate databases — that can be mined with clever software for discoveries and insights. Its promise is smarter, data-driven decision-making in every field. That is why data scientist is the economy’s hot new job.

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

Cleaning names for "Dollars for Docs"

My biggest project at ProPublica was Dollars for Docs. A key component to our success was just being good at cleaning data.

Across all the different pharma companies who disclose financial relationships with doctors, a doctor's name could be spelled in many different ways. So, how to find doctors which have several relationships with different companies?


Cleaning with OpenRefine

Using a Pivot Table or GROUP BY in SQL won't quite work:


Note: This is a relatively simple example. Imagine a database of tens of thousands of names with all manner of variations in spelling, nevermind typos. Hand-editing with a spreadsheet won't cut it.

So we use OpenRefine to cluster the names.

Two tutorials I've quickly put together to look at:

Other resources:

Cleaning and categorization by OpenSecrets, by Center for Responsive Politics, purports to show how much money a political candidate gets from a particular industry:


But the data looks like this:


Here's how they explain their methodology:

About the data in these industry profiles

These profiles show the results of more than two decades of research by the Center for Responsive Politics on the funding of federal elections. While these are the most accurate numbers currently available, we constantly refine the data and will update the figures as often as possible. Totals for the current election cycle are typically updated every three to four weeks. Data from earlier years is updated as time and resources allow.

All numbers attributed to a particular industry can be assumed to be conservative. Tens of millions of dollars of contributions in each election cycle are not classified by industry at all — either because the original data is incomplete or too vague to categorize, or because of limitations on the Center's ability to fully research the millions of individual contributions given over the years.

The Center is the only organization that invests in categorizing campaign contributions by industry in a way that includes individuals' contributions, not just money from political action committees (PACs). Here's the logic behind our methodology, whether it's the oil/gasindustry, pharmaceutical industry or any of the hundreds of industries and interest groups we track on this site: Since corporations and other organizations are prohibited from making direct political contributions from their treasuries to political candidates, you have to look at the contributions from people associated with the institution to gauge its political persuasion and how it might be trying to exert influence in Washington.

Working with FEC data

Let's get a taste of the mess that OpenSecrets tries to clean up.

The FEC provides zipped CSV files by U.S. state of individual contributions to presidential candidates.


Here's the direct links to the Iowa individual contributions files:

You can use this data for the homework

Random notes

Fun story from the Center for Investigative Reporting:

RIVERSIDE, Calif. – Mike Soubirous is a prodigious water user, pumping more than 1 million gallons per year at his lushly landscaped home on a hot, windy Southern California hilltop.

Soubirous also is a member of the Riverside City Council, which in July voted unanimously to impose tough new water conservation rules in this desert city of 317,000.

Last month at Riverside Councilman Mike Soubirous’ home, sprinklers were seen running seven nights in a row. Credit: Stuart Palley for CIR Yet as California’s drought worsened from 2012 to 2013, he consumed enough water to supply eight California households – more than any other top water official in the state, records show.

The source of the records:

Water bills obtained via the state’s Public Records Act show that in 2013, nearly half of the officials who supervise the state’s biggest water agencies used more water than the typical California household.

On the subject of the midterm and LESO data

I mentioned this in the answers, but Florida Today did a great investigation of how Brevard County uses military surplus to supplement its equipment:

More than 27 years ago, in April 1987, William Cruse armed himself with an assault rifle and other weapons and stalked through two Palm Bay grocery stores, killing two police officers and four civilians. It is considered the worst mass-shooting in Brevard County history, and officials from two agencies interviewed by FLORIDA TODAY cited the incident in explaining the need for high-powered military weapons and armored trucks for protection.

Put simply: The officers who responded that day were out-gunned.

Palm Bay Police Chief Doug Muldoon said the officers were armed with revolvers and shotguns and one of the slain officers was killed while reloading.

The Brevard County Sheriff's Aviation Unit, a five-chopper fleet that rescues stranded individuals and searches for suspects, is made up entirely of military surplus. One of the helicopters was shot down twice in Vietnam, Chief Pilot John Coppola said.

The most recent addition is a UH-1H Huey chopper that arrived in May 2013 and will be used to help in firefighting and rescue operations. Coppola said it cost $2,000, but it needed some work. Replacing the machine's rotor blades and rotor shaft, as well as other maintenance, cost $12,000, Coppola said. To buy retail, he estimated it would have cost closer to $400,000.

Uber and FEC expenditures data

VICE Motherboard + data journalism - Reporter Kaleigh Rogers cites an analysis by consulting firm Hamilton Place Strategies: Congress Is Taking More Uber Rides than a Drunken Undergrad with a Tax Rebate

HPS pulled data from Federal Election Committee and found members of Congress took about 2,800 Uber rides for less than $100 so far in 2014, compared to just 1,800 taxi rides. That’s a huge increase compared to 2012, when Uber was in its infancy: Congress only took about 100 Uber rides under $100 and about 2,800 low-cost taxi rides.

Chart via Hamilton Place Strategies:


HPS pulled data from Federal Election Committee and found members of Congress took about 2,800 Uber rides for less than $100 so far in 2014, compared to just 1,800 taxi rides. That’s a huge increase compared to 2012, when Uber was in its infancy: Congress only took about 100 Uber rides under $100 and about 2,800 low-cost taxi rides.

It’s a notable shift, considering Uber—along with Lyft, Airbnb and other so-called “peer-to-peer” businesses—are lobbying Congress and other levels of government to ease up regulations and allow their business model to thrive. Meanwhile, groups that represent more traditional businesses (like the New York Taxi Workers’ Alliance) are trying to get the government to put a leash on the competition.

Note the data-journalism-mentality behind this analysis:

  • Look at the present data (2014 expenditures)
  • Look at past data (2010 and 2012)
  • Look at a specific slice of data relevant to the topic at hand (expenditures used for staffer transportation)
  • Compare two competing sub-slices of data: Taxi vs Uber rides
  • And, compare those sub-slices across all possible time frames to see a trend.


Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.