[Photo: Walking the unfinished third section of the High Line Park, New York]

Public Affairs Data Journalism

This is the homepage for the Stanford course COMM 273D, Public Affairs Data Journalism I as taught by Hearst Professional in Residence, Dan Nguyen.

If you're a casual visitor, you're probably most interested in the Tutorials. You can also visit the class sites for my two other electives this year, Computational Methods and Computational Journalism.

Fall 2015

Session: September 21 to December 4, 2015

Hours: Tuesdays and Thursdays, 1:30 to 2:50PM

Location: McClatchy Hall (Building 120), Room 410

The Fall 2015 site, currently under construction, will be at: http://2015.padjo.org

The Fall 2014 site is archived here: http://fall2014.padjo.org.

Contact: Check the course syllabus for office hours. You can also email me at my stanford.edu address: (dun).


Core concepts

Our primary goal is to learn how to argue with and against data. To understand the business of our government – including the power it wields over, or yields to, our institutions – we must understand data: the byproduct of that business and, often, its fuel.

We see data as a means of understanding and, when necessary, critiquing the "data-driven decisions" made in public affairs. Our focus is on concepts rather than technology, and on mathematical problem solving rather than statistical methods. The core of our work is ultimately the same as the core of traditional reporting: the initiative to question, the independent research of facts, and the desire to inform the public.

Our three main strategies:

  1. To systematically collect and count the data that, for various reasons, has not been officially collected, e.g. Homicide Watch and ProPublica's When Caregivers Harm
  2. To efficiently analyze large – but otherwise benign – datasets to find evidence of something significant, e.g. the Sun Sentinel's Above the Law: Speeding Cops, Reuters' Water's Edge, Center for Investigative Reporting's On Shaky Ground
  3. To join and compare datasets, e.g. all of the above

Techniques and tools


Knowing how to use the tools – and why they are needed – allows us to efficiently explore and analyze data ourselves, which is key to understanding data concepts in a meaningful way.

I've tried to keep the tools and programs as agnostic as possible:

Plain-text editor


There is no perfect data tool. So get used to importing and exporting data from spreadsheet to database to web application – and back. The common interface between all of these systems is just text.

This is why most of the data we'll encounter will be plain text, i.e. delimiter-separated values, or will be content (such as from PDFs) that we'll want to convert to plain text.

A plain-text editor (as opposed to a rich-text editor, such as Microsoft Word) can be the most effective way to explore and clean up a dataset before you attempt to analyze it with a spreadsheet or database.
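To make "delimiter-separated values" concrete, here is a minimal sketch using Python's built-in csv module; the agency names and salaries are made up for illustration:

```python
import csv
import io

# A tiny comma-delimited dataset, expressed as plain text (made-up data).
raw = """name,agency,salary
Alice,Public Works,52000
Bob,Parks,48000"""

# The delimiter (here, a comma) marks where one field ends and the
# next begins; csv.DictReader maps each row to the header-row names.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["agency"])  # Public Works
```

The same three lines of text could be pasted into a spreadsheet or imported into a database, which is the point: plain text is the common interface.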

We will cover a few text-based concepts: Markdown, for writing web- and human-friendly documents, and regular expressions, a kind of Find-and-Replace on steroids.
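As a taste of what "Find-and-Replace on steroids" means, here is a small sketch using Python's built-in re module; the text and phone numbers are invented for illustration:

```python
import re

# Normalize inconsistently formatted phone numbers into one style --
# the kind of cleanup an ordinary Find-and-Replace can't express.
text = "Call (650) 555-0123 or 650.555.0199 for details."
pattern = r"\(?(\d{3})\)?[ .-]?(\d{3})[.-](\d{4})"
cleaned = re.sub(pattern, r"\1-\2-\3", text)
print(cleaned)  # Call 650-555-0123 or 650-555-0199 for details.
```

One pattern handles both formats at once, which is exactly what makes regular expressions worth the learning curve.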

The best general-purpose, cross-platform text editor is Sublime Text, which has a 30-day free trial. For Windows, there is the free Notepad++. For Mac, there is the free TextWrangler.

Text-editor skills
  • The purpose of delimiters
  • Regular expressions

Spreadsheet


Spreadsheets are the most accessible way to explore, analyze, and visualize small- to medium-sized datasets (up to about 100,000 rows). They're also handy for just organizing and structuring your notes and thoughts (which can effortlessly be turned into "data" later). Modern spreadsheets typically have a layer of "data types": "July 4, 1776" is no longer plain text but a date, which allows the user to apply date-specific formulas, e.g. =WEEKDAY(some_cell_with_a_date) to get the day of the week.

(However, the assumptions made by a spreadsheet can sometimes destroy data, which is why it's sometimes necessary to deal with data as just text.)
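The "data types" idea can be sketched outside a spreadsheet, too; here Python's standard datetime module stands in for =WEEKDAY():

```python
from datetime import date

# Once "July 4, 1776" is typed as a date rather than as text,
# date-specific questions become one-liners.
independence = date(1776, 7, 4)
print(independence.strftime("%A"))  # Thursday
```

Either way, the payoff is the same: a typed value can be sorted, subtracted, and queried in ways a raw string cannot.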

In class, we'll be using Google Spreadsheets for their mostly-intuitive interface and online collaboration features. However, Microsoft Excel and the open-source LibreOffice can also be used.

Spreadsheet skills
  • Sorting and filtering
  • Data organization
  • Pivot tables and aggregation
  • Quick visualizations

Database


For most of our data analysis and exploration, a relational database will be overpowered and awkward. However, for joining datasets – i.e., two different tables that share a common column (such as an ID number) – there is no more direct way than Structured Query Language, i.e. SQL.

When you write SQL, you will technically be programming. However, to not get lost in the weeds, we'll only be covering a basic set of SQL commands, and will frequently be exporting data from a database into a spreadsheet.
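As a preview of the kind of join query we'll write, here is a sketch using Python's built-in sqlite3 module; the table names, columns, and rows are all made up for illustration:

```python
import sqlite3

# Two toy tables that share a common column (an ID number),
# joined and aggregated with a single SQL query.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE legislators (id INTEGER, name TEXT);
    CREATE TABLE tweets (legislator_id INTEGER, text TEXT);
    INSERT INTO legislators VALUES (1, 'Sen. Smith'), (2, 'Rep. Jones');
    INSERT INTO tweets VALUES (1, 'Voting today'), (1, 'Town hall at 5');
""")
rows = db.execute("""
    SELECT legislators.name, COUNT(*) AS n
    FROM legislators
    JOIN tweets ON tweets.legislator_id = legislators.id
    GROUP BY legislators.name
""").fetchall()
print(rows)  # [('Sen. Smith', 2)]
```

Because this is an inner join, Rep. Jones – who has no tweets – drops out of the result entirely; left joins, covered later in the course, keep such rows.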

We'll be accessing databases through graphical user interfaces. For all platforms, you can install the Firefox web browser and then the SQLite Manager plugin. On Mac, the best free database GUI is Sequel Pro.

Database skills
  • Inner and outer joins
  • Foreign key relationships

Optional tools:

  • OpenRefine (Macs and PCs) - OpenRefine looks like a spreadsheet but has a much narrower use case: cleaning data. But it's the best at what it does, and the difference between being average and good at data cleaning can easily make or break an entire project.
  • Mou (for Macs) - A Markdown editor that lets you preview your work.
  • GitHub desktop client - For when you tire of using the web interface.
  • Google Fusion Tables - a hybrid of online database and visualization tool. Particularly useful for simple mapping of large datasets.
  • ModeAnalytics - A kind of "GitHub for MySQL," allowing collaboration and sharing of datasets and queries via the web browser.

COMM 273D is the prerequisite for COMM 274D: Public Affairs Data Journalism II, taught in the winter by Cheryl Phillips.

I am also teaching two other classes – Computational Methods and Computational Journalism – both open to undergraduates outside of the journalism program.

How I see the relationship between the three courses I'm teaching:

  • Public Affairs Data Journalism I covers the work and process of journalism, as well as the tools and concepts needed to efficiently understand data – data that, for the most part, I will have extracted and cleaned for class use.
  • Computational Methods covers the problem-solving skills needed to collect and explore data for public accountability stories, as well as additional techniques for visualizing and publishing data journalism.
  • Computational Journalism focuses on using computational methods and problem-solving to produce a project with civic impact, whether it be a web application or via other mediums.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes.
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • Tuesday, October 14

    The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen DeGeneres?
  • Thursday, October 23

    A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • Thursday, November 6

    While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.