COMM 273D | Fall 2014

Thursday, November 6

Storytelling with Data Visualization

While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.

Topics

  • Review of the midterm
  • The importance of good data in visualizations
  • How visualization can augment the Serial podcast

Homework

  • Proposal for data project (due by next class)

Jump to the full details on homework assignments

Midterm answers

Here they are

A Golden Rule of Data Visualization

Try not to kill anyone with your visualization.

or: Would you use this data visualization if your life depended on it?

img

via The New York Times's Elizabeth Bumiller, "We Have Met the Enemy and He Is PowerPoint":

The slide has since bounced around the Internet as an example of a military tool that has spun out of control. Like an insurgency, PowerPoint has crept into the daily lives of military commanders and reached the level of near obsession. The amount of time expended on PowerPoint, the Microsoft presentation program of computer-generated charts, graphs and bullet points, has made it a running joke in the Pentagon and in Iraq and Afghanistan.

“PowerPoint makes us stupid,” Gen. James N. Mattis of the Marine Corps, the Joint Forces commander, said this month at a military conference in North Carolina. (He spoke without PowerPoint.) Brig. Gen. H. R. McMaster, who banned PowerPoint presentations when he led the successful effort to secure the northern Iraqi city of Tal Afar in 2005, followed up at the same conference by likening PowerPoint to an internal threat.

“It’s dangerous because it can create the illusion of understanding and the illusion of control,” General McMaster said in a telephone interview afterward. “Some problems in the world are not bullet-izable.”

Showing more information is never automatically "better" – we as humans typically have an (over)abundance of information, but suffer from a scarcity of attention. Hence, a sloppy, convoluted graphic concerning something vital should be considered as possibly life-threatening.

Data visualization and understanding

via Stephen Few, in "Now You See It: Simple Visualization Techniques for Quantitative Analysis"

We should never forget that a picture of data is not the goal; it's only the means. Information visualization is all about gaining understanding so we can make good decisions.

New York Times 2014 Live Senate Model

The NYT Upshot team produced the most enlightening and important of yesterday's Senate election graphics. At around 5PM, early returns in the Virginia Senate race showed Republican challenger Ed Gillespie with a lead of more than 10 points in the counted votes. This was astounding, considering that Democratic incumbent Mark Warner had consistently led the polls by 10 points or more. From Huffington Post's Pollster (which also had excellent visualizations):

img

Looking solely at the live coverage of the Virginia count, it would be easy to think that Gillespie was on his way to a major upset. However, the Upshot had prepared a Live Senate Model (subtitle: "Who is Really Winning the Senate So Far?") to add more nuance to the live counts. From their methodology page, titled "How to Watch the Elections Like an Expert" (emphasis added):

On election night, knowing who is ahead isn’t enough.

In states like Virginia, Georgia and elsewhere, cities have traditionally been among the last to report their votes. Because cities lean Democratic, this pattern means that returns for Democratic candidates will appear weaker than they actually are for a few hours after the polls close. Similarly, in states with slow-counting Republican areas, Republican candidates will seem weaker than they actually are.

The worst election analysts use these changes to tell comeback stories, asserting that a candidate was trailing early in the night but made up ground.

More sophisticated analysts interpret leads through the lens of the outstanding votes. “There’s a lot of votes left to be counted in heavily Democratic Cuyahoga County,” Jeff Greenfield said on CNN in 2004. “Remember, some of the votes outstanding are down here in Marion County where Obama is winning,” John King said on the same network in 2008.

This year, The Upshot will aim to let you be your own John King. In about a dozen of the closest Senate races, we, like many others, will track the leads reported by The Associated Press. But we will also adjust those leads based on what we know about where the votes have come from. Our adjusted leads will be based solely on current and historical returns. They will not use data from exit polls, or any forecasts from Senate models. You’ll be able to find a link to the tracker on The Times’s midterm page and The Upshot’s home page.

This is a pretty concise explanation of their intent. But to drive the point home, they offer a visualization of how their model would have worked had the Upshot existed in 2012 and been tracking the Virginia results for the Obama/Romney race. In that race, Romney led Obama by more than 20 percentage points at 8PM. However, the Upshot's methodology of considering historical voting trends barely wavered in projecting an Obama victory:

img

Here's the textual description of that graphic:

At 10 p.m., with about 75 percent of precincts reporting, Mr. Obama trailed Mr. Romney by a few percentage points. Some people following only the raw vote totals came to believe that Mr. Romney had a good chance to win the state. In truth, it was unlikely that Mr. Romney’s lead would hold, because counties like Alexandria, Arlington and Fairfax — containing Democratic-leaning Washington suburbs — had many votes outstanding. The Upshot’s adjusted lead would have anticipated these votes, and would have been relatively stable by that time. (If our adjusted lead were perfectly clairvoyant, the line would have been flat.)

To be fair to CNN and John King, their results coverage last night frequently referred to the makeup of Virginia's late-reporting precincts. But the Upshot's visualization communicates both the actual Senate results and the history-based projection at the same time. In other words, the Upshot's viz told three important stories in a single glance:

  1. The up-to-the-minute actual results, including Gillespie's early and seemingly insurmountable lead.
  2. What the Upshot predicts will actually happen when all the results are tallied.
  3. Which precincts have yet to report, and how they voted in previous elections (note: the map was more interesting last night, when more precincts had yet to report)

img

Gillespie's vote count was unexpectedly high throughout the night, but not enough to beat Warner's effort and the historical trends of late-reporting Virginia precincts.
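To make the idea of an "adjusted lead" concrete, here is a minimal sketch in Python. It is not the Upshot's actual model, whose details are not given here. It simply assumes that each county's uncounted votes will split the way that county voted in a past election. The `County` fields, the turnout estimates, and all of the numbers are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class County:
    name: str
    dem_counted: int       # Democratic votes counted so far
    rep_counted: int       # Republican votes counted so far
    expected_total: int    # assumed final turnout, e.g. taken from a past election
    hist_dem_share: float  # Democratic share of the two-party vote in a past election


def raw_lead(counties):
    """Democratic lead, in percentage points, among counted votes only."""
    dem = sum(c.dem_counted for c in counties)
    rep = sum(c.rep_counted for c in counties)
    return 100.0 * (dem - rep) / (dem + rep)


def adjusted_lead(counties):
    """Projected final Democratic lead, in percentage points.

    Each county's outstanding votes are split according to its historical share.
    """
    dem = rep = 0.0
    for c in counties:
        counted = c.dem_counted + c.rep_counted
        outstanding = max(c.expected_total - counted, 0)
        dem += c.dem_counted + outstanding * c.hist_dem_share
        rep += c.rep_counted + outstanding * (1 - c.hist_dem_share)
    return 100.0 * (dem - rep) / (dem + rep)


# Made-up numbers for illustration only; these are not real Virginia returns.
counties = [
    County("Rural A", dem_counted=30_000, rep_counted=60_000,
           expected_total=100_000, hist_dem_share=0.35),
    County("Suburb B", dem_counted=10_000, rep_counted=8_000,
           expected_total=150_000, hist_dem_share=0.65),
]

print(f"Raw lead:      {raw_lead(counties):+.1f} points")
print(f"Adjusted lead: {adjusted_lead(counties):+.1f} points")
```

With these invented numbers, the raw count shows the Democrat trailing badly, while the adjusted figure shows a narrow Democratic lead, because most of the uncounted votes sit in the Democratic-leaning suburb. That is the same pattern the Upshot describes for late-reporting Virginia counties.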

A compelling story without the visuals

The Upshot's election graphics are excellent in aesthetics and execution. But don't overlook the journalistic insight that led to their creation: the realization that it's not only the up-to-the-minute information that is relevant, but also the knowledge that votes are counted in a particular order, and that history matters. The Upshot graphics are so enlightening because the journalism is solid.

Also, gathering historical precinct-level data is not trivial. Good graphics need good data.

A sidenote: FiveThirtyEight's Dan Hopkins asked, "Should a Close Virginia Race Surprise Us?"

There are more structural reasons to think that this Senate race could be close. Incumbents don’t have the advantage they used to. Our politics are also increasingly nationalized, leaving candidates with less room to craft a brand independent of their party’s. And Virginia is a closely contested state — with the exception of Warner’s prior bid in 2008 and the GOP blow-out in the 2009 gubernatorial election, recent races there have been close.

When visualization is superior to narration

Some of you have been following the Serial podcast, in which This American Life producer Sarah Koenig attempts to track down what really happened in the murder of Baltimore teen Hae Min Lee in 1999. It's a testament to how good the storytelling is (and how powerful and captivating simple storytelling forms can be) that people are wondering if there's a Great Podcast Renaissance.

Episode 5: Route Talk

But not all details are clear even in the best linear storytelling. So the Serial podcast site includes supplemental graphics to help listeners follow along: a table of the many cell phone calls in the relevant timeframe and a map of the key locations:

img

img

Both visualizations are helpful, but are still almost too cluttered to actually clarify much.

So this is how Reddit user VYshouldhavewon distilled the information into a visual timeline:

Already listened to the new episode 3 times this morning. Thought a visual timeline would be helpful for anyone trying to wrap their mind around it…

(answering a question about how it was created):

Just some colored lines in Adobe Illustrator. Props the type designers over at what used to be Hoefler & Freret Jones, they created this beautiful type! Thank you again.

Very frequently, there will be no single visualization tool that will do what you need. So be prepared to manually create a graphic to fit your story.

img

img

Don't be afraid to hand-make things

If a tool can't do exactly what you want it to do, then just combine it with other tools. Here's a Google Fusion map with Photoshopped labels, by Carolina Wilson of the Peninsula Press:

Here's the text:

Map data sources via Mountain View Voice: Death #1, Nov. 7, 2011, Abbas Vahidi / Death #2, April 9, 2012, Erick Onorato / Death #3, June 21, 2012, William Ware / Death #4, Sept. 15, 2012, Joshua Baker / Death #5, March 4, 2013, Ruifan Ma / Death #6, April 3, 2013, Sarra Golukhov. Map produced by Carolina Wilson/Peninsula Press.

img

Homework

  • Proposal for data project

    Due by next class

    Prepare an outline for your data project. It should be at least two pages long and should include:

    • Why you find this story interesting, and/or why it will be interesting to readers.
    • Why you think data is important to tell this story.
    • A rough outline of what you believe you will find.
    • At least three data sources that you have in your possession or that you have a good idea how to get (very soon), and the structure of the data records.
    • Links to and summaries of at least three stories that have already been published on the subject.
    • If you think a visualization is in order, a rough sketch (hand-drawn is fine) of what this visualization will look like.
    • A rough timeline for the next 3 weeks of where you plan to be in your reporting on these stories.

    Your data project can be based on what you’re doing in the other reporting classes. Or, if you don’t think your current beat lends itself to a data project, you’re free to pursue a subject that you are interested in, even if it’s at a national level.

    How the project will be evaluated

    Overall, the key question that I will be asking is: did you tell an interesting story and make a compelling argument that could not have been made without looking at data? What type or how many visualizations you use, how big the dataset is, etc., are obviously related topics.

    So far, I have not judged homework assignments on writing/editing/formatting/etc., but what you turn in as your final project should be in a form that if we wanted to, we could publish in the Peninsula Press.

    This project should take at least 30 hours to research.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • Tuesday, October 14

    The first of several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • Thursday, October 23

    A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • Thursday, November 6

    Storytelling with Data Visualization

    While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.