COMM 273D | Fall 2014

Thursday, October 30

Predicting the elections

With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.


  • Statistical significance
  • Poll reliability
  • Forecasting


  • Analyze polling methodologies (due by next class)
  • Make your own poll-aggregating forecast (due by next class)
  • Senate prediction pool (due by next class)
  • Campaign finance data (due by next class)


Polling and forecasting

How to do an Election Survey

via Chapter 9, "How to do an Election Survey" in Philip Meyer's "Precision Journalism", 1991 edition:

The following statement is true, even though almost everyone involved in election polling denies it.

The purpose of an election survey is to predict the outcome of the election. Editors, pollsters, and pundits will curse, evade, or ignore the truth of that statement, sometimes with great heat. But it is still true. And if you are going to do election surveys, you might as well get used to this simple fact: your success or failure will be judged by how well your poll predicts the outcome. It is a reasonable test, and a fair one.

…An election poll is, of course, good for other things than predicting the outcome of elections. It can show what issues are motivating the voters. It can measure familiarity with the issues and the candidates. It can show what coalitions are being formed or renewed. It can provide insights into candidate strategy that are in turn derived from the candidate's own polls.

But to do any of these good things, the poll must be a valid representation of the participating electorate. And the election is the check on whether it succeeds at that. This chapter is about the things you can do to make certain that your own poll matches the election outcome.

How to identify likely voters

Again from Chapter 9, "How to do an Election Survey"

The low election participation rates in the United States make life hard for pollsters. In the 1988 presidential election, only 50 percent of the voting age population showed up at the polls. The low turnout creates two obvious problems:

  1. You need a bigger sample. To get the 3 percent error margin provided by a sample of 1,000, you have to interview 2,000 people in order to end up with 1,000 voters.

  2. You have to figure out which of the people in your oversized sample will belong to the voting 50 percent.

The second problem is by far the most difficult. Of course, you can just ask people if they plan to vote or not. The trouble with that tactic is that voting is perceived as a socially useful activity, and so respondents do not like to admit not participating. About 80 percent say they are registered, but only about 65 percent actually are. And those who are registered greatly overestimate their likelihood of voting.
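The arithmetic behind Meyer's first point can be checked with a minimal sketch. It assumes a simple random sample, the usual 95 percent confidence level (z = 1.96), and the worst-case split of p = 0.5. The function names are made up for this example and do not come from the book.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% confidence interval for a proportion,
    assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

def interviews_needed(target_voters, turnout):
    """Interviews required so that, on average, target_voters of the
    respondents turn out to be actual voters."""
    return math.ceil(target_voters / turnout)

moe = margin_of_error(1000)
print(f"Margin of error for 1,000 voters: {moe:.1%}")
print("Interviews needed at 50% turnout:", interviews_needed(1000, 0.50))
```

With 1,000 voters the margin comes out to about 3.1 percent, which is the "3 percent" in the excerpt, and a 50 percent turnout doubles the number of interviews to 2,000.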

Nate Silver on Forecasting Elections

"Finding Fame With a Prescient Call for Obama"

In an election season of unlikely outcomes, Mr. Silver, 30, is perhaps the most unlikely media star to emerge. A baseball statistician who began analyzing political polls only last year, he introduced his site in March, where he used his own formula to predict federal and state results and run Election Day possibilities based on a host of factors.

Other sites combine polls, notably RealClearPolitics and Pollster, but FiveThirtyEight, which drew almost five million page views on Election Day, has become one of the breakout online stars of the year. Mr. Silver recognized that people wanted to play politics like they played fantasy baseball, and pick apart poll numbers for themselves instead of waiting for an evening news anchor to interpret polls for them.

Principle: "Think probabilistically"

via Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (Kindle Locations 1068-1069). Penguin Group US. Kindle Edition:

How likely is a candidate to win, for instance, if he’s ahead by five points in the polls? …The answer depends significantly on the type of race that he’s involved in. The further down the ballot you go, the more volatile the polls tend to be: polls of House races are less accurate than polls of Senate races, which are in turn less accurate than polls of presidential races. Polls of primaries, also, are considerably less accurate than general election polls. During the 2008 Democratic primaries, the average poll missed by about eight points, far more than implied by its margin of error.

The problems in polls of the Republican primaries of 2012 may have been even worse. In many of the major states, in fact— including Iowa, South Carolina, Florida, Michigan, Washington, Colorado, Ohio, Alabama, and Mississippi— the candidate ahead in the polls a week before the election lost. But polls do become more accurate the closer you get to Election Day. Figure 2-4 presents some results from a simplified version of the FiveThirtyEight Senate forecasting model, which uses data from 1998 through 2008 to infer the probability that a candidate will win on the basis of the size of his lead in the polling average. A Senate candidate with a five-point lead on the day before the election, for instance, has historically won his race about 95 percent of the time— almost a sure thing, even though news accounts are sure to describe the race as “too close to call.” By contrast, a five-point lead a year before the election translates to just a 59 percent chance of winning— barely better than a coin flip.

Via Figure 2-4 "Probability of Senate Candidate Winning, Based on Size of Lead in Polling", courtesy Nate Silver:

                       Points ahead
Days until election    1      5      10       20
1                      64%    95%    99.7%    99.999%
7                      60%    89%    98%      99.97%
30                     57%    81%    95%      99.7%
90                     55%    72%    87%      98%
180                    53%    66%    79%      93%
365                    52%    59%    67%      81%
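One way to read the table probabilistically is to ask how much uncertainty each row implies. The sketch below is not Silver's model. It assumes, purely for illustration, that the final margin is normally distributed around the current polling lead, and it backs out the standard deviation that would reproduce the table's win probabilities for a 5-point lead. The function names are hypothetical.

```python
from statistics import NormalDist

# Win probabilities for a 5-point lead, copied from the Figure 2-4 table.
WIN_PROB_5PT_LEAD = {1: 0.95, 7: 0.89, 30: 0.81, 90: 0.72, 180: 0.66, 365: 0.59}

def implied_sigma(lead, win_prob):
    """Standard deviation (in points) of the final margin that makes
    P(final margin > 0) equal win_prob, ASSUMING the final margin is
    normal and centered on the current lead."""
    return lead / NormalDist().inv_cdf(win_prob)

def win_probability(lead, sigma):
    """P(final margin > 0) under the same normal assumption."""
    return NormalDist().cdf(lead / sigma)

for days, prob in WIN_PROB_5PT_LEAD.items():
    sigma = implied_sigma(5, prob)
    print(f"{days:>3} days out: implied sigma = {sigma:.1f} points")
```

Under this assumption the implied uncertainty is about 3 points on the day before the election and grows steadily as the election gets further away. Plugging the one-day sigma back in for a 1-point lead gives a win probability of roughly 63 percent, close to the 64% in the table, so the normal approximation is at least in the right neighborhood for small leads.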

Rendered in graph form (Google Spreadsheet):

[Image: "nate silver graph"]

Principle: "Looking for Consensus"

via Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (Kindle Locations 1140-1143). Penguin Group US. Kindle Edition:

Quite a lot of evidence suggests that aggregate or group forecasts are more accurate than individual ones, often somewhere between 15 and 20 percent more accurate depending on the discipline. That doesn’t necessarily mean the group forecasts are good. (We’ll explore this subject in more depth later in the book.) But it does mean that you can benefit from applying multiple perspectives toward a problem.
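A small simulation can show why a group forecast tends to beat its members, and why that does not make it good. Everything here is an assumption for illustration: the true margin, the two error sizes, and the number of forecasters are invented, and the result is not calibrated to Silver's 15 to 20 percent figure. Each forecaster's error is split into a part shared by everyone (which averaging cannot remove) and a part of their own (which averaging shrinks).

```python
import random
import statistics

random.seed(273)  # fixed seed so the run is repeatable

TRUE_MARGIN = 2.0      # hypothetical true margin, in points
SHARED_ERROR_SD = 2.0  # error common to every forecaster (assumed)
OWN_ERROR_SD = 4.0     # error unique to each forecaster (assumed)
N_FORECASTERS = 10
N_TRIALS = 5000

individual_sq_errors = []
group_sq_errors = []
for _ in range(N_TRIALS):
    shared = random.gauss(0, SHARED_ERROR_SD)
    forecasts = [TRUE_MARGIN + shared + random.gauss(0, OWN_ERROR_SD)
                 for _ in range(N_FORECASTERS)]
    group = statistics.mean(forecasts)  # the aggregate forecast
    individual_sq_errors.extend((f - TRUE_MARGIN) ** 2 for f in forecasts)
    group_sq_errors.append((group - TRUE_MARGIN) ** 2)

# Root-mean-square error of a typical individual vs. the group average.
individual_rmse = statistics.mean(individual_sq_errors) ** 0.5
group_rmse = statistics.mean(group_sq_errors) ** 0.5
print(f"Typical individual error: {individual_rmse:.2f} points")
print(f"Group average error:      {group_rmse:.2f} points")
```

The group error is smaller than the typical individual error, but it never drops below the shared error, which matches the caveat that a more accurate group forecast is not necessarily a good one.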

Silver On the Colbert Report

Nate Silver on Oct. 7, 2008

Nov. 5, 2012 Colbert Report appearance - "Nate Silver calls polls simple, admits that he is anti-pundit and predicts an Obama election win."

The Senate Forecasters


General principles of sampling

via Chapter 5, "Surveys" in Philip Meyer's "Precision Journalism", 1991 edition:

The kind of sample you draw depends, of course, on the method of data collection. If you are going to do it by mail, you need a sample that includes addresses. If by phone, you need phone numbers. If in person and at home, you can get by without either of these, at least in the opening stages. You will probably use instead the census count of housing units.

Regardless of the method, the basic statistical rule of sampling still applies:

Each member of the population to which you wish to generalize must have a known chance of being included in the sample.
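The reason the chance has to be known is that it lets you weight each respondent back to the population. The sketch below uses invented numbers: two hypothetical groups are sampled at different rates, and each respondent is weighted by 1 divided by their chance of being included. The group names, sizes, and support figures are assumptions, not survey data.

```python
# Hypothetical groups sampled at different, but known, rates.
# name: (population size, sample size, share of sample backing candidate A)
strata = {
    "urban": (80_000, 400, 0.60),
    "rural": (20_000, 400, 0.40),
}

def inclusion_probability(population, sample):
    """Chance that any one member of the group ends up in the sample."""
    return sample / population

def weighted_share(strata):
    """Estimate support by weighting each respondent by 1 / inclusion chance."""
    weighted_yes = 0.0
    weighted_total = 0.0
    for population, sample, share in strata.values():
        weight = 1 / inclusion_probability(population, sample)
        weighted_yes += weight * sample * share
        weighted_total += weight * sample
    return weighted_yes / weighted_total

def unweighted_share(strata):
    """Naive estimate that ignores the unequal sampling rates."""
    yes = sum(sample * share for _, sample, share in strata.values())
    total = sum(sample for _, sample, _ in strata.values())
    return yes / total

print(f"Unweighted estimate: {unweighted_share(strata):.0%}")
print(f"Weighted estimate:   {weighted_share(strata):.0%}")
```

Here the naive estimate is 50% while the weighted estimate is 56%, because rural respondents were four times as likely to be sampled as urban ones. Without known inclusion chances there would be no principled way to make that correction.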

Gallup's flawed sample


via Gallup, Nov. 5, 2012: Romney 49%, Obama 48% in Gallup's Final Election Survey

Result (via NPR's data team and the AP):


via Nate Silver, FiveThirtyEight, "Which Polls Fared Best (and Worst) in the 2012 Presidential Race"

It was one of the best-known polling firms, however, that had among the worst results. In late October, Gallup consistently showed Mr. Romney ahead by about six percentage points among likely voters, far different from the average of other surveys. Gallup’s final poll of the election, which had Mr. Romney up by one point, was slightly better, but still identified the wrong winner in the election. Gallup has now had three poor elections in a row. In 2008, their polls overestimated Mr. Obama’s performance, while in 2010, they overestimated how well Republicans would do in the race for the United States House.

The cell phone bias

Cell Phones and Election Polls: An Update (Oct. 13, 2010) by Pew Research:

The latest estimates of telephone coverage by the National Center for Health Statistics found that a quarter of U.S. households have only a cell phone and cannot be reached by a landline telephone. Cell-only adults are demographically and politically different from those who live in landline households; as a result, election polls that rely only on landline samples may be biased.

In three of four election polls conducted since the spring of this year, estimates from the landline samples alone produced slightly more support for Republican candidates and less support for Democratic candidates, resulting in differences of four to six points in the margin.
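To see how leaving out cell-only households can move the margin by several points, here is a minimal sketch. The one-quarter share comes from the Pew excerpt; the two margins are hypothetical values chosen for illustration, and the sketch treats households and voters interchangeably for simplicity.

```python
CELL_ONLY_SHARE = 0.25  # from the Pew excerpt: a quarter of households

def blended_margin(landline_margin, cell_only_margin,
                   cell_only_share=CELL_ONLY_SHARE):
    """Republican-minus-Democrat margin, in points, for the full population."""
    return ((1 - cell_only_share) * landline_margin
            + cell_only_share * cell_only_margin)

landline = 6.0     # hypothetical: R +6 among landline households
cell_only = -10.0  # hypothetical: D +10 among cell-only households

full = blended_margin(landline, cell_only)
bias = landline - full  # how far a landline-only poll would be off
print(f"Landline-only margin:   R {landline:+.1f}")
print(f"Full-population margin: R {full:+.1f}")
print(f"Landline-only bias:     {bias:+.1f} points toward Republicans")
```

With these made-up inputs a landline-only poll overstates the Republican margin by 4 points, which is the size of gap Pew describes.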

Gallup postmortem

via USA Today's Martha T. Moore, June 4, 2013, "Gallup identifies flaws in 2012 election polls"

Gallup, with researchers from the University of Michigan, will experiment with ways to better identify likely voters in surveys during the 2013 governor's races in New Jersey and Virginia. Gallup asks seven questions in its phone surveys to determine whether people are likely to vote – a questionnaire that may rely too much on past voting and on how much "thought" voters have given to the election, Gallup Poll editor in chief Frank Newport said. Though all polling outfits showed an increase of support for Romney among likely voters vs. registered voters, Gallup's bump for Romney was the most extreme. "We really are re-evaluating that from square one," Newport said.

In a six-month postmortem review, Gallup determined that part of the poll's overstatement of Romney support arose from too few phone interviews in the Eastern and Pacific time zones, overstating the white vote through a flawed procedure for racial identification, and relying on listed landline phone numbers.

Poll reliability

FiveThirtyEight's pollster ratings (screenshot):


How 538 calculates pollster ratings

The short answer is that pollster performance is predictable — to some extent. Polling data is noisy and bad pollsters can get lucky. But pollster performance is predictable on the scale of something like the batting averages of Major League Baseball players.

HuffPo's chart of poll charts:


Alaska snapshot



  • Analyze polling methodologies

    Due by next class

    Choose two of the three forecast methodologies: the Upshot’s Leo, FiveThirtyEight, or WaPo’s Election Lab. In a file named, explain the five most significant differences between the two methodologies you’ve chosen to read.

    Also, explain which of the two methodologies is better, and why.

  • Make your own poll-aggregating forecast

    Due by next class

    Within your group, make a copy of this homemade polling spreadsheet

    Then, visit the HuffPo Pollster, find the state that you’re supposed to analyze, and enter in at least 24 polls that you’ve found. There are two components to this:

    Boilerplate for you to manually copy
    • The name of the poll
    • The URL to the poll
    • The date of the poll
    • The size of the poll (if given)
    • The method of the poll (e.g. Live phone, internet poll)
    • The stated margin of error (if given) in percentage points
    • The stated Republican margin of victory in percentage points
    Half-educated guesses for you to make

    Based on what you have read about the factors in polling, assign positive or negative adjustments in the Date, Pollster credibility, and Poll conduct columns. These are pretty much at your discretion. For example, for a month-old poll, you may want to give it -30 in the Date adjustment. All of the adjustments are added to or subtracted from the “Total weight”, which then weights the given Republican Margin of Victory.

    At the bottom of the chart is a formula that will sum up all the weighted margins.

    Again, all of this adjustment-business is up to you and your educated guessing. Next week, expect to write a short essay explaining why your weighted estimate was very right/wrong.
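If it helps to see the weighting idea outside of a spreadsheet, here is a rough sketch. The spreadsheet's actual formula is not reproduced here, so this is one plausible reading: every poll starts at an assumed base weight of 100, the three adjustments are added to it, and the result is a weighted average of the Republican margins. The poll rows, the base weight, and the function names are all hypothetical.

```python
BASE_WEIGHT = 100  # assumed starting weight; the spreadsheet's base may differ

# Hypothetical rows:
# (pollster, Republican margin in points, date adj, credibility adj, conduct adj)
polls = [
    ("Poll A", 4.0, 0, 10, 0),
    ("Poll B", -1.0, -30, 0, -10),
    ("Poll C", 2.0, -10, 20, 10),
]

def total_weight(date_adj, credibility_adj, conduct_adj, base=BASE_WEIGHT):
    """Base weight plus the three adjustments, floored at zero."""
    return max(base + date_adj + credibility_adj + conduct_adj, 0)

def weighted_margin(polls):
    """Weighted average of the Republican margins across all polls."""
    weights = [total_weight(*row[2:]) for row in polls]
    if sum(weights) == 0:
        raise ValueError("every poll has zero weight")
    weighted_sum = sum(w * row[1] for w, row in zip(weights, polls))
    return weighted_sum / sum(weights)

simple_average = sum(row[1] for row in polls) / len(polls)
print(f"Simple average:   R {simple_average:+.2f}")
print(f"Weighted average: R {weighted_margin(polls):+.2f}")
```

In this made-up example the month-old, lower-rated Poll B counts for about half as much as the others, which pulls the weighted estimate toward the Republican side compared with the simple average.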

  • Senate prediction pool

    Due by next class

    Each group will guess the margin of victory for each of the highly contested Senate races. Winners get something TBA.

    Fill out the spreadsheet here for your state group row.

    1. For your state, enter in the sum of the weighted margin you got in your homemade poll aggregator.
    2. For all the other states, just make an educated guess.
  • Campaign finance data

    Due by next class

    Gather/clean up the FEC data as mentioned in the assignment from last class.

    For starters, fill out your state row in this spreadsheet.

    Your group is responsible for making charts that illustrate the 3 following topics:

    1. Contributions to the candidates, both in 2008 and 2014. One example would be to show the difference of in-state versus out-of-state in 2008 vs. 2014.
    2. How candidates spent their money. Do not just do a simple summation of 2008 vs. 2014, but show the breakdowns of categories of spending.
    3. How independent expenditures were used for/against the 2014 candidates (there is no data for the 2008 race).

    I leave the nature of the charts to you. In a subsequent class, we will do a class critique of the visualization choices made in this assignment.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first of several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen DeGeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.