COMM 273D | Fall 2014

Tuesday, October 7

The points of maps

Mapping can be a dramatic way to connect data to where readers are and to what they recognize.

Topics

  • Why maps work
  • Why maps don't work
  • Introduction to Fusion Tables and TileMill

Homework

  • Map SFPD crimes (due by next class)
  • Map gun-related homicides (due by next class)
  • Sign up for the NICAR-L mailing list (due by next class)
  • Sign up for a StackOverflow account (due by next class)
  • Join the OpenData StackExchange community (due by next class)

Jump to the full details on homework assignments

Maps are fun. They're easy to make. Even without doing any work, a map of anything looks like a lot of data. And all of these are reasons not to use maps when a better visualization pattern exists.

Read: When Maps Shouldn’t Be Maps by Matthew Ericson

That said, we'll now learn a few map-making tools, both because there are occasions when maps are the right choice, and because mapping is another way to interact with different datasets.


Notes from last class

  • Still working on transcribing the great tips from Phillip Reese's chat.
  • Many of you are interested in Census data for your beats and projects. There is no better place to start than the Knight News Challenge-funded CensusReporter, which provides not only an attractive interface for viewing the Census's dense data, but great ways to explore its many categories and niches. Want to know the means of transportation to work for Palo Alto residents, by language spoken at home and the ability to speak English? Here you go.

    img

  • Many of you are interested in education data, or should be interested in it as a proxy to measure things like poverty and demographics. California has CBEDS, a massive data-collection program, through which you can find everything from teacher hiring and graduation rates to student population by race/sex/age, financial spending, and performance on the state tests. I suggest you start off with a site like GreatSchools, which is aimed at parents, but makes heavy use of the school and district level data. WNYC's Schoolbook, which is based on New York's version of the data, does a great job of organizing the data for parents while producing news stories from it.

When to use a map?

A map is comparable to a standard scatterplot chart; the locations are the data points, and the space between the points indicates distance and direction from each location.

So when you need to convey geography – e.g. how far one thing is from another on Earth, and/or what is between those two things – then maps are tip-top.

The Yelp map versus the Yelp list

The Yelp mobile app (seen here on Android) is a useful example of the tradeoffs between a map view and a list view:

yelp map

When looking up lunch options, the map not only shows us where and how far the potential options are, but also, what we have to cross to get there (a four-lane road versus a pedestrian-friendly path), and the relative location between the eateries; a clump indicates that a restaurant is in a more commercial area of town, as well as having the benefit of nearby options if the first choice is completely booked.

But the list view has its own benefits. While its geographic data is limited to distance, the list can pack in more information about each "datapoint", including the name of the place, a thumbnail image, reviewers' ratings, and the kind of place. On the map, this information is only available by tapping each map icon to bring up a pop-up.

yelp map

Which view is better? It depends on what you care about, and on the situation. In a densely packed area such as downtown Manhattan, where the grid is easy to traverse and there are suitable options in every direction, the list may be more useful for filtering the many choices.

Besides the idea that the value of a map, even when it's a good map, can depend on the viewer's situation, it's also worth noting how limited maps can be in displaying data. Most of the "ink" is devoted to drawing geographical features. Other than the map points themselves, the only other data that Yelp includes is a numbering system, which tells you where each place ranks from 1 to 20 according to Yelp's default judgment of "relevance". It's possible Yelp could differentiate the markers by color, but then they'd have to cram a legend onto the mobile screen.

Antipattern: Population maps

via XKCD

heat maps

The point of this comic is that if you don't control for population, then a map of "this is where something is/happened" will often just be a map of "this is where people live". This is not much different from the data-oopsies that happen on other chart types. It's just that with maps, there's little room to provide explanatory visual cues.

And it doesn't help that maps themselves are so darn attractive that we sometimes forget to care about the quality of their data.
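To make "controlling for population" concrete, here is a minimal sketch of the arithmetic. The place names, incident counts, and populations are all invented for illustration, and `rate_per_100k` is just a helper name chosen for this example. The point is only that ranking by raw count and ranking by rate can give different answers.

```python
# Hypothetical counts and populations, invented purely for illustration.
incidents = {"Metropolis": 500, "Smallville": 40, "Midtown": 120}
population = {"Metropolis": 1_000_000, "Smallville": 20_000, "Midtown": 300_000}


def rate_per_100k(count, people):
    """Return the number of incidents per 100,000 residents."""
    return count * 100_000 / people


rates = {place: rate_per_100k(incidents[place], population[place])
         for place in incidents}

# Rank the same places two ways: by raw count and by population-adjusted rate.
by_count = sorted(incidents, key=incidents.get, reverse=True)
by_rate = sorted(rates, key=rates.get, reverse=True)

print("Ranked by raw count:    ", by_count)
print("Ranked by rate per 100k:", by_rate)
```

In this made-up data, the biggest city tops the raw-count ranking, while the smallest town tops the per-capita ranking. A map of the raw counts would mostly show you where the people are.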

John Snow's Map via Edward Tufte

Dr. John Snow’s map of the 1854 Broad Street cholera outbreak is one of the most celebrated examples of data mapping, because when viewed in retrospect, it appears to make an irrefutable argument that cholera and other diseases were transmitted through water, and not "miasma" (i.e. foul-smelling air).

Below is Snow's famous map. The clusters of black bars are deaths due to cholera. The red dot is the location of the Broad Street pump, which Snow suspected was the source of the cholera.

snow map

Edward Tufte, considered the father of modern information display, arguably made Snow's map popular:

Instead of plotting a time-series, which would simply report each day's bad news, Snow constructed a graphical display that provided direct and powerful testimony about a possible cause-effect relationship

Here, Tufte produces a hypothetical time-series of the cholera deaths, in which the chart is accurate, but reveals nothing of the true nature of the epidemic:

tufte chart

The fact that the deaths seem to decline after the removal of the Broad Street pump handle means nothing. The death rate should have fallen anyway, because so many people had either died or fled the area by that point.

Problems with Snow's map

That said, if Snow's map is an example of the best kind of data map, then it is a particularly important example of the limitations and flaws of a data map.

For starters, it is prone to the "population density anti-pattern" pointed out by xkcd: did more people die near the Broad Street water pump because of the water from the pump? Or simply because more people happened to live there? To viewers at the time, the local population density may have been common knowledge, but the map does nothing to account for it.

Even without that ambiguity, sure, it seems that more people living near the pump died, but it's still hard to ascertain exact distances between the dead people and the Broad Street pump, versus nearby pumps: a short bird's-eye distance does not necessarily mean a short walking distance.
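The gap between bird's-eye and walking distance can be sketched with the haversine formula, which gives the straight-line (great-circle) distance between two latitude/longitude points. The coordinates below are made up and are not taken from Snow's map. The "walking route" is a hypothetical path along two street segments, assumed only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle ("bird's-eye") distance in meters between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical points: a home, a street corner, and a pump.
home = (51.5130, -0.1370)
corner = (51.5130, -0.1350)
pump = (51.5140, -0.1350)

# Straight line from home to pump, as it would look on the map.
straight = haversine_m(*home, *pump)

# Walking along the streets: home to the corner, then corner to the pump.
walking = haversine_m(*home, *corner) + haversine_m(*corner, *pump)

print(f"Bird's-eye distance: {straight:.0f} m")
print(f"Walking distance:    {walking:.0f} m")
```

Even in this tidy two-segment example the walk is longer than the straight line. On a real street grid, with dead ends and blocked alleys, the difference can be large enough to change which pump is actually "closest".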

But the main problem with this version of Snow's map is that it also "proves" the miasma theory. Those death clusters in the center of the map aren't because of the pump, it's because that's where the bad air is!

We'll cover John Snow's work in more detail next lecture, including a less-famous variation of the map that deals with some of the objections above.

The gun permits map

In 2012, shortly after the Sandy Hook school shooting, The Journal News (in New York) published an interactive map of pistol permit holders (http://archive.lohud.com/interactive/article/20121223/NEWS01/121221011/Map-Where-gun-permits-your-neighborhood-). There was also an accompanying story, The gun owner next door: What you don't know about the weapons in your neighborhood.

The formerly-interactive feature consisted of a Google Map with the addresses of pistol permit holders, gathered from public records requests by Journal News reporters, allowing users to see which homes near them had successfully applied for pistol permits.

gun map

via the map notes:

These maps indicate the locations of all pistol permit holders in Westchester and Rockland counties. Each dot represents an individual permit holder licensed to own a handgun — a pistol or revolver. The data does not include owners of long guns — rifles or shotguns — which can be purchased without a permit. Being included in this map does not mean the individual at a specific location owns a weapon, just that they are licensed to do so.

How did The Journal News obtain the information? Through requests to the individual county clerks under New York’s Freedom of Information Law. Isn’t that private information? No. There is no right to privacy regarding handgun ownership in New York. State law says that, at a minimum, the names and addresses of all permit holders are public record and must be disclosed.

The fact that the map is no longer interactive should be a clue to how well people took it. One of the first takeaways from it, though, is the special power of maps. Critics complained that the map would allow burglars to know which houses to target (or not target), yet clicking each dot on a Google Map would be a far less efficient search method than scanning a simple list of addresses. And even though the map inherently makes the data hard to scan, there's a certain emotional weight that comes from imagining where you are in comparison to the data points.

Raw data is not journalism

Skipping past the arguments of gun owner privacy, gun rights, the limits of the First Amendment, etc. etc., it's worth focusing on whether or not this constituted good journalism.

Looking just at the data, if we take away the map and present this data as a list, is that a public service? What's the "story" here? That many people own guns? But do we even know if that's the case? We would need to know at least the estimated population of the area, and even then, we don't know if the gun ownership rate in Westchester/Rockland, NY is out of line compared to the rest of the United States.

And even then, does this data show a correlation between the physical location of gun ownership and homicides? The Sandy Hook news angle isn't a good news angle, since the shootings didn't take place near the perpetrator's home. And yet the emotional impact of the Journal News's map is to make you think twice about the gun owners next door.

As mentioned above, the Journal News ran into a bit of controversy. That's a shame, because gun permit data has been used for great journalism projects in the past, which is why it was public record in the first place. One of the quick lessons here: just showing data, whether it be on a plain list or a sophisticated online map, is not the same as doing journalism.

More reading on this:

Gun permit data wasn’t maximized

Other map-related readings:

Mapping NYC stop and frisks: some cartographic observations - Steven Romalewski

How The Rainbow Color Map Misleads - Robert Kosara

A Map that Wasn't a Map - Tasneem Raja

Interesting maps as picked by the class

From last week's assignment of finding 10 interesting maps online:

  • Mapping the Spread of the Military's Surplus Gear - Shows the county-by-county distribution of some of the DOD's gear. The single-color design obscures any comparison between individual counties, but that's just as well, since per capita calculations may also be misleading.

    img

  • Map of Scientific Collaborations from 2005 to 2009 - Collaborations between scientists internationally. The data source is unclear, but it was possibly mined from the credits of research papers.

    img

  • New York Health Dept. Restaurant Ratings - Besides showing the distribution of grades, this map has some useful (if sickening) filters to explore, such as showing where rodent violations took place.

    img

  • Drought's Footprint - Not just a map, but many maps, and a great demonstration of the "small multiples" technique to illustrate change over time in a non-traditional time series format.

    img

Mapping tools

For your convenience, I've created two tutorials for two of the mapping tools we'll be using:

Homework

  • Map SFPD crimes

    Due by next class

    Using a snapshot of the SFPD incident reports in 2014, repeat the steps in this Fusion Tables tutorial, but map a different category of crime and make an interesting differentiation of the markers.

    Then publish your work in your Github Pages repository as a file named sfpd-fusion-tables-map.html. In other words, when I visit http://YOUR_USERNAME.github.io/sfpd-fusion-tables-map, I should be able to see your work.

  • Map gun-related homicides

    Due by next class

    Using TileMill, take Slate’s collection of gun-related deaths for 2013, group the deaths by location, and map the data so that locations with more deaths will have different/bigger markers.

    Hints:

    1. You will have to use Pivot Tables for this. Think about the bare minimum of data you need to get the location data for the mapping tool and for the count of deaths.
    2. You will most likely have to copy the Pivot Table and paste it (or rather, Paste special > Paste values only in Google Spreadsheets) into a new spreadsheet and do some slight cleaning up before you import it into Fusion Tables or TileMill.
    3. In your private Github repo, produce a Markdown file named homicides-map.md.
    4. Write a short paragraph explaining the deficiencies of this kind of map visualization (think about what correlates with more shooting deaths). Include a screenshot of the map you made from TileMill.

    Note: If you have problems installing TileMill on your own computer and if I’m not able to get it on the lab computers, then you can do this assignment with Fusion Tables.
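If it helps to see what the Pivot Table step in the hints is doing, here is a rough sketch of the same "group by location and count" idea in Python. You do not need code for this assignment, since the spreadsheet does it for you. The sample rows are made up, and the column names (`city`, `state`) are assumptions that may not match the actual spreadsheet.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the real spreadsheet; the column names
# ("city", "state") are assumptions and may not match the actual data.
SAMPLE_CSV = """name,city,state
Person A,Oakland,CA
Person B,Oakland,CA
Person C,Chicago,IL
Person D,Oakland,CA
Person E,Chicago,IL
Person F,Palo Alto,CA
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))

# Keep only the location columns, then count rows per location. This is the
# same result a pivot table gives with location as rows and COUNT as values.
deaths_by_location = Counter((row["city"], row["state"]) for row in rows)

# Write a small two-column table that a mapping tool could geocode.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["location", "deaths"])
for (city, state), count in deaths_by_location.most_common():
    writer.writerow([f"{city}, {state}", count])

print(out.getvalue())
```

Notice how little data survives: one location column and one count column. That is the "bare minimum" the first hint is asking you to think about.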

  • Sign up for the NICAR-L mailing list

    Due by next class

    This is the best discussion group for computer-assisted reporters. Talk about anything from technical questions to the state of the industry. Sign up instructions are here. Basically, send an email to listserv@lists.missouri.edu and in the body of the message, enter:

            SUBSCRIBE NICAR-L first_name last_name
    

    If you feel like the flow of emails is too much (it’s best viewed in GMail), then send an email to listserv@lists.missouri.edu and in the body of the message (not the subject), enter:

            SET NICAR-L DIGEST
    

    To go back to getting emails one-by-one, email that same address with the body of:

            SET NICAR-L NODIGEST
    

    I personally set it to NODIGEST and then in my GMail filters, have all messages From NICAR-L@po.missouri.edu be sent to archive to keep my inbox clean.

  • Sign up for a StackOverflow account

    Due by next class

    Hands down the best place to ask questions related to code. When you Google error messages or technical issues, you will almost always end up on a StackOverflow page. Create an account so you can ask questions directly.

  • Join the OpenData StackExchange community

    Due by next class

    The OpenData StackExchange site is a StackOverflow-type site devoted to questions about data, including where to get and find it. If you’ve created a StackOverflow account, you should be able to use the same login credentials for that as for the OpenData site.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.