COMM 273D | Fall 2014

Thursday, October 23

Midterm Malarkey with Military Surplus

A casual midterm covering the range of data analysis and programming skills acquired so far.

Topics

A midterm on SQL and data
Data on military surplus distributed to U.S. counties
U.S. Census QuickFacts

Homework

Do the Midterm Due Thursday, October 30

Jump to the full details on homework assignments

The notes for this section consist entirely of the description for the dataset used for the midterm, example queries to help you understand the dataset, and then a few questions for you to tackle on your own.

Everything covered here has been covered in the tutorials, though you'll probably want to make sure you have a good understanding of LEFT joins and many-to-one-relationships.

Update: Here are the answers

Previous homework

Regarding the time-series (or histogram) chart, by political party, of Congressmembers and the number of Tweets mentioning "ebola" as of October 20, 2014.

Using the MySQL dump or SQLite dump of of Congress and Twitter data, run this query to find every tweet mentioning "ebola" and selecting the relevant fields:

SELECT 
    members.first_name, members.last_name, 
    members.party, members.current_role, members.state,
    social_accounts.twitter_screen_name,
    tweets.text, DATE(tweets.created_at) as tweet_date
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  ORDER BY tweet_date

The result is 811 tweets, which I've uploaded to this Google Spreadsheet. From here, it's just adding some derived date columns and doing a pivot table.

I've made a couple of editorial choices: the database contains only 2 tweets (both by Democrats) in the year 2013, and only 2 tweets by an Independent. The outliers would skew the chart (i.e. create a lot of empty space) so I've filtered them out, so that the chart only shows 2014 tweets by Democrats and Republicans.

Along the x-axis, I've chosen to do the week number (which can be found using Google Spreadsheet's WEEKNUM function, or MySQL's WEEK function) derived from each tweet's created_at value.

The resulting chart:

ebola chart

Note: Alternatively, if you wanted to reduce the amount of Pivot Table work, you could've done the aggregations inside of SQL.

For MySQL:

SELECT 
    members.party, 
    WEEK(tweets.created_at) AS tweet_week,
    YEAR(tweets.created_at) AS tweet_year,
    COUNT(*) AS tweet_count
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  GROUP BY party, tweet_year, tweet_week
  ORDER BY party, tweet_year, tweet_week

For SQLite (remember the list of format strings for the strftime function?):

SELECT 
    members.party, 
    strftime('%W', tweets.created_at) AS tweet_week,
    strftime('%Y', tweets.created_at) AS tweet_year,
    COUNT(*) AS tweet_count
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  GROUP BY party, tweet_year, tweet_week
  ORDER BY party, tweet_year, tweet_week

Midterm information

The following section is all about the midterm and the data we're using.

Download the data:

SQL
SQLite

The 1033 Program

img Wikipedia

Photo courtesy of Wikipedia user Loavesofbread licensed under Creative Commons Attribution-Share Alike 4.0 International

The 1033 Program was established to allow the U.S. Defense Dept. to give away its surplus equipment to U.S. law enforcement agencies. More than $5.1 billion of military gear has been distributed to the nearly 8,000 law enforcement agencies who participate in the program.

Congress initially authorized the transfer of military gear for the War on Drugs. The 1033 program was later open to any U.S. law enforcement agency, though preference is given to requests related to counter-drug and counter-terrorism requests.

The program has, of late, come under scrutiny because of photos of heavily-equipped police during the civil unrest in Ferguson during 2014.

NPR has some documentation arising from their analysis of the federal 1033 Program data and statewide data that they've acquired. You may want to read it to figure out what some of the data columns mean.

The California Governor's Office of Emergency Services has posted a helpful 1033 Program page in which you can find links to more background information and data.

MuckRock has been making public records requests for additional data from each state. I haven't included their tables into this midterm, but their reporting is well-worth reading to understand the issues and context of the 1033 program.

The five data tables in the midterm database

The SQL version.
The SQLite version
leso - the main table of interest. It contains the list of military surplus items distributed by the 1033 program to law enforcement agencies from January 2006 to July 2014. It is derived from the official spreadsheet maintained by the Law Enforcement Support OFfice, which you can download in Excel format here.
county_boundaries - A list of KML boundary data for every U.S. county, courtesy of Google. See their Fusion Table here
county_ansi - County names and FIPS codes. I've "cleaned" up the county field so that it matches the ways counties were (mis)spelled in the leso table. Read more about ANSI and FIPS here.
psc - A list of Product Service Codes and descriptions, from the Federal Procurement Data System. You may find it useful for browsing the general categories of things that the 1033 program distributes. Read more about PSCs here.
census_quickfacts - A variety of U.S. Census demographic and population statistics per county. Please read the DataDict.txt file provided on the U.S. Census's overview page.

Making sense of the U.S. Census QuickFacts table

To reiterate: the census_quickfacts table is not going to make much sense because the headers are things like AGE775213 (Persons 65 years and over, percent, 2013) and RHI725213 (Hispanic or Latino, percent, 2013). So you must read the Census Quickfacts Data Dictionary to make sense of it.

Technical data munging note

Like every data set we've dealt with so far, the Census and LESO data is what we call, "messy". For example, the original LESO spreadsheet, which I downloaded (direct link) via the California state site, comprised separate lists divided by alphabetical ordering of the states. So I had to manually copy and paste the sheets together to make one sheet.

The Census QuickFacts table was not in itself "messy"; the problem is that the LESO data only refers to state and county names, in all uppercase, and the LESO data does not follow the ANSI standard, apparently. For example, "Bristol City, Virginia" is referred in LESO as just "BRISTOL". Whereas "Baltimore city" is correctly keyed as "BALTIMORE CITY". The inconsistencies were without rhyme or reason, leaving me to apply custom queries to fix up at least several hundreds of rows.

The upshot: if you think you have a cool story from this midterm data, don't run with it just yet. We'll want to doublecheck that I didn't mess up my data munging in such a way that it impacts your queries.

Sample queries

Most of the work will involve just inspecting the tables and what they contain. Here are some example queries to demonstrate what's in them.

Quick syntax note

When referring to a column that has a space in the name, such as Item Cost, you must enclose the table name either inside of backticks (the little used key in the top-left corner of your keyboard, under the tilde sign), `

Examples:

SELECT `Item Cost` FROM leso

This will also work:

SELECT leso.`Item Cost` FROM leso

And this if you really want to be explicit about table names:

SELECT `leso`.`Item Cost` FROM leso

SELECT county_ansi.state, county_ansi.county, 
      AGE775213 as senior_pop, county_boundaries.geometry
  FROM county_ansi
  INNER JOIN census_quickfacts
     ON county_ansi.fips = census_quickfacts.fips
  INNER JOIN county_boundaries 
     ON county_ansi.fips = county_boundaries.fips

Exporting the result of this table to Fusion Tables creates a map like this:

See the Fusion Table here.

Deriving a column totaling the LESO acquisition costs

The leso table includes a Quantity column and an Acquisition Cost. If you want to find the total cost per line item, then you need a column in which Quantity is multiplied by Acquisition Cost:

SELECT `Item Name`, UI, Quantity,`Acquisition Cost`, 
  (Quantity * `Acquisition Cost`) AS totes_cost
  FROM leso
  ORDER BY Quantity DESC
  LIMIT 10

Item Name	UI	Quantity	Acquisition Cost	totes_cost
WIRE,ELECTRICAL	Foot	91000	0.58	52780.00
SCREW,CAP,SOCKET HEAD	Each	43822	0.23	10079.06
MAGAZINE,CARTRIDGE	Each	40000	9.31	372400.02
BANDAGE KIT,ELASTIC	EA	22428	7.00	156996.00
CPCV BRASS	LB	21100	1.00	21100.00
PANELING MATERIAL,BUILDING	Pound	10000	1.57	15700.00
MAGAZINE,CARTRIDGE	Each	9415	9.31	87653.65
BUTTSTOCK,SUBASSEMBLY	Each	9000	15.03	135270.00
5 56 EXPENDED BALL BRASS	LB	7700	1.00	7700.00
DRESSING,COMPRESSION	Each	7545	10.15	76581.75

If you want, you can create a new column named totes_cost, specify it as an INT/INTEGER or FLOAT/DECIMAL and then update it with the calculation:

UPDATE leso
  SET totes_cost = (Quantity * `Acquisition Cost`);

This way, you don't have to keep doing the multiplication for every query.

What categories of military surplus aren't being distributed?

Not every kind of military surplus is requested or eligible to be shipped out to your local police department. To find out which, do a LEFT JOIN from the psc to the leso table:

SELECT psc.`PSC CODE`, 
    psc.`PRODUCT AND SERVICE CODE FULL NAME`,  
    psc.`PRODUCT AND SERVICE CODE NOTES`, 
    psc.`PRODUCT AND SERVICE CODE INICLUDES`
  FROM psc
  LEFT JOIN leso
  ON psc.`psc code` = leso.psc_code
  WHERE leso.psc_code IS NULL
  /* include only PSC Codes that are 4 characters long */
  AND LENGTH(psc.`PSC CODE`) = 4

Apparently, nuclear bombs and flamethrowers are not yet eligible to be acquired through the 1033 Program:

PSC CODE	PRODUCT AND SERVICE CODE FULL NAME	PRODUCT AND SERVICE CODE NOTES	PRODUCT AND SERVICE CODE INICLUDES
1025	Guns, over 150 mm through 200 mm Includes Firing Platforms; Mounts; Gun Shields.
1030	Guns, over 200 mm through 300 mm Includes Gun Yokes; Rammers; Reflectors.
1035	Guns, over 300 mm		Breech Mechanisms; Training Gears; Power Drives.
1040	Chemical Weapons and Equipment		Flame Throwers; Smoke Generators.
1045	Launchers, Torpedo and Depth Charge		Depth Charge Tracks; Torpedo Tubes.
1055	Launchers, Rocket and Pyrotechnic		Airborne Rocket Launchers adaptable to guided missile use.
1070	Nets and Booms, Ordnance	This class includes nets and booms for harbor defense only.
1075	Degaussing and Mine Sweeping Equipment
1105	Nuclear Bombs	This class includes nuclear weapons (including bombs), which are designed to be dropped from an aircraft.	Ballistic cases, tail assemblies, retardation devices, and other peculiar components which are not classifiable elsewhere.
1110	Nuclear Projectiles	This class includes nuclear weapons which are designed to be propelled from a recoilless rifle, gun, howitzer, or the like, and which are not designed to be self propelled.	Ogive sections, body sections, bases, and other peculiar components which are not classifiable elsewhere.

Midterm questions

For all of these questions, besides giving the answer, please include the SQL quer(ies) and other spreadsheet/data steps you took to get to the answer.

Query the database and create a list of every county (with state) that has acquired at least one mine-resistant armored vehicle. Then map that list (Fusion Tables will probably be the easiest method). The map but not exactly the same, as the list behind this NYT interactive map of armored vehicles.
Query the database and find the most expensive single item that can be acquired through the 1033 program. Then generate a list of all counties that has so far acquired at least one of these things (you'll likely do two queries to answer this question).
Query the database to get the top 10 counties ordered by the total number of guns acquired through the 1033 program. Your answer should look very similar to what NPR found under the Total Guns Acquired graph (e.g. 3,452 guns for Los Angeles)
Query the database to get the top 10 counties ordered by number of guns acquired per 1,000 people using the Census's 2013 population estimate. Again, your answers should look similar (but not exact) to what NPR found, e.g. 28 guns/1,000 people in Franklin, KY.
Create a time-series/histogram showing something you find interesting in the 1033 Program data. Examples: Number of gas masks versus night-goggles distributed by year. Monetary value of pants and trousers versus rifles, by year.
Find all of the counties that have not acquired a single thing in the leso table, and then map those counties. If you order your list by population, the top two counties should be Kings County and Bronx County in New York (for reasons that should make sense if you have lived in New York).
Find the county that has acquired the most from the 1033 Program in terms of total acquisition cost and generate a list of the items that county has acquired.
Write a 300-word story memo on an interesting query (or queries) of your own. The memo should include examples/results of a data query and why you think said results are newsworthy or worth further investigation. Your memo, ideally, would involve some additional research to see if anyone else has found anything related to your inquiry. A chart/graphic is optional.

Example story ideas:
- Finding the smallest counties (by population) that have acquired the most heavy equipment despite no obvious need for such equipment. Is that equipment in use, and how? What is the maintenance and training cost of the equipment?
- Which counties have acquired a high amount of surplus equipment in earlier years (2006 to 2007) but very little, or none at all, in recent years?
- What kinds of equipment have only recently been distributed through the 1033 Program, but are currently in seemingly high demand?
- What are the surplus items that have no acquisition cost? And is the lack of cost consistent across all records of those items (and similar items)? Is the lack of cost because of sloppy data entry? Or because of the type of item?

Submit the answers to these questions, including all relevant queries, links, screenshots, etc., as a Markdown file named midterm.md in your private Github repo.

Homework

Do the Midterm

Due Thursday, October 30
1. Create a file named midterm.md in your private Github repo
2. Answer the question in the lecture notes.

Course schedule

Tuesday, September 23

The singular of data is anecdote

An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Thursday, September 25

Bad big data

Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Tuesday, September 30

DIY Databases

Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Thursday, October 2

Data in the newsroom

Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Tuesday, October 7

The points of maps

Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
Thursday, October 9

The shapes of maps

A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
Tuesday, October 14

Introduction to SQL for Data Journalism

The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
Thursday, October 16

A needle in multiple haystacks

The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Tuesday, October 21

Haystacks without needles

Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen Degeneres?
Thursday, October 23

Midterm Malarkey with Military Surplus

A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
Tuesday, October 28

Campaign Cash Check

The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
Thursday, October 30

Predicting the elections

With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
- Forecasting
Tuesday, November 4

Election day (No class)

Do your on-the-ground reporting
- No class because of Election Day Coverage
Thursday, November 6

Storytelling with Data Visualization

While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
Tuesday, November 11

Dirty data, cleaned dirt cheap

One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
- OpenRefine
- Clustering
Thursday, November 13

Guest speaker: Simon Rogers

Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
Tuesday, November 18

What we say and what we do

When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Thursday, November 20

Project prep and discussion

Discussion of final projects before the Thanksgiving break.
Tuesday, November 25

Thanksgiving break

Holiday - no class
Thursday, November 27

Thanksgiving break

Holiday - no class
Tuesday, December 2

Project wrapup

Last-minute help on final projects.
Thursday, December 4

Project Show-N-Tell

In-class presentations of our final data projects.

Thursday, October 23

Midterm Malarkey with Military Surplus

Topics

Homework

Previous homework

Midterm information

Download the data:

The 1033 Program

The five data tables in the midterm database

Making sense of the U.S. Census QuickFacts table

Technical data munging note

Sample queries

Quick syntax note

Deriving a column totaling the LESO acquisition costs

What categories of military surplus aren't being distributed?

Midterm questions

Homework

Do the Midterm

Course schedule