COMM 273D | Fall 2014

Thursday, October 23

Midterm Malarkey with Military Surplus

A casual midterm covering the range of data analysis and programming skills acquired so far.

Topics

  • A midterm on SQL and data
  • Data on military surplus distributed to U.S. counties
  • U.S. Census QuickFacts

Homework

  • Do the Midterm Due Thursday, October 30

Jump to the full details on homework assignments

The notes for this section consist entirely of the description for the dataset used for the midterm, example queries to help you understand the dataset, and then a few questions for you to tackle on your own.

Everything covered here has been covered in the tutorials, though you'll probably want to make sure you have a good understanding of LEFT joins and many-to-one-relationships.

Update: Here are the answers

Previous homework

Regarding the time-series (or histogram) chart, by political party, of Congressmembers and the number of Tweets mentioning "ebola" as of October 20, 2014.

Using the MySQL dump or SQLite dump of of Congress and Twitter data, run this query to find every tweet mentioning "ebola" and selecting the relevant fields:

SELECT 
    members.first_name, members.last_name, 
    members.party, members.current_role, members.state,
    social_accounts.twitter_screen_name,
    tweets.text, DATE(tweets.created_at) as tweet_date
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  ORDER BY tweet_date

The result is 811 tweets, which I've uploaded to this Google Spreadsheet. From here, it's just adding some derived date columns and doing a pivot table.

I've made a couple of editorial choices: the database contains only 2 tweets (both by Democrats) in the year 2013, and only 2 tweets by an Independent. The outliers would skew the chart (i.e. create a lot of empty space) so I've filtered them out, so that the chart only shows 2014 tweets by Democrats and Republicans.

Along the x-axis, I've chosen to do the week number (which can be found using Google Spreadsheet's WEEKNUM function, or MySQL's WEEK function) derived from each tweet's created_at value.

The resulting chart:

ebola chart

Note: Alternatively, if you wanted to reduce the amount of Pivot Table work, you could've done the aggregations inside of SQL.

For MySQL:

SELECT 
    members.party, 
    WEEK(tweets.created_at) AS tweet_week,
    YEAR(tweets.created_at) AS tweet_year,
    COUNT(*) AS tweet_count
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  GROUP BY party, tweet_year, tweet_week
  ORDER BY party, tweet_year, tweet_week

For SQLite (remember the list of format strings for the strftime function?):

SELECT 
    members.party, 
    strftime('%W', tweets.created_at) AS tweet_week,
    strftime('%Y', tweets.created_at) AS tweet_year,
    COUNT(*) AS tweet_count
  FROM members
  JOIN social_accounts
    ON members.bioguide_id = social_accounts.bioguide_id
  JOIN tweets
    ON social_accounts.twitter_screen_name = tweets.screen_name
  WHERE
    tweets.text LIKE '%ebola%'
  GROUP BY party, tweet_year, tweet_week
  ORDER BY party, tweet_year, tweet_week

Midterm information

The following section is all about the midterm and the data we're using.

Download the data:

The 1033 Program

img Wikipedia

Photo courtesy of Wikipedia user Loavesofbread licensed under Creative Commons Attribution-Share Alike 4.0 International

The 1033 Program was established to allow the U.S. Defense Dept. to give away its surplus equipment to U.S. law enforcement agencies. More than $5.1 billion of military gear has been distributed to the nearly 8,000 law enforcement agencies who participate in the program.

Congress initially authorized the transfer of military gear for the War on Drugs. The 1033 program was later open to any U.S. law enforcement agency, though preference is given to requests related to counter-drug and counter-terrorism requests.

The program has, of late, come under scrutiny because of photos of heavily-equipped police during the civil unrest in Ferguson during 2014.

NPR has some documentation arising from their analysis of the federal 1033 Program data and statewide data that they've acquired. You may want to read it to figure out what some of the data columns mean.

The California Governor's Office of Emergency Services has posted a helpful 1033 Program page in which you can find links to more background information and data.

MuckRock has been making public records requests for additional data from each state. I haven't included their tables into this midterm, but their reporting is well-worth reading to understand the issues and context of the 1033 program.

The five data tables in the midterm database

  • The SQL version.
  • The SQLite version

  • leso - the main table of interest. It contains the list of military surplus items distributed by the 1033 program to law enforcement agencies from January 2006 to July 2014. It is derived from the official spreadsheet maintained by the Law Enforcement Support OFfice, which you can download in Excel format here.

  • county_boundaries - A list of KML boundary data for every U.S. county, courtesy of Google. See their Fusion Table here

  • county_ansi - County names and FIPS codes. I've "cleaned" up the county field so that it matches the ways counties were (mis)spelled in the leso table. Read more about ANSI and FIPS here.

  • psc - A list of Product Service Codes and descriptions, from the Federal Procurement Data System. You may find it useful for browsing the general categories of things that the 1033 program distributes. Read more about PSCs here.

  • census_quickfacts - A variety of U.S. Census demographic and population statistics per county. Please read the DataDict.txt file provided on the U.S. Census's overview page.

Making sense of the U.S. Census QuickFacts table

To reiterate: the census_quickfacts table is not going to make much sense because the headers are things like AGE775213 (Persons 65 years and over, percent, 2013) and RHI725213 (Hispanic or Latino, percent, 2013). So you must read the Census Quickfacts Data Dictionary to make sense of it.

Technical data munging note

Like every data set we've dealt with so far, the Census and LESO data is what we call, "messy". For example, the original LESO spreadsheet, which I downloaded (direct link) via the California state site, comprised separate lists divided by alphabetical ordering of the states. So I had to manually copy and paste the sheets together to make one sheet.

The Census QuickFacts table was not in itself "messy"; the problem is that the LESO data only refers to state and county names, in all uppercase, and the LESO data does not follow the ANSI standard, apparently. For example, "Bristol City, Virginia" is referred in LESO as just "BRISTOL". Whereas "Baltimore city" is correctly keyed as "BALTIMORE CITY". The inconsistencies were without rhyme or reason, leaving me to apply custom queries to fix up at least several hundreds of rows.

The upshot: if you think you have a cool story from this midterm data, don't run with it just yet. We'll want to doublecheck that I didn't mess up my data munging in such a way that it impacts your queries.

Sample queries

Most of the work will involve just inspecting the tables and what they contain. Here are some example queries to demonstrate what's in them.

Quick syntax note

When referring to a column that has a space in the name, such as Item Cost, you must enclose the table name either inside of backticks (the little used key in the top-left corner of your keyboard, under the tilde sign), `

Examples:

SELECT `Item Cost` FROM leso

This will also work:

SELECT leso.`Item Cost` FROM leso

And this if you really want to be explicit about table names:

SELECT `leso`.`Item Cost` FROM leso
SELECT county_ansi.state, county_ansi.county, 
      AGE775213 as senior_pop, county_boundaries.geometry
  FROM county_ansi
  INNER JOIN census_quickfacts
     ON county_ansi.fips = census_quickfacts.fips
  INNER JOIN county_boundaries 
     ON county_ansi.fips = county_boundaries.fips 

Exporting the result of this table to Fusion Tables creates a map like this:

See the Fusion Table here.

Deriving a column totaling the LESO acquisition costs

The leso table includes a Quantity column and an Acquisition Cost. If you want to find the total cost per line item, then you need a column in which Quantity is multiplied by Acquisition Cost:

SELECT `Item Name`, UI, Quantity,`Acquisition Cost`, 
  (Quantity * `Acquisition Cost`) AS totes_cost
  FROM leso
  ORDER BY Quantity DESC
  LIMIT 10
Item Name UI Quantity Acquisition Cost totes_cost
WIRE,ELECTRICAL Foot 91000 0.58 52780.00
SCREW,CAP,SOCKET HEAD Each 43822 0.23 10079.06
MAGAZINE,CARTRIDGE Each 40000 9.31 372400.02
BANDAGE KIT,ELASTIC EA 22428 7.00 156996.00
CPCV BRASS LB 21100 1.00 21100.00
PANELING MATERIAL,BUILDING Pound 10000 1.57 15700.00
MAGAZINE,CARTRIDGE Each 9415 9.31 87653.65
BUTTSTOCK,SUBASSEMBLY Each 9000 15.03 135270.00
5 56 EXPENDED BALL BRASS LB 7700 1.00 7700.00
DRESSING,COMPRESSION Each 7545 10.15 76581.75

If you want, you can create a new column named totes_cost, specify it as an INT/INTEGER or FLOAT/DECIMAL and then update it with the calculation:

UPDATE leso
  SET totes_cost = (Quantity * `Acquisition Cost`);

This way, you don't have to keep doing the multiplication for every query.

What categories of military surplus aren't being distributed?

Not every kind of military surplus is requested or eligible to be shipped out to your local police department. To find out which, do a LEFT JOIN from the psc to the leso table:

SELECT psc.`PSC CODE`, 
    psc.`PRODUCT AND SERVICE CODE FULL NAME`,  
    psc.`PRODUCT AND SERVICE CODE NOTES`, 
    psc.`PRODUCT AND SERVICE CODE INICLUDES`
  FROM psc
  LEFT JOIN leso
  ON psc.`psc code` = leso.psc_code
  WHERE leso.psc_code IS NULL
  /* include only PSC Codes that are 4 characters long */
  AND LENGTH(psc.`PSC CODE`) = 4 

Apparently, nuclear bombs and flamethrowers are not yet eligible to be acquired through the 1033 Program:

PSC CODE PRODUCT AND SERVICE CODE FULL NAME PRODUCT AND SERVICE CODE NOTES PRODUCT AND SERVICE CODE INICLUDES
1025 Guns, over 150 mm through 200 mm Includes Firing Platforms; Mounts; Gun Shields.    
1030 Guns, over 200 mm through 300 mm Includes Gun Yokes; Rammers; Reflectors.    
1035 Guns, over 300 mm   Breech Mechanisms; Training Gears; Power Drives.
1040 Chemical Weapons and Equipment   Flame Throwers; Smoke Generators.
1045 Launchers, Torpedo and Depth Charge   Depth Charge Tracks; Torpedo Tubes.
1055 Launchers, Rocket and Pyrotechnic   Airborne Rocket Launchers adaptable to guided missile use.
1070 Nets and Booms, Ordnance This class includes nets and booms for harbor defense only.  
1075 Degaussing and Mine Sweeping Equipment    
1105 Nuclear Bombs This class includes nuclear weapons (including bombs), which are designed to be dropped from an aircraft. Ballistic cases, tail assemblies, retardation devices, and other peculiar components which are not classifiable elsewhere.
1110 Nuclear Projectiles This class includes nuclear weapons which are designed to be propelled from a recoilless rifle, gun, howitzer, or the like, and which are not designed to be self propelled. Ogive sections, body sections, bases, and other peculiar components which are not classifiable elsewhere.

Midterm questions

For all of these questions, besides giving the answer, please include the SQL quer(ies) and other spreadsheet/data steps you took to get to the answer.

  1. Query the database and create a list of every county (with state) that has acquired at least one mine-resistant armored vehicle. Then map that list (Fusion Tables will probably be the easiest method). The map but not exactly the same, as the list behind this NYT interactive map of armored vehicles.
  2. Query the database and find the most expensive single item that can be acquired through the 1033 program. Then generate a list of all counties that has so far acquired at least one of these things (you'll likely do two queries to answer this question).
  3. Query the database to get the top 10 counties ordered by the total number of guns acquired through the 1033 program. Your answer should look very similar to what NPR found under the Total Guns Acquired graph (e.g. 3,452 guns for Los Angeles)
  4. Query the database to get the top 10 counties ordered by number of guns acquired per 1,000 people using the Census's 2013 population estimate. Again, your answers should look similar (but not exact) to what NPR found, e.g. 28 guns/1,000 people in Franklin, KY.
  5. Create a time-series/histogram showing something you find interesting in the 1033 Program data. Examples: Number of gas masks versus night-goggles distributed by year. Monetary value of pants and trousers versus rifles, by year.
  6. Find all of the counties that have not acquired a single thing in the leso table, and then map those counties. If you order your list by population, the top two counties should be Kings County and Bronx County in New York (for reasons that should make sense if you have lived in New York).
  7. Find the county that has acquired the most from the 1033 Program in terms of total acquisition cost and generate a list of the items that county has acquired.
  8. Write a 300-word story memo on an interesting query (or queries) of your own. The memo should include examples/results of a data query and why you think said results are newsworthy or worth further investigation. Your memo, ideally, would involve some additional research to see if anyone else has found anything related to your inquiry. A chart/graphic is optional.

    Example story ideas:

    • Finding the smallest counties (by population) that have acquired the most heavy equipment despite no obvious need for such equipment. Is that equipment in use, and how? What is the maintenance and training cost of the equipment?
    • Which counties have acquired a high amount of surplus equipment in earlier years (2006 to 2007) but very little, or none at all, in recent years?
    • What kinds of equipment have only recently been distributed through the 1033 Program, but are currently in seemingly high demand?
    • What are the surplus items that have no acquisition cost? And is the lack of cost consistent across all records of those items (and similar items)? Is the lack of cost because of sloppy data entry? Or because of the type of item?

Submit the answers to these questions, including all relevant queries, links, screenshots, etc., as a Markdown file named midterm.md in your private Github repo.

Homework

  • Do the Midterm

    Due Thursday, October 30
    1. Create a file named midterm.md in your private Github repo
    2. Answer the question in the lecture notes.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.