COMM 273D | Fall 2014

Tuesday, October 14

Introduction to SQL for Data Journalism

The first in several sessions on learning SQL for the exploration of large datasets.

Topics

  • MySQL / SQLite
  • Select, group, and aggregate
  • Where conditionals
  • SFPD reports of larceny, narcotics, and prostitution
  • Babies, and what we name them

Homework

  • Read the SQL tutorials Due by next class
  • Investigate your own name Due by next class

Jump to the full details on homework assignments

Learning Structured Query Language, abbreviated as SQL, will be the most technically difficult part of this course. But for good reason: if you can read and write SQL, then not only have you learned a whole new language, you have also inadvertently become a programmer, one who is (clumsily) on the path of telling computers what to do rather than the other way around.

I've written a series of tutorials for this class on SQL. The good news is that if you've mastered spreadsheets and pivot tables, then you're already clever enough to know all the concepts that underlie our use of SQL. The bad news is, you certainly won't feel very clever as you diagnose one obtuse syntax error after another. But that can be overcome with the kind of repetition and practice you need to master any new language.

SQL in the wild

We've looked at important data that was scarce and unimportant data that was plentiful. But sometimes data that is both important and plentiful is just sitting there unexamined because no one knows quite knows what to do with it. In their investigation, "Unfit for Duty", Sarasota Herald-Tribune reporters Matt Doig's and Anthony Cormier's first step was to analyze Florida's police misconduct database:

This was a case where the government had this wonderful, informative dataset and they weren’t using it at all except to compile the information. I remember talking to one person at an office and saying: “How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?” They basically said it wasn't their job. That left a huge opportunity for us.

Matthew Doig

SQL Hell

Why is SQL so frustrating to learn? It being a language, learning it involves a fair amount of tedious memorization and practice. The level of existential frustration you may experience will be similar to what it feels like to spend years learning French only to be able to say things less intelligently than a 5-year-old.

Why learn a whole new language? Sometimes, feelings are just more succinctly expressed in a particular language, such as Sacré bleu or J'adore le fromage dans la bibliothèque.

SQL lets us express our data desires in a structured way:

SELECT Descript, Date, Time, Location 
  FROM sfpd_incidents
  WHERE Category="ASSAULT"
    AND Date > "2012";

Which results in a data table like this:

Descript Date Time Location
BATTERY 2012-01-01 00:25 200.0 Block of MARKET ST
AGGRAVATED ASSAULT WITH A KNIFE 2012-01-01 02:07 600.0 Block of EDDY ST
AGGRAVATED ASSAULT WITH BODILY FORCE 2012-01-01 01:20 300.0 Block of MASON ST
INFLICT INJURY ON COHABITEE 2012-01-01 01:20 300.0 Block of MASON ST
INFLICT INJURY ON COHABITEE 2012-01-01 02:00 DRUMM ST / SACRAMENTO ST
BATTERY 2012-01-01 02:05 200.0 Block of KING ST
BATTERY 2012-01-01 22:22 400.0 Block of GEARY ST
BATTERY 2012-01-01 03:35 800.0 Block of MARKET ST
BATTERY 2012-01-01 12:50 1500.0 Block of OAKDALE AV
BATTERY 2012-01-01 08:38 800.0 Block of BRYANT ST

Whereas with user-friendly spreadsheet software, we have to digest a vague instruction set, which assumes you've memorized the interface of a program such as Excel:

  1. Open up the spreadsheet named sfpd_incidents.xls
  2. Open the Filter menu and filter for rows in which the Category column is equal to "ASSAULT" and in which the Date column contains a value greater than "2012"
  3. Hide all of the columns except Descript, Date, Time, and Location

Bye-bye spreadsheets?

img

No, not at all. Part of the reason that SQL is very good at working with massive datasets is because it doesn't try to do the fancy kind of things that spreadsheets do. For example, you won't be writing SQL to (directly) generate a chart. Instead, you have to export from the database into a spreadsheet (or other visualization software).

Spreadsheets are great for manipulating, re-organizing, and visualizing data – but all those features make them inefficient for pure data-querying. A database, on the other hand, won't go out of its way to make things easy for you. But what it does, it does very well and very quickly, whether you're working with a thousand or a billion rows. http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html

Exploring SFPD data with SQL and spreadsheets

Below is a demonstration of both the thought process in exploring a database and the actual queries. I'm using the Sequel Pro GUI for Mac and a MySQL dump of the incident reports from the San Francisco Police Department's CABLE system.

I won't be using just SQL here; the standard spreadsheet techniques are still important for analysis and visualization.

Simple queries

After the SQL dump has been imported, we can retrieve and display all 1.49 million records with this query:

SELECT * FROM sfpd_incidents

The result is not much different than what we get when opening a spreadsheet:

img

Since we're not interested in mapping the records (just yet), we can be selective in which columns to retrieve. In this query, I ignore the location and coordinate fields:

SELECT Category, Descript, Date, Time
  FROM sfpd_incidents

The result:

Category Descript Date Time
FRAUD FORGERY, CREDIT CARD 2003-02-18 16:30
WARRANTS WARRANT ARREST 2003-04-17 22:45
LARCENY/THEFT GRAND THEFT PICKPOCKET 2003-02-18 16:05
DRUG/NARCOTIC SALE OF BASE/ROCK COCAINE 2003-02-18 17:00
OTHER OFFENSES CONSPIRACY 2003-02-18 17:00
OTHER OFFENSES PROBATION VIOLATION 2003-02-18 17:00
DRUNKENNESS UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE 2003-03-31 22:08
WARRANTS WARRANT ARREST 2003-03-31 22:08
WARRANTS WARRANT ARREST 2003-02-22 17:35
WARRANTS WARRANT ARREST 2003-04-01 15:00

To filter for a specific category of crime, we use the WHERE clause:

SELECT Category, Descript, Date, Time
  FROM sfpd_incidents
  WHERE Category = "VANDALISM"
Category Descript Date Time
VANDALISM MALICIOUS MISCHIEF, VANDALISM 2003-05-10 08:18
VANDALISM MALICIOUS MISCHIEF, GRAFFITI 2003-01-13 23:44
VANDALISM MALICIOUS MISCHIEF, BREAKING WINDOWS 2003-02-18 17:00
VANDALISM MALICIOUS MISCHIEF, BREAKING WINDOWS 2003-02-18 20:16
VANDALISM MALICIOUS MISCHIEF, BREAKING WINDOWS 2003-02-18 20:30
VANDALISM MALICIOUS MISCHIEF, VANDALISM 2003-02-18 16:45
VANDALISM MALICIOUS MISCHIEF, GRAFFITI 2003-02-17 07:00
VANDALISM MALICIOUS MISCHIEF, GRAFFITI 2003-02-18 22:55
VANDALISM MALICIOUS MISCHIEF, GRAFFITI 2003-02-18 23:58
VANDALISM MALICIOUS MISCHIEF, VANDALISM 2003-02-19 00:15

Aggregate queries

A common inquiry of crime statistics is: just how many of each kind of crime was reported?

In SQL, we use a GROUP BY clause to aggregate the records by Category, which is similar to a simple pivot table

SELECT Category, COUNT(*)
  FROM sfpd_incidents
  GROUP BY Category
Category COUNT(*)
ARSON 2622
ASSAULT 130097
BAD CHECKS 810
BRIBERY 329
BURGLARY 64890
DISORDERLY CONDUCT 8016
DRIVING UNDER THE INFLUENCE 4052
DRUG/NARCOTIC 99876
DRUNKENNESS 7774
EMBEZZLEMENT 2144
etc.  

We can order this list by count of reports in descending order and show only the top 5 with the ORDER BY and LIMIT clauses, respectively:

SELECT Category, COUNT(*) AS ct 
  FROM sfpd_incidents
  GROUP BY Category
  ORDER BY ct DESC
  LIMIT 5

So from 2003 to 2013, these were the five most reported categories of incidents to the SFPD:

Category ct
LARCENY/THEFT 296585
OTHER OFFENSES 212411
NON-CRIMINAL 151558
ASSAULT 130097
DRUG/NARCOTIC 99876

Change over time

Aggregate counts over long periods of time aren't particularly helpful. We can group by year to get the count of reports per year:

SELECT YEAR(Date) AS yr, COUNT(*)
  FROM sfpd_incidents
  GROUP BY yr
  ORDER BY yr

The count of reports by year:

yr COUNT(*)
2003 152901
2004 150349
2005 144395
2006 136145
2007 135813
2008 139400
2009 137519
2010 121525
2011 117644
2012 123878
2013 132195

Again, not particularly interesting. We don't know if one kind of crime went up while another went down. So we can make our grouping more granular: the count of incident reports, by year, and by category:

SELECT Category, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  GROUP BY yr, Category
  ORDER BY yr, Category

The result contains 403 rows, one for each combination of year and category of report:

Category yr count
ARSON 2003 294
ASSAULT 2003 13782
BAD CHECKS 2003 144
BRIBERY 2003 28
BURGLARY 2003 6088
DISORDERLY CONDUCT 2003 906
DRIVING UNDER THE INFLUENCE 2003 315
DRUG/NARCOTIC 2003 10176
DRUNKENNESS 2003 682
EMBEZZLEMENT 2003 246
…etc.    
VEHICLE THEFT 2013 6209
WARRANTS 2013 7059
WEAPON LAWS 2013 1166

Reduce the clutter

Let's step back a bit. Instead of looking at every year, let's look at the 5-year change in reports, i.e. 2008 and 2013:

SELECT Category, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  GROUP BY yr, Category
  HAVING yr = 2008 OR yr = 2013
  ORDER BY yr, Category

The result is 74 rows:

Category yr count
ARSON 2008 248
ASSAULT 2008 12670
BAD CHECKS 2008 78
BRIBERY 2008 49
BURGLARY 2008 5679
etc.    
SEX OFFENSES, NON FORCIBLE 2013 13
STOLEN PROPERTY 2013 12
SUICIDE 2013 73
SUSPICIOUS OCC 2013 3358
TRESPASS 2013 991
VANDALISM 2013 6654
VEHICLE THEFT 2013 6209
WARRANTS 2013 7059
WEAPON LAWS 2013 1166

A time for pivoting

While this is a more manageable list, it's still not easy to discern which categories have changed the most from 2008 to 2013.

At this point, it's worth remembering why we turned to SQL in the first place: because our spreadsheet software couldn't handle an import of 1.5 million rows (Excel caps out at 1,048,576 rows). By using the GROUP BY aggregate clause, we could reduce the data rows enough to then export into a spreadsheet right off the bat. In fact, the kind of grouping and filtering we've just accomplished may have been easier in pivot tables.

So again, there is no one perfect data tool: if you feel more comfortable organizing data in a spreadsheet, then it's fine to use a database simply to open a large dataset and export the parts you need.

Data reshaping

Now that we have 74 rows of aggregate counts (for the year 2008 and 2013), we can simply copy and paste them into a spreadsheet, and arrange them so that we can do the "subtract one cell from the other" formulas that we're familiar with.

Pasting the list directly into a spreadsheet, and then doing a pivot table with Category as Rows and year as Columns, and then a SUM of the count, will generate something that looks like this:

Category 2008 2013
ARSON 248 228
ASSAULT 12670 10380
BAD CHECKS 78 26
BRIBERY 49 4
BURGLARY 5679 5847
DISORDERLY CONDUCT 789 460
DRIVING UNDER THE INFLUENCE 408 429
DRUG/NARCOTIC 11462 6682

Then we add two columns to calculate the change:

  1. Change: the 2013 column minus the 2008 column
  2. Change pct: the Change column, divided by the 2008 column. Then multiplied by 100 and rounded to make for a nice number.

When sorting the list by Change Pct, here are the top 10 categories based on drop in incidents from 2008 to 2013

Category 2008 2013 Change Change pct
STOLEN PROPERTY 517 12 -505 -98
BRIBERY 49 4 -45 -92
LOITERING 441 52 -389 -88
PROSTITUTION 1663 269 -1394 -84
FORGERY/COUNTERFEITING 2460 710 -1750 -71
BAD CHECKS 78 26 -52 -67
LIQUOR LAWS 499 227 -272 -55
FAMILY OFFENSES 72 35 -37 -51
EXTORTION 52 30 -22 -42
DRUG/NARCOTIC 11462 6682 -4780 -42

Reversing that sort, we can get the top 10 categories according to increase in incidents from 2008 to 2013:

Category 2008 2013 Change Change pct
KIDNAPPING 345 514 169 49
LARCENY/THEFT 25797 35661 9864 38
DRUNKENNESS 710 954 244 34
NON-CRIMINAL 12293 15343 3050 25
WARRANTS 5798 7059 1261 22
DRIVING UNDER THE INFLUENCE 408 429 21 5
VEHICLE THEFT 6053 6209 156 3
MISSING PERSON 4335 4458 123 3
BURGLARY 5679 5847 168 3
FRAUD 2556 2579 23 1
Video

Watch me write out the query, execute it, and paste the results into a Google Spreadsheet to find the change from 2008 to 2013 in incident reports. In this demo, I don't use Pivot Tables, but just do some extra cutting-and-pasting as needed:

Preview of JOINs

Note: This is just a technical aside, for those of you who wonder if there's a way to avoid spreadsheets altogether.

We can actually do this all in a SQL query with a JOIN clause and subqueries, which are two intermediate concepts we haven't yet learned. But here's a preview of the fun that will be:

SELECT 
  ty.Category, 
  COUNT(1) AS count_2008, 
  ty.count_2013, (ty.count_2013 - COUNT(1)) AS `Change`, 
  /* the difference in count */
  ((ty.count_2013 - COUNT(1)) / COUNT(1)) AS `Change pct`
  /* the difference expressed as a ratio of year 2008 incidents*/
  FROM sfpd_incidents 
    AS tx
  INNER JOIN /* create a new table of just 2013 incidents, name it `ty` */
    ( SELECT Category, COUNT(1) AS count_2013 
        FROM sfpd_incidents 
        WHERE YEAR(Date) = 2013 
        GROUP BY Category ) 
    AS ty
    ON tx.Category = ty.Category
  WHERE YEAR(tx.Date) = 2008
  GROUP BY ty.Category
  ORDER BY `Change pct`

Which results in:

Category count_2008 count_2013 Change Change pct
STOLEN PROPERTY 517 12 -505 -0.9768
BRIBERY 49 4 -45 -0.9184
LOITERING 441 52 -389 -0.8821
PROSTITUTION 1663 269 -1394 -0.8382
FORGERY/COUNTERFEITING 2460 710 -1750 -0.7114
BAD CHECKS 78 26 -52 -0.6667
LIQUOR LAWS 499 227 -272 -0.5451
FAMILY OFFENSES 72 35 -37 -0.5139
EXTORTION 52 30 -22 -0.4231
DISORDERLY CONDUCT 789 460 -329 -0.4170
etc. etc.        

Pondering the numbers

Looking at the change in incident reporting over 2008 to 2013 brings up a few potential angles of investigation.

An increase in larceny

Among all the categories, "LARCENY/THEFT" has the biggest jump in numbers of reports: 25,797 to 35,661, in 2008 and 2013, respectively. Let's run a query on the database to see that jump compared to the years pre-2008:

SELECT YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  WHERE Category = 'LARCENY/THEFT'
  GROUP BY yr
  ORDER BY yr
yr count
2003 26853
2004 24822
2005 25623
2006 27349
2007 25767
2008 25797
2009 25545
2010 23888
2011 25161
2012 30119
2013 35661

Graphing that table results in:

img

My first guess would have been to attribute the drastic rise in thefts to the widespread adoption of smartphones. New York Mayor Michael Bloomberg blamed an apparent spike in 2012 crime to Apple lovers:

Major crime in New York City inched up this year, and Mayor Michael R. Bloomberg on Friday fingered the culprit: too many iPhones and iPads were being swiped…

“If you just took away the jump in Apple, we’d be down for the year,” said Marc La Vorgna, the mayor’s press secretary.

On the radio, Mr. Bloomberg said that Apple products appeared to be the preference for many thieves, noting that he was not including thefts of competing devices, like the Samsung Galaxy, in his count.

From 2011 to 2012, theft reports jumped from 25,161 to 30,119, so San Francisco may have experienced the same trend, though a closer look at the subcategories of theft (i.e. theft from auto, pickpocketing, grand theft, etc.) is necessary.

A decrease in narcotics

Let's turn our attention to categories of reports that have decreased and give credit where credit is due. According to our query, DRUG/NARCOTIC reports have dropped by 4,780 (42%).

Of course, a lot has happened since 2008, including a relaxation of laws and attitudes towards cannabis, in California and across the nation. Is this the main cause of the drop in DRUG/NARCOTIC reports from 2008 to 2013 in San Francisco?

The easiest way to confirm this is to group the data reports by subcategory, or the Descript column:

SELECT Descript, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  WHERE Category = 'DRUG/NARCOTIC'
  GROUP BY yr, Descript
  HAVING yr = 2008 OR yr = 2013
  ORDER BY yr, Descript

Using the previous strategy of pasting the results into a pivot table and creating a spreadsheet of the calculated changes, here are the top subcategories of narcotics reports, sorted by change in count:

Descript 2008 2013 Change
POSSESSION OF BASE/ROCK COCAINE 1617 393 -1224
SALE OF BASE/ROCK COCAINE 1270 245 -1025
POSSESSION OF NARCOTICS PARAPHERNALIA 2359 1353 -1006
POSSESSION OF BASE/ROCK COCAINE FOR SALE 914 284 -630
POSSESSION OF MARIJUANA 1108 764 -344
SALE OF MARIJUANA 391 133 -258
POSSESSION OF MARIJUANA FOR SALES 571 350 -221
SALE OF CONTROLLED SUBSTANCE 233 78 -155
SALE OF HEROIN 174 43 -131
POSSESSION OF CONTROLLED SUBSTANCE 404 296 -108
POSSESSION OF HEROIN FOR SALES 199 120 -79
POSSESSION OF CONTROLLED SUBSTANCE FOR SALE 251 173 -78
POSSESSION OF COCAINE 267 214 -53
SALE OF COCAINE 70 20 -50

Marijuana-related reports do drop, but not as much as cocaine-related reports.

As always, our strategy should involve flipping our perspective: if cocaine-related reports have dropped, has any other kind of report increased?

Here is the table sorted in reverse-order of change:

Descript 2008 2013 Change
POSSESSION OF METH-AMPHETAMINE 383 956 573
POSSESSION OF METH-AMPHETAMINE FOR SALE 129 266 137
TRANSPORTATION OF METH-AMPHETAMINE 24 49 25
POSSESSION OF OPIATES 19 40 21
UNDER INFLUENCE OF DRUGS IN A PUBLIC PLACE 40 53 13

Meth-related reports have increased by nearly 140%, though the change in terms of actual number of reports (~700) is significantly smaller than the change in cocaine-related reports (~3,000).

Investigating a drop in prostitution reports

One more category of reports is worth looking at: while DRUG/NARCOTICS reports dropped by 42 percent from 2008 to 2013, PROSTITUTION reports dropped by 84 percent. Like drug-related offenses, prostitution is a category of crime in which attitudes have seemingly shifted. In 2008, more than 40 percent of San Francisco residents voted in favor of decriminalizing prostitution. Though the measure ultimately failed (59 percent opposed), the political debate was heated, thanks to the measure's surprising popularity:

SAN FRANCISCO — When Proposition K was added to Tuesday’s ballot, many people likely snickered at the possibility that San Francisco might take its place alongside such prostitute-friendly havens as Amsterdam and a few rural counties in nearby Nevada.

But this week, it became readily apparent that city officials are not laughing anymore about the measure, which would effectively decriminalize the world’s oldest profession in San Francisco. At a news conference on Wednesday, Mayor Gavin Newsom and other opponents seemed genuinely worried that Proposition K might pass.

“This is not cute. This is not fanciful,” Mr. Newsom said, standing in front of the pink-on-pink facade of a closed massage parlor in the Tenderloin district. “This is a big mistake.”

Since 2008, laws on prostitution haven't changed as significantly as they have for marijuana usage. But technology has most certainly affected all parts of sex and society, both legal and illegal activity, and maybe this has impacted the prostitution business? Or perhaps this drop in reports is simply a result of a coding policy. Let's query the dataset for possible explanations.

For reference's sake, I've posted a spreadsheet of the more than 12,000 incident reports categorized as "PROSTITUTION" from 2003 to 2013.

Visualizing location of incidents

Even in cities where prostitution is illegal, the term "red light district" is used to refer where a concentration of such activity may be found. So instead of looking at bar graphs, let's use the location data in the SFPD incident reports to get an assessment of how the geography of activity has changed, if at all.

The SQL query here is simple, because we're only concerned with getting data in a form usable by a mapping application.

SELECT 
  YEAR(Date) as yr, Descript, Date, Time, Y AS latitude, X AS longitude
  FROM sfpd_incidents
  WHERE Category = "PROSTITUTION"
  HAVING yr = 2008 OR yr = 2013

Using TileMill, we can generate a quick map. Here, the incidents in 2008 are orange, the incidents in 2013 are blue:

img

Here are the two years side-by-side:

img

Ignoring my crude mapping skills, we can at least tell that the number of reports may have decreased – the visual confusion is partly because records will be geocoded to a specific corner, so many of them will sit atop the exact same coordinate – but the general location of prostitution reports hasn't changed significantly.

Per subcategory?

As in the drug reports, it's worth taking a look at the Descript field:

SELECT Descript, Year(Date) as yr, COUNT(1) as c
  FROM sfpd_incidents
  WHERE Category = 'PROSTITUTION'
    GROUP BY yr, Descript
  HAVING yr = 2008 OR yr = 2013
    ORDER BY yr, Descript
Descript 2008 2013 Change Change pct
INMATE/KEEPER OF HOUSE OF PROSTITUTION 8 0 -8 -100
SOLICITS TO VISIT HOUSE OF PROSTITUTION 987 1 -986 -99.9
SOLICITS LEWD ACT 12 1 -11 -91.7
ENGAGING IN LEWD CONDUCT - PROSTITUTION RELATED 21 5 -16 -76.2
SOLICITS FOR ACT OF PROSTITUTION 602 171 -431 -71.6
PLACING WIFE IN HOUSE OF PROSTITUTION 3 1 -2 -66.7
PIMPING 22 37 15 68.2
PANDERING 7 22 15 214.3
INDECENT EXPOSURE - PROSTITUTION RELATED 1 31 30 3000

The majority of PROSTITUTION reports fall under SOLICITS FOR ACT OF PROSTITUTION and SOLICITS TO VISIT HOUSE OF PROSTITUTION. Both have fallen drastically in a five-year period, though the latter subcategory has all been eliminated: going from 987 reports in 2008 to just one report in 2013.

Let's make a chart

Let's visually quantify the drop in the main two subcategories of prostitution reports, across all the years from 2003 to 2013:

SELECT Descript, YEAR(Date) as yr, COUNT(1) as x
  FROM sfpd_incidents
  WHERE Category = "PROSTITUTION"  
    AND 
      Descript = "SOLICITS FOR ACT OF PROSTITUTION" 
      OR Descript = "SOLICITS TO VISIT HOUSE OF PROSTITUTION"
  GROUP BY Descript, yr
  ORDER BY Descript, yr

The following results table can be pasted into a spreadsheet and pivoted (with yr creating the rows and Descript creating the columns).

Descript yr x
SOLICITS FOR ACT OF PROSTITUTION 2003 1111
SOLICITS FOR ACT OF PROSTITUTION 2004 825
SOLICITS FOR ACT OF PROSTITUTION 2005 732
SOLICITS FOR ACT OF PROSTITUTION 2006 645
SOLICITS FOR ACT OF PROSTITUTION 2007 899
SOLICITS FOR ACT OF PROSTITUTION 2008 602
SOLICITS FOR ACT OF PROSTITUTION 2009 521
SOLICITS FOR ACT OF PROSTITUTION 2010 476
SOLICITS FOR ACT OF PROSTITUTION 2011 385
SOLICITS FOR ACT OF PROSTITUTION 2012 224
SOLICITS FOR ACT OF PROSTITUTION 2013 171
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2003 843
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2004 627
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2005 291
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2006 563
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2007 876
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2008 987
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2009 855
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2010 117
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2011 7
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2012 2
SOLICITS TO VISIT HOUSE OF PROSTITUTION 2013 1

The resulting area chart:

img

Researching the data

Never take data at face value. This interesting drop in prostitution reports could be a result of a data glitch or ignorance on my part.

Look for previously-reported facts: On April 19, 2011, the San Francisco Examiner published an article titled, San Francisco police crack down on Polk Street prostitution. One of the numerical details:

More cops are on the streets making arrests, Mannix said. During a 24-hour period last week, at least four prostitutes were arrested, including a 16-year-old runaway.

Translating the above statement into a query:

    SELECT IncidntNum, Descript, Date, Time, PdDistrict, Resolution, Location
    FROM sfpd_incidents
    WHERE Category = 'PROSTITUTION'
      AND Date BETWEEN  '2011-04-11' AND '2011-04-19'
      AND Location LIKE '%POLK%'
IncidntNum Descript Date Time PdDistrict Resolution Location
110296727 PANDERING 2011-04-12 01:38 NORTHERN ARREST, BOOKED PINE ST / POLK ST
110296727 PIMPING 2011-04-12 01:38 NORTHERN ARREST, BOOKED PINE ST / POLK ST
110302437 SOLICITS FOR ACT OF PROSTITUTION 2011-04-13 23:00 NORTHERN ARREST, CITED CALIFORNIA ST / POLK ST
110302540 SOLICITS FOR ACT OF PROSTITUTION 2011-04-14 00:05 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST
110302653 SOLICITS FOR ACT OF PROSTITUTION 2011-04-14 01:05 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST
110308758 SOLICITS FOR ACT OF PROSTITUTION 2011-04-16 00:35 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST
110308952 SOLICITS FOR ACT OF PROSTITUTION 2011-04-16 02:15 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST
110309160 SOLICITS FOR ACT OF PROSTITUTION 2011-04-16 03:40 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST
110308827 SOLICITS FOR ACT OF PROSTITUTION 2011-04-16 01:29 NORTHERN ARREST, CITED POLK ST / CALIFORNIA ST

There were four prostitution-related arrests on April 16, but frankly, that could just be a coincidence. Capt. Mannix doesn't specify if those four arrests all happened on Polk Street, for example.

Another story of a sweep, published Jan. 20, 2010: Prostitution sting nets 51 so far

A crackdown on street prostitution in response to complaints from Polk Street and Russian Hill residents has netted 51 arrests, but officials say the crackdown can’t last forever…Of the 51 people arrested so far in the current operation, 36 were women accused of prostitution offenses, according to Lt. Tom Cleary of the vice crimes unit. Fifteen were men paying for sex. Cleary emphasized that a majority of those arrested came from outside The City.

Using this aggregate SQL:

  SELECT SUBSTRING(Date, 1, 7) as yearmth, COUNT(1) AS count
  FROM sfpd_incidents
    WHERE Category = 'PROSTITUTION' AND pddistrict = 'NORTHERN'
      AND YEAR(Date) BETWEEN  2008 AND 2013
    GROUP BY yearmth
    order by count DESC

– we observe that the month of January 2010 seems to be a high-water mark in prostitution reports from the Northern district (which covers Polk St. and Russian Hill):

yearmth count
2010-01 57
2008-07 47
2008-02 46
2008-06 40
2008-10 38
2008-01 36

Sociological reasons?

Should an 84% drop in prosecution reports from 2008 to 2013 be a surprise, given voter support for decriminalizing prostitution? Whatever the ambivalence of the average resident, I didn't find any direct reports of the SFPD deciding to go lenient on prostitution.

On June 16, 2010, the SF Weekly took an in-depth look at the SFPD's arguably aggressive anti-prostitution stings:

In San Francisco, where 41 percent of voters approved a measure that would have essentially decriminalized prostitution and where a sex worker is currently running for the Board of Supervisors, men are regularly arrested just for talking to undercover cops posing as streetwalkers. If a man avoids eye contact and walks on by, he's fine. But if he shows any interest, the decoy will approach him. By the time he knows she is selling sex, he's headed for trouble. After that, any expression of interest — even a sarcastic "Oops, don't have enough cash, guess I'll come back later" — is enough to land him in handcuffs…

A decade ago, almost all the men arrested in the program's stings had shown more than interest; they had agreed to sex acts for money, according to statistics from the District Attorney's office. But in recent years, as would-be johns have gotten savvy to police tactics, roughly a third of those arrested in the stings — between 100 and 200 a year — were cited simply for demonstrating, in the eyes of the police, that they had "an intent to commit prostitution." According to the California penal code, there's no "single circumstance or combination of circumstances" that proves this intent; it has to be determined case by case.

The article cites a figure from a 2009 audit of the city's budget: previously, the sting program was funded by the fees paid by johns. However, "as prostitution has shifted to the Internet, it has become more difficult for police officers to make arrests in street-level stings. Since 2005, the number of johns arrested has dropped steadily."

In 2011, the San Francisco Public Press reported on the controversial use of anti-trafficking resources to do street-level sweeps.

This past summer, San Francisco prostitution made the news again when business owners complained about activity in Union Square.

This article quotes police as being more interested the stemming the bigger issues surrounding prostitution, rather than focusing on street-level busts:

Police said they plan to clean up Union Square, but also want to focus on human trafficking and helping women who have been forced into prostitution.


So what's the explanation for the 84% drop in SFPD prostitution reports? I sent a request to SFPD's public information officer but haven't heard back. I'm still open to the possibility of a mis-query on my part, or a coding change in the SFPD CABLE system that classifies the reports differently. Either way, I hope this has given you some ideas on how to query data along a variety of dimensions and time frames to find leads for what may be interesting stories.

Now all you have to do is grind in the syntax. Good luck!

Homework

  • Read the SQL tutorials

    Due by next class

    I’ve put together a series of step-by-step tutorials on basic SQL syntax. Read through them and repeat every exercise if necessary. Don’t just copy and paste; sometimes, typing out the queries forces you to slow down and think them through.

  • Investigate your own name

    Due by next class

    Using the Social Security Administration’s database of baby names, find out how popular (or unpopular) your name has been since 1980 (or 1880, if you wish).

    Download the appropriate file for your system. You may want to try the 1980 file if your computer is having trouble with the full 1880 file.

    SQLite

    SQL

    To get some inspiration, you can check out the SSA baby names website in which they let you perform a limited search on names.

    Please do this homework in a Markdown file in your private Github repo and call it, sql-baby-names.md

    Using the ssa_baby_names table

    1. In the year 2013, find out how many babies had your name
    2. Find the highest-rank (and the year) that your name has achieved among baby names
    3. Between the years 1980 and 2013, find how many babies in total have been listed by the Social Security Administration in this data.
    4. Find out how many babies (roughly) have had your name from 1980 to 2013
    5. Find the year that had the most male babies born
    6. Find the year in which your name had the highest increase in names-per-100k-babies born
    7. Find the year in which your name had the highest decrease in names-per-100k-babies born
    8. Make a line chart showing how your name has changed in popularity over the years
    9. Find out who in our class had the most popular baby name in 1980.
    10. Find out who in our class had the most popular baby name in 2013.
    11. Find out who in our class has the name that the most babies throughout U.S. history have.

    Using the ssa_baby_names_by_state table

    For this section, pick a adoptive state if you were not born in the U.S.

    1. In your home state, find out how many babies had your name in the year that you were born.
    2. In your home state, find out how many babies had your name in 2013
    3. Find the state in which your name is the most popular in 2013
    4. Find the state in which your name is least popular (but still registers, i.e. has a minimum of 5 babies) in 2013
    5. Make a line chart with two lines:

      1. The popularity of your name in your home state
      2. The popularity of your name in California

      (If you were born in California, pick another state, like Florida.)

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.