COMM 273D | Fall 2014

Tuesday, October 14

Introduction to SQL for Data Journalism

The first in several sessions on learning SQL for the exploration of large datasets.

Topics

MySQL / SQLite
Select, group, and aggregate
Where conditionals
SFPD reports of larceny, narcotics, and prostitution
Babies, and what we name them

Homework

Read the SQL tutorials Due by next class
Investigate your own name Due by next class

Jump to the full details on homework assignments

Learning Structured Query Language, abbreviated as SQL, will be the most technically difficult part of this course. But for good reason: if you can read and write SQL, then not only have you learned a whole new language, you have also inadvertently become a programmer, one who is (clumsily) on the path of telling computers what to do rather than the other way around.

I've written a series of tutorials for this class on SQL. The good news is that if you've mastered spreadsheets and pivot tables, then you're already clever enough to know all the concepts that underlie our use of SQL. The bad news is, you certainly won't feel very clever as you diagnose one obtuse syntax error after another. But that can be overcome with the kind of repetition and practice you need to master any new language.

SQL in the wild

We've looked at important data that was scarce and unimportant data that was plentiful. But sometimes data that is both important and plentiful is just sitting there unexamined because no one knows quite knows what to do with it. In their investigation, "Unfit for Duty", Sarasota Herald-Tribune reporters Matt Doig's and Anthony Cormier's first step was to analyze Florida's police misconduct database:

This was a case where the government had this wonderful, informative dataset and they weren’t using it at all except to compile the information. I remember talking to one person at an office and saying: “How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?” They basically said it wasn't their job. That left a huge opportunity for us.

– Matthew Doig

SQL Hell

Why is SQL so frustrating to learn? It being a language, learning it involves a fair amount of tedious memorization and practice. The level of existential frustration you may experience will be similar to what it feels like to spend years learning French only to be able to say things less intelligently than a 5-year-old.

Why learn a whole new language? Sometimes, feelings are just more succinctly expressed in a particular language, such as Sacré bleu or J'adore le fromage dans la bibliothèque.

SQL lets us express our data desires in a structured way:

SELECT Descript, Date, Time, Location 
  FROM sfpd_incidents
  WHERE Category="ASSAULT"
    AND Date > "2012";

Which results in a data table like this:

Descript	Date	Time	Location
BATTERY	2012-01-01	00:25	200.0 Block of MARKET ST
AGGRAVATED ASSAULT WITH A KNIFE	2012-01-01	02:07	600.0 Block of EDDY ST
AGGRAVATED ASSAULT WITH BODILY FORCE	2012-01-01	01:20	300.0 Block of MASON ST
INFLICT INJURY ON COHABITEE	2012-01-01	01:20	300.0 Block of MASON ST
INFLICT INJURY ON COHABITEE	2012-01-01	02:00	DRUMM ST / SACRAMENTO ST
BATTERY	2012-01-01	02:05	200.0 Block of KING ST
BATTERY	2012-01-01	22:22	400.0 Block of GEARY ST
BATTERY	2012-01-01	03:35	800.0 Block of MARKET ST
BATTERY	2012-01-01	12:50	1500.0 Block of OAKDALE AV
BATTERY	2012-01-01	08:38	800.0 Block of BRYANT ST

Whereas with user-friendly spreadsheet software, we have to digest a vague instruction set, which assumes you've memorized the interface of a program such as Excel:

Open up the spreadsheet named sfpd_incidents.xls
Open the Filter menu and filter for rows in which the Category column is equal to "ASSAULT" and in which the Date column contains a value greater than "2012"
Hide all of the columns except Descript, Date, Time, and Location

Bye-bye spreadsheets?

No, not at all. Part of the reason that SQL is very good at working with massive datasets is because it doesn't try to do the fancy kind of things that spreadsheets do. For example, you won't be writing SQL to (directly) generate a chart. Instead, you have to export from the database into a spreadsheet (or other visualization software).

Spreadsheets are great for manipulating, re-organizing, and visualizing data – but all those features make them inefficient for pure data-querying. A database, on the other hand, won't go out of its way to make things easy for you. But what it does, it does very well and very quickly, whether you're working with a thousand or a billion rows. http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html

Exploring SFPD data with SQL and spreadsheets

Below is a demonstration of both the thought process in exploring a database and the actual queries. I'm using the Sequel Pro GUI for Mac and a MySQL dump of the incident reports from the San Francisco Police Department's CABLE system.

I won't be using just SQL here; the standard spreadsheet techniques are still important for analysis and visualization.

Simple queries

After the SQL dump has been imported, we can retrieve and display all 1.49 million records with this query:

SELECT * FROM sfpd_incidents

The result is not much different than what we get when opening a spreadsheet:

Since we're not interested in mapping the records (just yet), we can be selective in which columns to retrieve. In this query, I ignore the location and coordinate fields:

SELECT Category, Descript, Date, Time
  FROM sfpd_incidents

The result:

Category	Descript	Date	Time
FRAUD	FORGERY, CREDIT CARD	2003-02-18	16:30
WARRANTS	WARRANT ARREST	2003-04-17	22:45
LARCENY/THEFT	GRAND THEFT PICKPOCKET	2003-02-18	16:05
DRUG/NARCOTIC	SALE OF BASE/ROCK COCAINE	2003-02-18	17:00
OTHER OFFENSES	CONSPIRACY	2003-02-18	17:00
OTHER OFFENSES	PROBATION VIOLATION	2003-02-18	17:00
DRUNKENNESS	UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE	2003-03-31	22:08
WARRANTS	WARRANT ARREST	2003-03-31	22:08
WARRANTS	WARRANT ARREST	2003-02-22	17:35
WARRANTS	WARRANT ARREST	2003-04-01	15:00

To filter for a specific category of crime, we use the WHERE clause:

SELECT Category, Descript, Date, Time
  FROM sfpd_incidents
  WHERE Category = "VANDALISM"

Category	Descript	Date	Time
VANDALISM	MALICIOUS MISCHIEF, VANDALISM	2003-05-10	08:18
VANDALISM	MALICIOUS MISCHIEF, GRAFFITI	2003-01-13	23:44
VANDALISM	MALICIOUS MISCHIEF, BREAKING WINDOWS	2003-02-18	17:00
VANDALISM	MALICIOUS MISCHIEF, BREAKING WINDOWS	2003-02-18	20:16
VANDALISM	MALICIOUS MISCHIEF, BREAKING WINDOWS	2003-02-18	20:30
VANDALISM	MALICIOUS MISCHIEF, VANDALISM	2003-02-18	16:45
VANDALISM	MALICIOUS MISCHIEF, GRAFFITI	2003-02-17	07:00
VANDALISM	MALICIOUS MISCHIEF, GRAFFITI	2003-02-18	22:55
VANDALISM	MALICIOUS MISCHIEF, GRAFFITI	2003-02-18	23:58
VANDALISM	MALICIOUS MISCHIEF, VANDALISM	2003-02-19	00:15

Aggregate queries

A common inquiry of crime statistics is: just how many of each kind of crime was reported?

In SQL, we use a GROUP BY clause to aggregate the records by Category, which is similar to a simple pivot table

SELECT Category, COUNT(*)
  FROM sfpd_incidents
  GROUP BY Category

Category	COUNT(*)
ARSON	2622
ASSAULT	130097
BAD CHECKS	810
BRIBERY	329
BURGLARY	64890
DISORDERLY CONDUCT	8016
DRIVING UNDER THE INFLUENCE	4052
DRUG/NARCOTIC	99876
DRUNKENNESS	7774
EMBEZZLEMENT	2144
etc.

We can order this list by count of reports in descending order and show only the top 5 with the ORDER BY and LIMIT clauses, respectively:

SELECT Category, COUNT(*) AS ct 
  FROM sfpd_incidents
  GROUP BY Category
  ORDER BY ct DESC
  LIMIT 5

So from 2003 to 2013, these were the five most reported categories of incidents to the SFPD:

Category	ct
LARCENY/THEFT	296585
OTHER OFFENSES	212411
NON-CRIMINAL	151558
ASSAULT	130097
DRUG/NARCOTIC	99876

Change over time

Aggregate counts over long periods of time aren't particularly helpful. We can group by year to get the count of reports per year:

SELECT YEAR(Date) AS yr, COUNT(*)
  FROM sfpd_incidents
  GROUP BY yr
  ORDER BY yr

The count of reports by year:

yr	COUNT(*)
2003	152901
2004	150349
2005	144395
2006	136145
2007	135813
2008	139400
2009	137519
2010	121525
2011	117644
2012	123878
2013	132195

Again, not particularly interesting. We don't know if one kind of crime went up while another went down. So we can make our grouping more granular: the count of incident reports, by year, and by category:

SELECT Category, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  GROUP BY yr, Category
  ORDER BY yr, Category

The result contains 403 rows, one for each combination of year and category of report:

Category	yr	count
ARSON	2003	294
ASSAULT	2003	13782
BAD CHECKS	2003	144
BRIBERY	2003	28
BURGLARY	2003	6088
DISORDERLY CONDUCT	2003	906
DRIVING UNDER THE INFLUENCE	2003	315
DRUG/NARCOTIC	2003	10176
DRUNKENNESS	2003	682
EMBEZZLEMENT	2003	246
…etc.
VEHICLE THEFT	2013	6209
WARRANTS	2013	7059
WEAPON LAWS	2013	1166

Reduce the clutter

Let's step back a bit. Instead of looking at every year, let's look at the 5-year change in reports, i.e. 2008 and 2013:

SELECT Category, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  GROUP BY yr, Category
  HAVING yr = 2008 OR yr = 2013
  ORDER BY yr, Category

The result is 74 rows:

Category	yr	count
ARSON	2008	248
ASSAULT	2008	12670
BAD CHECKS	2008	78
BRIBERY	2008	49
BURGLARY	2008	5679
etc.
SEX OFFENSES, NON FORCIBLE	2013	13
STOLEN PROPERTY	2013	12
SUICIDE	2013	73
SUSPICIOUS OCC	2013	3358
TRESPASS	2013	991
VANDALISM	2013	6654
VEHICLE THEFT	2013	6209
WARRANTS	2013	7059
WEAPON LAWS	2013	1166

A time for pivoting

While this is a more manageable list, it's still not easy to discern which categories have changed the most from 2008 to 2013.

At this point, it's worth remembering why we turned to SQL in the first place: because our spreadsheet software couldn't handle an import of 1.5 million rows (Excel caps out at 1,048,576 rows). By using the GROUP BY aggregate clause, we could reduce the data rows enough to then export into a spreadsheet right off the bat. In fact, the kind of grouping and filtering we've just accomplished may have been easier in pivot tables.

So again, there is no one perfect data tool: if you feel more comfortable organizing data in a spreadsheet, then it's fine to use a database simply to open a large dataset and export the parts you need.

Data reshaping

Now that we have 74 rows of aggregate counts (for the year 2008 and 2013), we can simply copy and paste them into a spreadsheet, and arrange them so that we can do the "subtract one cell from the other" formulas that we're familiar with.

Pasting the list directly into a spreadsheet, and then doing a pivot table with Category as Rows and year as Columns, and then a SUM of the count, will generate something that looks like this:

Category	2008	2013
ARSON	248	228
ASSAULT	12670	10380
BAD CHECKS	78	26
BRIBERY	49	4
BURGLARY	5679	5847
DISORDERLY CONDUCT	789	460
DRIVING UNDER THE INFLUENCE	408	429
DRUG/NARCOTIC	11462	6682

Then we add two columns to calculate the change:

Change: the 2013 column minus the 2008 column
Change pct: the Change column, divided by the 2008 column. Then multiplied by 100 and rounded to make for a nice number.

When sorting the list by Change Pct, here are the top 10 categories based on drop in incidents from 2008 to 2013

Category	2008	2013	Change	Change pct
STOLEN PROPERTY	517	12	-505	-98
BRIBERY	49	4	-45	-92
LOITERING	441	52	-389	-88
PROSTITUTION	1663	269	-1394	-84
FORGERY/COUNTERFEITING	2460	710	-1750	-71
BAD CHECKS	78	26	-52	-67
LIQUOR LAWS	499	227	-272	-55
FAMILY OFFENSES	72	35	-37	-51
EXTORTION	52	30	-22	-42
DRUG/NARCOTIC	11462	6682	-4780	-42

Reversing that sort, we can get the top 10 categories according to increase in incidents from 2008 to 2013:

Category	2008	2013	Change	Change pct
KIDNAPPING	345	514	169	49
LARCENY/THEFT	25797	35661	9864	38
DRUNKENNESS	710	954	244	34
NON-CRIMINAL	12293	15343	3050	25
WARRANTS	5798	7059	1261	22
DRIVING UNDER THE INFLUENCE	408	429	21	5
VEHICLE THEFT	6053	6209	156	3
MISSING PERSON	4335	4458	123	3
BURGLARY	5679	5847	168	3
FRAUD	2556	2579	23	1

Video

Watch me write out the query, execute it, and paste the results into a Google Spreadsheet to find the change from 2008 to 2013 in incident reports. In this demo, I don't use Pivot Tables, but just do some extra cutting-and-pasting as needed:

Preview of JOINs

Note: This is just a technical aside, for those of you who wonder if there's a way to avoid spreadsheets altogether.

We can actually do this all in a SQL query with a JOIN clause and subqueries, which are two intermediate concepts we haven't yet learned. But here's a preview of the fun that will be:

SELECT 
  ty.Category, 
  COUNT(1) AS count_2008, 
  ty.count_2013, (ty.count_2013 - COUNT(1)) AS `Change`, 
  /* the difference in count */
  ((ty.count_2013 - COUNT(1)) / COUNT(1)) AS `Change pct`
  /* the difference expressed as a ratio of year 2008 incidents*/
  FROM sfpd_incidents 
    AS tx
  INNER JOIN /* create a new table of just 2013 incidents, name it `ty` */
    ( SELECT Category, COUNT(1) AS count_2013 
        FROM sfpd_incidents 
        WHERE YEAR(Date) = 2013 
        GROUP BY Category ) 
    AS ty
    ON tx.Category = ty.Category
  WHERE YEAR(tx.Date) = 2008
  GROUP BY ty.Category
  ORDER BY `Change pct`

Which results in:

Category	count_2008	count_2013	Change	Change pct
STOLEN PROPERTY	517	12	-505	-0.9768
BRIBERY	49	4	-45	-0.9184
LOITERING	441	52	-389	-0.8821
PROSTITUTION	1663	269	-1394	-0.8382
FORGERY/COUNTERFEITING	2460	710	-1750	-0.7114
BAD CHECKS	78	26	-52	-0.6667
LIQUOR LAWS	499	227	-272	-0.5451
FAMILY OFFENSES	72	35	-37	-0.5139
EXTORTION	52	30	-22	-0.4231
DISORDERLY CONDUCT	789	460	-329	-0.4170
etc. etc.

Pondering the numbers

Looking at the change in incident reporting over 2008 to 2013 brings up a few potential angles of investigation.

An increase in larceny

Among all the categories, "LARCENY/THEFT" has the biggest jump in numbers of reports: 25,797 to 35,661, in 2008 and 2013, respectively. Let's run a query on the database to see that jump compared to the years pre-2008:

SELECT YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  WHERE Category = 'LARCENY/THEFT'
  GROUP BY yr
  ORDER BY yr

yr	count
2003	26853
2004	24822
2005	25623
2006	27349
2007	25767
2008	25797
2009	25545
2010	23888
2011	25161
2012	30119
2013	35661

Graphing that table results in:

My first guess would have been to attribute the drastic rise in thefts to the widespread adoption of smartphones. New York Mayor Michael Bloomberg blamed an apparent spike in 2012 crime to Apple lovers:

Major crime in New York City inched up this year, and Mayor Michael R. Bloomberg on Friday fingered the culprit: too many iPhones and iPads were being swiped…

“If you just took away the jump in Apple, we’d be down for the year,” said Marc La Vorgna, the mayor’s press secretary.

On the radio, Mr. Bloomberg said that Apple products appeared to be the preference for many thieves, noting that he was not including thefts of competing devices, like the Samsung Galaxy, in his count.

From 2011 to 2012, theft reports jumped from 25,161 to 30,119, so San Francisco may have experienced the same trend, though a closer look at the subcategories of theft (i.e. theft from auto, pickpocketing, grand theft, etc.) is necessary.

A decrease in narcotics

Let's turn our attention to categories of reports that have decreased and give credit where credit is due. According to our query, DRUG/NARCOTIC reports have dropped by 4,780 (42%).

Of course, a lot has happened since 2008, including a relaxation of laws and attitudes towards cannabis, in California and across the nation. Is this the main cause of the drop in DRUG/NARCOTIC reports from 2008 to 2013 in San Francisco?

The easiest way to confirm this is to group the data reports by subcategory, or the Descript column:

SELECT Descript, YEAR(Date) as yr, COUNT(*) as count
  FROM sfpd_incidents
  WHERE Category = 'DRUG/NARCOTIC'
  GROUP BY yr, Descript
  HAVING yr = 2008 OR yr = 2013
  ORDER BY yr, Descript

Using the previous strategy of pasting the results into a pivot table and creating a spreadsheet of the calculated changes, here are the top subcategories of narcotics reports, sorted by change in count:

Descript	2008	2013	Change
POSSESSION OF BASE/ROCK COCAINE	1617	393	-1224
SALE OF BASE/ROCK COCAINE	1270	245	-1025
POSSESSION OF NARCOTICS PARAPHERNALIA	2359	1353	-1006
POSSESSION OF BASE/ROCK COCAINE FOR SALE	914	284	-630
POSSESSION OF MARIJUANA	1108	764	-344
SALE OF MARIJUANA	391	133	-258
POSSESSION OF MARIJUANA FOR SALES	571	350	-221
SALE OF CONTROLLED SUBSTANCE	233	78	-155
SALE OF HEROIN	174	43	-131
POSSESSION OF CONTROLLED SUBSTANCE	404	296	-108
POSSESSION OF HEROIN FOR SALES	199	120	-79
POSSESSION OF CONTROLLED SUBSTANCE FOR SALE	251	173	-78
POSSESSION OF COCAINE	267	214	-53
SALE OF COCAINE	70	20	-50

Marijuana-related reports do drop, but not as much as cocaine-related reports.

As always, our strategy should involve flipping our perspective: if cocaine-related reports have dropped, has any other kind of report increased?

Here is the table sorted in reverse-order of change:

Descript	2008	2013	Change
POSSESSION OF METH-AMPHETAMINE	383	956	573
POSSESSION OF METH-AMPHETAMINE FOR SALE	129	266	137
TRANSPORTATION OF METH-AMPHETAMINE	24	49	25
POSSESSION OF OPIATES	19	40	21
UNDER INFLUENCE OF DRUGS IN A PUBLIC PLACE	40	53	13

Meth-related reports have increased by nearly 140%, though the change in terms of actual number of reports (~700) is significantly smaller than the change in cocaine-related reports (~3,000).

Investigating a drop in prostitution reports

One more category of reports is worth looking at: while DRUG/NARCOTICS reports dropped by 42 percent from 2008 to 2013, PROSTITUTION reports dropped by 84 percent. Like drug-related offenses, prostitution is a category of crime in which attitudes have seemingly shifted. In 2008, more than 40 percent of San Francisco residents voted in favor of decriminalizing prostitution. Though the measure ultimately failed (59 percent opposed), the political debate was heated, thanks to the measure's surprising popularity:

SAN FRANCISCO — When Proposition K was added to Tuesday’s ballot, many people likely snickered at the possibility that San Francisco might take its place alongside such prostitute-friendly havens as Amsterdam and a few rural counties in nearby Nevada.

But this week, it became readily apparent that city officials are not laughing anymore about the measure, which would effectively decriminalize the world’s oldest profession in San Francisco. At a news conference on Wednesday, Mayor Gavin Newsom and other opponents seemed genuinely worried that Proposition K might pass.

“This is not cute. This is not fanciful,” Mr. Newsom said, standing in front of the pink-on-pink facade of a closed massage parlor in the Tenderloin district. “This is a big mistake.”

Since 2008, laws on prostitution haven't changed as significantly as they have for marijuana usage. But technology has most certainly affected all parts of sex and society, both legal and illegal activity, and maybe this has impacted the prostitution business? Or perhaps this drop in reports is simply a result of a coding policy. Let's query the dataset for possible explanations.

For reference's sake, I've posted a spreadsheet of the more than 12,000 incident reports categorized as "PROSTITUTION" from 2003 to 2013.

Visualizing location of incidents

Even in cities where prostitution is illegal, the term "red light district" is used to refer where a concentration of such activity may be found. So instead of looking at bar graphs, let's use the location data in the SFPD incident reports to get an assessment of how the geography of activity has changed, if at all.

The SQL query here is simple, because we're only concerned with getting data in a form usable by a mapping application.

SELECT 
  YEAR(Date) as yr, Descript, Date, Time, Y AS latitude, X AS longitude
  FROM sfpd_incidents
  WHERE Category = "PROSTITUTION"
  HAVING yr = 2008 OR yr = 2013

Using TileMill, we can generate a quick map. Here, the incidents in 2008 are orange, the incidents in 2013 are blue:

Here are the two years side-by-side:

Ignoring my crude mapping skills, we can at least tell that the number of reports may have decreased – the visual confusion is partly because records will be geocoded to a specific corner, so many of them will sit atop the exact same coordinate – but the general location of prostitution reports hasn't changed significantly.

Per subcategory?

As in the drug reports, it's worth taking a look at the Descript field:

SELECT Descript, Year(Date) as yr, COUNT(1) as c
  FROM sfpd_incidents
  WHERE Category = 'PROSTITUTION'
    GROUP BY yr, Descript
  HAVING yr = 2008 OR yr = 2013
    ORDER BY yr, Descript

Descript	2008	2013	Change	Change pct
INMATE/KEEPER OF HOUSE OF PROSTITUTION	8	0	-8	-100
SOLICITS TO VISIT HOUSE OF PROSTITUTION	987	1	-986	-99.9
SOLICITS LEWD ACT	12	1	-11	-91.7
ENGAGING IN LEWD CONDUCT - PROSTITUTION RELATED	21	5	-16	-76.2
SOLICITS FOR ACT OF PROSTITUTION	602	171	-431	-71.6
PLACING WIFE IN HOUSE OF PROSTITUTION	3	1	-2	-66.7
PIMPING	22	37	15	68.2
PANDERING	7	22	15	214.3
INDECENT EXPOSURE - PROSTITUTION RELATED	1	31	30	3000

The majority of PROSTITUTION reports fall under SOLICITS FOR ACT OF PROSTITUTION and SOLICITS TO VISIT HOUSE OF PROSTITUTION. Both have fallen drastically in a five-year period, though the latter subcategory has all been eliminated: going from 987 reports in 2008 to just one report in 2013.

Let's make a chart

Let's visually quantify the drop in the main two subcategories of prostitution reports, across all the years from 2003 to 2013:

SELECT Descript, YEAR(Date) as yr, COUNT(1) as x
  FROM sfpd_incidents
  WHERE Category = "PROSTITUTION"  
    AND 
      Descript = "SOLICITS FOR ACT OF PROSTITUTION" 
      OR Descript = "SOLICITS TO VISIT HOUSE OF PROSTITUTION"
  GROUP BY Descript, yr
  ORDER BY Descript, yr

The following results table can be pasted into a spreadsheet and pivoted (with yr creating the rows and Descript creating the columns).

Descript	yr	x
SOLICITS FOR ACT OF PROSTITUTION	2003	1111
SOLICITS FOR ACT OF PROSTITUTION	2004	825
SOLICITS FOR ACT OF PROSTITUTION	2005	732
SOLICITS FOR ACT OF PROSTITUTION	2006	645
SOLICITS FOR ACT OF PROSTITUTION	2007	899
SOLICITS FOR ACT OF PROSTITUTION	2008	602
SOLICITS FOR ACT OF PROSTITUTION	2009	521
SOLICITS FOR ACT OF PROSTITUTION	2010	476
SOLICITS FOR ACT OF PROSTITUTION	2011	385
SOLICITS FOR ACT OF PROSTITUTION	2012	224
SOLICITS FOR ACT OF PROSTITUTION	2013	171
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2003	843
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2004	627
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2005	291
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2006	563
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2007	876
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2008	987
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2009	855
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2010	117
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2011	7
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2012	2
SOLICITS TO VISIT HOUSE OF PROSTITUTION	2013	1

The resulting area chart:

Researching the data

Never take data at face value. This interesting drop in prostitution reports could be a result of a data glitch or ignorance on my part.

Look for previously-reported facts: On April 19, 2011, the San Francisco Examiner published an article titled, San Francisco police crack down on Polk Street prostitution. One of the numerical details:

More cops are on the streets making arrests, Mannix said. During a 24-hour period last week, at least four prostitutes were arrested, including a 16-year-old runaway.

Translating the above statement into a query:

    SELECT IncidntNum, Descript, Date, Time, PdDistrict, Resolution, Location
    FROM sfpd_incidents
    WHERE Category = 'PROSTITUTION'
      AND Date BETWEEN  '2011-04-11' AND '2011-04-19'
      AND Location LIKE '%POLK%'

IncidntNum	Descript	Date	Time	PdDistrict	Resolution	Location
110296727	PANDERING	2011-04-12	01:38	NORTHERN	ARREST, BOOKED	PINE ST / POLK ST
110296727	PIMPING	2011-04-12	01:38	NORTHERN	ARREST, BOOKED	PINE ST / POLK ST
110302437	SOLICITS FOR ACT OF PROSTITUTION	2011-04-13	23:00	NORTHERN	ARREST, CITED	CALIFORNIA ST / POLK ST
110302540	SOLICITS FOR ACT OF PROSTITUTION	2011-04-14	00:05	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST
110302653	SOLICITS FOR ACT OF PROSTITUTION	2011-04-14	01:05	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST
110308758	SOLICITS FOR ACT OF PROSTITUTION	2011-04-16	00:35	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST
110308952	SOLICITS FOR ACT OF PROSTITUTION	2011-04-16	02:15	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST
110309160	SOLICITS FOR ACT OF PROSTITUTION	2011-04-16	03:40	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST
110308827	SOLICITS FOR ACT OF PROSTITUTION	2011-04-16	01:29	NORTHERN	ARREST, CITED	POLK ST / CALIFORNIA ST

There were four prostitution-related arrests on April 16, but frankly, that could just be a coincidence. Capt. Mannix doesn't specify if those four arrests all happened on Polk Street, for example.

Another story of a sweep, published Jan. 20, 2010: Prostitution sting nets 51 so far

A crackdown on street prostitution in response to complaints from Polk Street and Russian Hill residents has netted 51 arrests, but officials say the crackdown can’t last forever…Of the 51 people arrested so far in the current operation, 36 were women accused of prostitution offenses, according to Lt. Tom Cleary of the vice crimes unit. Fifteen were men paying for sex. Cleary emphasized that a majority of those arrested came from outside The City.

Using this aggregate SQL:

  SELECT SUBSTRING(Date, 1, 7) as yearmth, COUNT(1) AS count
  FROM sfpd_incidents
    WHERE Category = 'PROSTITUTION' AND pddistrict = 'NORTHERN'
      AND YEAR(Date) BETWEEN  2008 AND 2013
    GROUP BY yearmth
    order by count DESC

– we observe that the month of January 2010 seems to be a high-water mark in prostitution reports from the Northern district (which covers Polk St. and Russian Hill):

yearmth	count
2010-01	57
2008-07	47
2008-02	46
2008-06	40
2008-10	38
2008-01	36

Sociological reasons?

Should an 84% drop in prosecution reports from 2008 to 2013 be a surprise, given voter support for decriminalizing prostitution? Whatever the ambivalence of the average resident, I didn't find any direct reports of the SFPD deciding to go lenient on prostitution.

On June 16, 2010, the SF Weekly took an in-depth look at the SFPD's arguably aggressive anti-prostitution stings:

In San Francisco, where 41 percent of voters approved a measure that would have essentially decriminalized prostitution and where a sex worker is currently running for the Board of Supervisors, men are regularly arrested just for talking to undercover cops posing as streetwalkers. If a man avoids eye contact and walks on by, he's fine. But if he shows any interest, the decoy will approach him. By the time he knows she is selling sex, he's headed for trouble. After that, any expression of interest — even a sarcastic "Oops, don't have enough cash, guess I'll come back later" — is enough to land him in handcuffs…

A decade ago, almost all the men arrested in the program's stings had shown more than interest; they had agreed to sex acts for money, according to statistics from the District Attorney's office. But in recent years, as would-be johns have gotten savvy to police tactics, roughly a third of those arrested in the stings — between 100 and 200 a year — were cited simply for demonstrating, in the eyes of the police, that they had "an intent to commit prostitution." According to the California penal code, there's no "single circumstance or combination of circumstances" that proves this intent; it has to be determined case by case.

The article cites a figure from a 2009 audit of the city's budget: previously, the sting program was funded by the fees paid by johns. However, "as prostitution has shifted to the Internet, it has become more difficult for police officers to make arrests in street-level stings. Since 2005, the number of johns arrested has dropped steadily."

In 2011, the San Francisco Public Press reported on the controversial use of anti-trafficking resources to do street-level sweeps.

This past summer, San Francisco prostitution made the news again when business owners complained about activity in Union Square.

This article quotes police as being more interested the stemming the bigger issues surrounding prostitution, rather than focusing on street-level busts:

Police said they plan to clean up Union Square, but also want to focus on human trafficking and helping women who have been forced into prostitution.

So what's the explanation for the 84% drop in SFPD prostitution reports? I sent a request to SFPD's public information officer but haven't heard back. I'm still open to the possibility of a mis-query on my part, or a coding change in the SFPD CABLE system that classifies the reports differently. Either way, I hope this has given you some ideas on how to query data along a variety of dimensions and time frames to find leads for what may be interesting stories.

Now all you have to do is grind in the syntax. Good luck!

Homework

Read the SQL tutorials

Due by next class

I’ve put together a series of step-by-step tutorials on basic SQL syntax. Read through them and repeat every exercise if necessary. Don’t just copy and paste; sometimes, typing out the queries forces you to slow down and think them through.

Investigate your own name

Due by next class
Using the Social Security Administration’s database of baby names, find out how popular (or unpopular) your name has been since 1980 (or 1880, if you wish).

Download the appropriate file for your system. You may want to try the 1980 file if your computer is having trouble with the full 1880 file.

SQLite
- Names since 1980
- Names since 1880
SQL
- Names since 1980
- Names since 1880
To get some inspiration, you can check out the SSA baby names website in which they let you perform a limited search on names.

Please do this homework in a Markdown file in your private Github repo and call it, sql-baby-names.md

Using the ssa_baby_names table
1. In the year 2013, find out how many babies had your name
2. Find the highest-rank (and the year) that your name has achieved among baby names
3. Between the years 1980 and 2013, find how many babies in total have been listed by the Social Security Administration in this data.
4. Find out how many babies (roughly) have had your name from 1980 to 2013
5. Find the year that had the most male babies born
6. Find the year in which your name had the highest increase in names-per-100k-babies born
7. Find the year in which your name had the highest decrease in names-per-100k-babies born
8. Make a line chart showing how your name has changed in popularity over the years
9. Find out who in our class had the most popular baby name in 1980.
10. Find out who in our class had the most popular baby name in 2013.
11. Find out who in our class has the name that the most babies throughout U.S. history have.
Using the ssa_baby_names_by_state table

For this section, pick a adoptive state if you were not born in the U.S.
1. In your home state, find out how many babies had your name in the year that you were born.
2. In your home state, find out how many babies had your name in 2013
3. Find the state in which your name is the most popular in 2013
4. Find the state in which your name is least popular (but still registers, i.e. has a minimum of 5 babies) in 2013
5. Make a line chart with two lines:
  1. The popularity of your name in your home state
  2. The popularity of your name in California
  (If you were born in California, pick another state, like Florida.)

Course schedule

Tuesday, September 23

The singular of data is anecdote

An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Thursday, September 25

Bad big data

Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Tuesday, September 30

DIY Databases

Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Thursday, October 2

Data in the newsroom

Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Tuesday, October 7

The points of maps

Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
Thursday, October 9

The shapes of maps

A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
Tuesday, October 14

Introduction to SQL for Data Journalism

The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
Thursday, October 16

A needle in multiple haystacks

The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Tuesday, October 21

Haystacks without needles

Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen Degeneres?
Thursday, October 23

Midterm Malarkey with Military Surplus

A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
Tuesday, October 28

Campaign Cash Check

The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
Thursday, October 30

Predicting the elections

With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
- Forecasting
Tuesday, November 4

Election day (No class)

Do your on-the-ground reporting
- No class because of Election Day Coverage
Thursday, November 6

Storytelling with Data Visualization

While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
Tuesday, November 11

Dirty data, cleaned dirt cheap

One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
- OpenRefine
- Clustering
Thursday, November 13

Guest speaker: Simon Rogers

Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
Tuesday, November 18

What we say and what we do

When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Thursday, November 20

Project prep and discussion

Discussion of final projects before the Thanksgiving break.
Tuesday, November 25

Thanksgiving break

Holiday - no class
Thursday, November 27

Thanksgiving break

Holiday - no class
Tuesday, December 2

Project wrapup

Last-minute help on final projects.
Thursday, December 4

Project Show-N-Tell

In-class presentations of our final data projects.

Tuesday, October 14

Introduction to SQL for Data Journalism

Topics

Homework

SQL in the wild

SQL Hell

Bye-bye spreadsheets?

Exploring SFPD data with SQL and spreadsheets

Simple queries

Aggregate queries

Change over time

Reduce the clutter

A time for pivoting

Data reshaping

Video

Preview of JOINs

Pondering the numbers

An increase in larceny

A decrease in narcotics

Investigating a drop in prostitution reports

Visualizing location of incidents

Per subcategory?

Let's make a chart

Researching the data

Sociological reasons?

Homework

Read the SQL tutorials

Investigate your own name

SQLite

SQL

Using the ssa_baby_names table

Using the ssa_baby_names_by_state table

Course schedule

Using the `ssa_baby_names` table

Using the `ssa_baby_names_by_state` table