Week 12: Drawing the rest of the owl

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.formula.api as smf
from matplotlib import colors as mpc

More fun with sharks

We're going to revisit the shark attacks data from last week. This week, we'll focus on analyzing and visualizing this data. This will mostly serve as a grab-bag of extra tricks and tools that might be useful (but not essential) for your projects.

Why sharks? As summarized by Vox:

Consider the curious case of New Jersey in 1916: That summer, there was a string of deadly shark attacks along the Jersey Shore. As a result, Woodrow Wilson lost his home state in the presidential election. Why, you ask? Because the beachfront towns (which rely on tourism) were negatively impacted by the attacks. Though Wilson wasn’t responsible for the hungry sharks, he was the incumbent, and people vote against incumbents when things are bad. This is a story political scientists Christopher Achen and Larry Bartels tell in Democracy for Realists, in service of a sobering thesis: Voters don’t have anything like coherent preferences. Most people pay little attention to politics; when they vote, if they vote at all, they do so irrationally and for contradictory reasons.

We'll load the data (in friendly, generic CSV format this time, but see the appendix of last week's practicum for details about using pd.read_csv() with different version of CSV/TSV files.

In [2]:
sharks = pd.read_csv("shark.csv")
sharks
Out[2]:
county wilson1912 wilson1916 beach machine mayhew attack coastal
0 ATLANTIC 0.360453 0.360122 1 0 0 0 1
1 BERGEN 0.421218 0.383590 0 1 0 0 1
2 BURLINGTON 0.413043 0.425662 0 0 0 0 0
3 CAMDEN 0.393972 0.433054 0 0 1 0 1
4 CAPE MAY 0.435052 0.419388 1 0 0 0 1
5 CUMBERLAND 0.391534 0.445745 0 0 0 0 1
6 ESSEX 0.341719 0.389691 0 1 1 0 1
7 GLOUCESTER 0.404008 0.412017 0 0 0 0 1
8 HUDSON 0.551471 0.512270 0 1 1 0 1
9 HUNTERDON 0.544343 0.566937 0 0 0 0 0
10 MERCER 0.381743 0.427692 0 0 1 0 0
11 MIDDLESEX 0.455010 0.456853 0 0 1 0 1
12 MONMOUTH 0.495918 0.479798 1 0 0 1 1
13 MORRIS 0.420335 0.443532 0 0 0 0 0
14 OCEAN 0.384458 0.380182 1 0 0 1 1
15 PASSAIC 0.388158 0.415433 0 0 1 0 0
16 SALEM 0.463515 0.451020 0 0 0 0 1
17 SOMERSET 0.432790 0.436805 0 0 0 0 0
18 SUSSEX 0.543500 0.556691 0 0 0 0 0
19 UNION 0.411765 0.382046 0 1 0 0 1
20 WARREN 0.576963 0.619147 0 0 0 0 0

Our dataset has 21 rows (counties), and the following columns:

  • county: the name of a county in New Jersey
  • wilson1912: Woodrow Wilson’s (three-party) share of the vote in 1912
  • wilson1916: Woodrow Wilson’s (two-party) share of the vote in 1916
  • beach: does the county have substantial beach-related tourism?
  • machine: were the politics of this county run by a political machine?
  • mayhew: an alternative specification of machine
  • attack: was there a shark attack in this county?
  • coastal: is the county located on the coast?

Additionally, we'll create another column from the wilson1912 and wilson1916 vote shares. Note that with vectorized operations, we don't need a loop, we can just subtract the two pandas Series directly:

In [3]:
sharks['change'] = sharks['wilson1912'] - sharks['wilson1916']

Regression modeling with statsmodels

In some cases, researchers are interested in the effect of some (explanatory) variable on another (outcome) variable, conditional on other variables (covariates). For example, what is the effect of being a beach county on support for Woodrow Wilson in 1916, conditional on whether or not the county was part of a political machine and the county's 1912 vote share for Wilson? This is the model that Achen and Bartels used when they concluded that shark attacks negatively affected Wilson's reelection prospects.

We saw how to use linear regression in DS 2000 using scikit-learn. sklearn provides a useful interface for machine learning models; in cases where you need basic statistical models like linear or logistic regression, the statsmodels may provide more useful output (and output that looks more familiar to users of, say, R or Stata, like might be used in an econometrics class). sklearn also sometimes sets defaults that, while reasonable to the machine learning practitioner, are unexpected to the social scientist (notably in the case of logistic regression).

statsmodels provides two interfaces for specifying a model formula. I'm going to focus on the one provided by statsmodels.formula because it's the same as R (portability across languages is good), and because I find the other one to be nearly incomprehensible. This involves defining a formula as a string, then using that formula in combination with a Pandas dataframe. To regress an outcome (Y) on our explanatory variable (D) and our covariates (X1, X2, ...), we define a formula like

"Y ~ D + X1 + X2"

So in terms of the shark attacks example above, given a pandas DataFrame like the following:

In [4]:
sharks.head()
Out[4]:
county wilson1912 wilson1916 beach machine mayhew attack coastal change
0 ATLANTIC 0.360453 0.360122 1 0 0 0 1 0.000331
1 BERGEN 0.421218 0.383590 0 1 0 0 1 0.037629
2 BURLINGTON 0.413043 0.425662 0 0 0 0 0 -0.012618
3 CAMDEN 0.393972 0.433054 0 0 1 0 1 -0.039082
4 CAPE MAY 0.435052 0.419388 1 0 0 0 1 0.015664

We can write our regression formula as

wilson1916 ~ beach + wilson1912 + machine

After coming up with a model specification, we can fit the model like so:

In [5]:
m = smf.ols("wilson1916 ~ beach +  wilson1912 + machine", data = sharks).fit()
print(m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             wilson1916   R-squared:                       0.904
Model:                            OLS   Adj. R-squared:                  0.887
Method:                 Least Squares   F-statistic:                     53.58
Date:                Tue, 19 Nov 2019   Prob (F-statistic):           7.11e-09
Time:                        20:25:00   Log-Likelihood:                 52.196
No. Observations:                  21   AIC:                            -96.39
Df Residuals:                      17   BIC:                            -92.21
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0815      0.034      2.366      0.030       0.009       0.154
beach         -0.0345      0.013     -2.658      0.017      -0.062      -0.007
wilson1912     0.8662      0.076     11.425      0.000       0.706       1.026
machine       -0.0384      0.013     -2.983      0.008      -0.065      -0.011
==============================================================================
Omnibus:                        3.390   Durbin-Watson:                   1.817
Prob(Omnibus):                  0.184   Jarque-Bera (JB):                2.055
Skew:                           0.760   Prob(JB):                        0.358
Kurtosis:                       3.197   Cond. No.                         19.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model summary gives us a lot of output, most of which we won't focus on. In the middle, we have a table of model terms and their associated coefficients and standard errors. So, interpreting our explanatory variable (beach), we find that a unit change in beach (i.e., comparing a non-beach county to a beach county) is associated with a change in Wilson's 1916 vote share of -0.034. There is uncertainty around this estimate, but we are 95% confident the "true" (population) estimate is between -0.062 and -0.007; i.e., we are 95% confident being a beach county is associated with some sort of decrease in Wilson's vote share, conditional on Wilson's 1912 vote share and the presence of political machines.

We also observe in the top right the R-squared value, which we briefly touched on in DS 2000. We see that this model explains a lot of the underlying variation in the data. (This is unsurprising in the current context; most counties that vote Democrat continue to do so in the future.)

Note that not all stories require regression models. Indeed, a key contention in the literature is that regression is inappropriate for telling this story: there are few observations, the observations probably aren't indepedent, etc. Futhermore, if you want to convey answers to a general audience, telling a story using data visualization may be more appropriate. For example, considering the following visualization:

In [6]:
fig3, ax3 = plt.subplots(figsize=(12, 8))

beach_counties = sharks[sharks.beach==1]
nonbeach_counties = sharks[sharks.beach==0]


ax3.scatter(y=beach_counties['county'].index, x=beach_counties['wilson1912'], marker='o', color='crimson', label='1912 (beach)', s = 70)
ax3.scatter(y=beach_counties['county'].index, x=beach_counties['wilson1916'], marker='o', color='crimson', label='1916 (beach)', facecolor='none', s = 70)

ax3.scatter(y=nonbeach_counties['county'].index, x=nonbeach_counties['wilson1912'], marker='o', color='darkcyan', label='1912 (not beach)', s = 70)
ax3.scatter(y=nonbeach_counties['county'].index, x=nonbeach_counties['wilson1916'], marker='o', color='darkcyan', label='1916 (not beach)', facecolor='none', s = 70)

for idx, row in beach_counties.iterrows():
    ax3.arrow(y=idx, x=row['wilson1912'], dy=0, dx=-row['change'], length_includes_head=True, color='crimson', width=0, linewidth=1.25)
for idx, row in nonbeach_counties.iterrows():
    ax3.arrow(y=idx, x=row['wilson1912'], dy=0, dx=-row['change'], length_includes_head=True, color='darkcyan', width=0, linewidth=1.25)
    
ax3.set_yticks(sharks['county'].index)
ax3.set_yticklabels(sharks['county'])
ax3.set_ylabel("County")
ax3.set_xlabel("Wilson vote share")
ax3.grid(alpha=0.5)
ax3.set_title("Shift in vote shares by county")

ax3.legend();

From this visualization we can see that, although the shifts in vote shares varied a bit, in beach counties they were all negative, whereas for non-beach counties some increased and some decreased.

A visualization like this only works because we have few units under consideration (or because we are aggregating to only a few units.) With a lot of units, we would need to consider other techniques. But this example shows how we can use color and shapes to relate multiple variables to tell a story.

Mapping data

Many projects involve spatial data in some sense. A common technique in visualizing spatial data is the choropleth map, which represents the value of some variable on the map. If you have GIS/shapefile data, making these plots is not too painful with geopandas.

NOTE: this code may not run; you need to have geopandas and descartes installed.

In [7]:
%conda install geopandas
%conda install descartes
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.

We'll consider the same shark attack data, using publicly available shapefile data for New Jersey.

In [8]:
import geopandas as gp

Geopandas behaves more or less like Pandas, but you need multiple files (.shp, .shx, etc.) to be present for gp.read_file() to work.

In [9]:
nj = gp.read_file("NJ/New_Jersey_Counties.shp")
nj.head()
Out[9]:
OBJECTID COUNTY COUNTY_LAB CO GNIS_NAME GNIS FIPSSTCO FIPSCO ACRES SQ_MILES ... POP1980 POPDEN2010 POPDEN2000 POPDEN1990 POPDEN1980 REGION GLOBALID Shape_Leng Shape_Area geometry
0 1 ATLANTIC Atlantic County ATL County of Atlantic 882270 34001 1 390815.404215 610.649069 ... 204615 450 414 451 335 COASTAL {FF3231FE-B82C-4DCB-A36E-199C86B99D48} 650228.153906 1.702385e+10 POLYGON ((443056.010 311529.770, 443068.440 31...
1 2 BERGEN Bergen County BER County of Bergen 882271 34003 3 153490.275182 239.828555 ... 849843 3774 3686 3459 3544 NORTHEASTERN {08D52232-487C-40B1-AF3A-94DE390C8BDD} 443326.122021 6.686010e+09 POLYGON ((656200.963 783614.366, 656141.127 78...
2 3 BURLINGTON Burlington County BUR County of Burlington 882272 34005 5 525002.683590 820.316693 ... 362542 547 516 482 442 SOUTHERN {5DB6BAA6-5945-4004-8837-ECE826970808} 788906.717476 2.286903e+10 POLYGON ((436626.690 491361.780, 436701.190 49...
3 4 CAMDEN Camden County CAM County of Camden 882273 34007 7 145650.507808 227.578918 ... 471650 2257 2236 2340 2073 SOUTHERN {944EE24D-97B4-4666-9963-FC6BAE3D9334} 495010.034901 6.344511e+09 POLYGON ((342763.990 423475.830, 342804.100 42...
4 5 CAPE MAY Cape May County CAP County of Cape May 882274 34009 9 183126.574240 286.135272 ... 82266 340 358 332 288 COASTAL {72F314BF-ED7C-4C51-A7A0-E841972BC422} 514564.858771 7.976962e+09 POLYGON ((390389.730 177065.130, 390410.170 17...

5 rows × 23 columns

nj is a Pandas Dataframe, like other data frames, just filled with incomprehensible (to me, at least) GIS nonsense. But, because it's just a Dataframe, it can be joined with other datasets without too much trouble. First, to make things easier, we'll rename the county variable in our sharks dataset to COUNTY to make the join a little easier. Note that both of these datasets have the county names in the same format; i.e., they both have "ATLANTIC" rather than "Atlantic" or "Atlantic County".

To rename a column, we can use the df.rename() method, which takes a dictionary where the keys are the original names of the columns, and the values are the new names we want. We also need to pass axis=1 to indicate that we want to rename the columns rather than the rows.

In [10]:
sharks = sharks.rename({"county": "COUNTY"}, axis=1)
sharks.head()
Out[10]:
COUNTY wilson1912 wilson1916 beach machine mayhew attack coastal change
0 ATLANTIC 0.360453 0.360122 1 0 0 0 1 0.000331
1 BERGEN 0.421218 0.383590 0 1 0 0 1 0.037629
2 BURLINGTON 0.413043 0.425662 0 0 0 0 0 -0.012618
3 CAMDEN 0.393972 0.433054 0 0 1 0 1 -0.039082
4 CAPE MAY 0.435052 0.419388 1 0 0 0 1 0.015664

Now we can merge these datasets using the df.merge() method. The on parameter specifies the name of the column we want to use for the merge.

In [11]:
merged_data = nj.merge(sharks, on='COUNTY')
merged_data.head()
Out[11]:
OBJECTID COUNTY COUNTY_LAB CO GNIS_NAME GNIS FIPSSTCO FIPSCO ACRES SQ_MILES ... Shape_Area geometry wilson1912 wilson1916 beach machine mayhew attack coastal change
0 1 ATLANTIC Atlantic County ATL County of Atlantic 882270 34001 1 390815.404215 610.649069 ... 1.702385e+10 POLYGON ((443056.010 311529.770, 443068.440 31... 0.360453 0.360122 1 0 0 0 1 0.000331
1 2 BERGEN Bergen County BER County of Bergen 882271 34003 3 153490.275182 239.828555 ... 6.686010e+09 POLYGON ((656200.963 783614.366, 656141.127 78... 0.421218 0.383590 0 1 0 0 1 0.037629
2 3 BURLINGTON Burlington County BUR County of Burlington 882272 34005 5 525002.683590 820.316693 ... 2.286903e+10 POLYGON ((436626.690 491361.780, 436701.190 49... 0.413043 0.425662 0 0 0 0 0 -0.012618
3 4 CAMDEN Camden County CAM County of Camden 882273 34007 7 145650.507808 227.578918 ... 6.344511e+09 POLYGON ((342763.990 423475.830, 342804.100 42... 0.393972 0.433054 0 0 1 0 1 -0.039082
4 5 CAPE MAY Cape May County CAP County of Cape May 882274 34009 9 183126.574240 286.135272 ... 7.976962e+09 POLYGON ((390389.730 177065.130, 390410.170 17... 0.435052 0.419388 1 0 0 0 1 0.015664

5 rows × 31 columns

Once we have this data merged, we can use df.plot() to make a choropleth map. The parameter legend=True gives us a colorbar to aid interpretation.

In [12]:
fig, ax = plt.subplots(figsize = (12, 8))
merged_data.plot(column='wilson1912', ax=ax, legend=True)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5b3bc6b00>
In [13]:
fig, ax = plt.subplots(figsize = (12, 8))
merged_data.plot(column='coastal', ax=ax, legend=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5b3d57320>

Using color in data visualization

It's nice that we can make these plots pretty easily, but they also point to an issue with using default values, especially around color, in data visualization. df.plot() is assuming that the data being visualized is a continuously valued variable where we're just indicating more or less of something. In neither of the above cases is this really true; in the Wilson vote share example, it does indicate more Wilson vote share, but it also matters whether or not it's close to 50\%. In the coastal counties case, it's given us appropriately diverging values for 0 and 1, but they're kind of ugly.

We want to use color to better tell the stories here. For this, we'll choose new colormaps for our data by varying the cmap parameter. A list of colormaps available in matplotlib by default can be found here. A key concept here is the following trichotomous classification of colormaps:

  • Sequential colormaps are for when you want to indicate how much of something there is (e.g., population).
  • Diverging colormaps are for when some intermediate value is of importance, and you want to indicate how much more or less there is something relative to that intermediate value (e.g., election results, where 50\% would mean a tie, less would mean Candidate A wins, and more would mean Candidate B wins).
  • Qualitative colormaps are for categorical data (e.g., binary variables like coastal versus non-coastal).

I've included some of the graphical examples of colormaps below. (For the love of God, don't use Jet even for your sequential colormaps.)

We want to visualize our binary variables using qualitative colormaps and our election-related continuous variables using diverging colormaps. We also want to specify the midpoint for our diverging colormaps. To do that, we'll create a DivergingNorm, which maps our values to colors on the colormap. We know that we want vcenter, the mid-point, to be 0.5, but we also need to figure out what appropriate minimum and maximum values should be. These should probably be equally far from the midpoint, so let's look at the mins and maxs of our data for guidance.

In [14]:
print(min(sharks['wilson1912']), max(sharks['wilson1912']) , min(sharks['wilson1916']), max(sharks['wilson1916']) )
0.3417190611362457 0.5769633054733276 0.36012208461761475 0.6191467046737671

So, for plotting vote shares, we'll use vmin=0.3 and vmax=0.7. We could use vmin=0 and vmax=1, which are the constraints on the data (i.e., you can't win more that 100\% of the vote), but these values will squash all the colors close the midpoint, and create a misleading impression; winning 70\% of a county is a very decisive victory.

So, making a very large plot summarizing lots of these variables, we can combine our new geopandas-based chloropleth maps with other things we have learned about visualizing data with matplotlib. We can make some subplots, plot each map on a different axis, remove the (unhelpful in this case) axis labels, and set titles. In our case, it's clear from context which value is associate with the gray color and which is associated with the green color; if we didn't, we would need to come up with some way of conveying that.

In [15]:
fig, axes = plt.subplots(figsize=(16, 12), nrows=2, ncols=4)
((ax1, ax2, ax3, ax4), (ax5, ax6, ax7, ax8)) = axes

norm1 = mpc.DivergingNorm(vmin=0.3, vcenter=0.5, vmax=0.7)
norm2 = mpc.DivergingNorm(vmin=-0.1, vcenter=0, vmax=0.1)

merged_data.plot(column='wilson1912', ax=ax1, cmap = 'bwr', legend=True, norm=norm1, edgecolor='black', linewidth=0.3)
merged_data.plot(column='wilson1916', ax=ax2, cmap = 'bwr', legend=True, norm=norm1, edgecolor='black', linewidth=0.3)
merged_data.plot(column='change', ax=ax3, cmap = 'bwr', legend=True, norm=norm2, edgecolor='black', linewidth=0.3)

merged_data.plot(column='attack', ax=ax4, cmap = 'Accent', edgecolor='black', linewidth=0.3)
merged_data.plot(column='beach', ax=ax5, cmap = 'Accent', edgecolor='black', linewidth=0.3)
merged_data.plot(column='coastal', ax=ax6, cmap = 'Accent', edgecolor='black', linewidth=0.3)
merged_data.plot(column='machine', ax=ax7, cmap = 'Accent', edgecolor='black', linewidth=0.3)
merged_data.plot(column='mayhew', ax=ax8, cmap='Accent', edgecolor='black', linewidth=0.3)

ax1.set_title("Wilson 1912 vote share")
ax2.set_title("Wilson 1916 vote share")
ax3.set_title("Change in Wilson vote share")
ax4.set_title("Locations of shark attacks")
ax5.set_title("Beach counties")
ax6.set_title("Coastal counties")
ax7.set_title("Political machines")
ax8.set_title("Political machines (alt. specification)")
#fig.colorbar(plt1, ax=ax1)
for row in axes:
    for ax in row:
        ax.set_xticks([])
        ax.set_yticks([])

Appendix: Colormaps

Appendix II: The Discourse about Shark Attacks

Links are clickable, going to the source documents.

In [ ]: