This is a quick and dirty introduction to data analysis in R. The goals are to:
It is not anything resembling a course on statistics and data science.
I am a petty dictator, and have added some goals about things I care about:
Consider the curious case of New Jersey in 1916: That summer, there was a string of deadly shark attacks along the Jersey Shore. As a result, Woodrow Wilson lost his home state in the presidential election.
Why, you ask? Because the beachfront towns (which rely on tourism) were negatively impacted by the attacks. Though Wilson wasn’t responsible for the hungry sharks, he was the incumbent, and people vote against incumbents when things are bad.
I gave a similar but drier tutorial last year; a copy of that document is here. If you have questions about R's bizarre and terrible type system, or why there are so many `<-`s littered throughout the code, check it out.
You’re going to want the following libraries (hopefully already installed):
The `tidyverse` meta-package provides all the needed libraries, except for `devtools`.
RStudio is the IDE for R. Accept no substitutes.
| | R | Python |
|---|---|---|
| Assignment | `<-` | `=` |
| Import (i) | `library(x)` | `from x import *` |
| Import (ii) | N/A | `import x` |
| Calling Libraries (i) | `library(lib); func()` | `from lib import func; func()` |
| Calling Libraries (ii) | `lib::func()` | `import lib; lib.func()` |
| Concatenation | `c(x, y)` | `[x] + [y]` |
| String Concatenation | `paste('x', 'y')` | `'x' + ' y'` |
| Pipes | `x %>% f(y)` | `f(x, y)` |
| Conditionals | `ifelse(cond, 1, 2)` | `if cond: 1; else: 2` |
| Package Installation | `install.packages(x)` | `pip install x` |
| Vectorized Math | `c(1, 2) + c(1, 2)` | `np.array([1,2]) + np.array([1,2])` |
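To ground the table, here is a minimal sketch of the R column's idioms (the values are made up; `%>%` comes from `magrittr`, which the `tidyverse` loads for you):

```r
library(magrittr) # provides the %>% pipe (also loaded by the tidyverse)

x <- c(1, 2, 3)                       # assignment with <-, concatenation with c()
y <- x + c(10, 20, 30)                # vectorized math: element-wise addition
s <- paste("shark", "attack")         # string concatenation (space-separated by default)
z <- x %>% sum()                      # pipe: equivalent to sum(x)
flag <- ifelse(z > 5, "big", "small") # vectorized conditional

y    # 11 22 33
s    # "shark attack"
z    # 6
flag # "big"
```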
The data is taken from Fowler and Hall’s critique of an earlier paper on shark attacks. I’ve converted their Stata files to CSV for convenience; they’re on the tutorial website.
#### `readr`

To load a data file, `readr` provides a consistent interface across formats (and if `readr` can't load it, try `haven`). Thus, we'll use the library's `read_csv` function instead of base R's `read.csv`.
(Note that `.` is valid in function and variable names in R; that is, `read.csv` is not a method of a class `read`.)
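A tiny, made-up example to drive that home: the dot carries no special meaning, so these are ordinary names.

```r
my.value <- 3                # an ordinary variable name containing a dot
add.one <- function(x) x + 1 # an ordinary function name containing a dot

add.one(my.value) # 4
```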
# sharks <- read_csv("~/git/r_tutorial_f19/resources/shark.csv")
sharks <- read_csv("https://sdmccabe.github.io/r_tutorial_f19/resources/shark.csv")
## Parsed with column specification:
## cols(
## county = col_character(),
## wilson1912 = col_double(),
## wilson1916 = col_double(),
## beach = col_double(),
## machine = col_double(),
## mayhew = col_double(),
## attack = col_double(),
## coastal = col_double()
## )
Note that `read_csv` treats URLs and file paths the same when reading in a file. Storing a local copy of a file is almost always preferable, but using URLs can be convenient in some cases (like this one).
dim(sharks)
head(sharks)
## [1] 21 8
## # A tibble: 6 x 8
## county wilson1912 wilson1916 beach machine mayhew attack coastal
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATLANTIC 0.360 0.360 1 0 0 0 1
## 2 BERGEN 0.421 0.384 0 1 0 0 1
## 3 BURLINGTON 0.413 0.426 0 0 0 0 0
## 4 CAMDEN 0.394 0.433 0 0 1 0 1
## 5 CAPE MAY 0.435 0.419 1 0 0 0 1
## 6 CUMBERLAND 0.392 0.446 0 0 0 0 1
So, we have a data frame containing 21 observations of 8 attributes.
#### The `$` operator

Columns of a data frame are indexed with the `$` operator, so we can pull out a single column like so:
sum(sharks$beach)
## [1] 4
There are four beach counties in New Jersey.
We can also use numeric indices to pull out rows or columns:
head(sharks[,4]) # column indexing
## # A tibble: 6 x 1
## beach
## <dbl>
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 0
sharks[1:4,] # row indexing, plus showing 1:4 to generate a
# sequence from 1 to 4
## # A tibble: 4 x 8
## county wilson1912 wilson1916 beach machine mayhew attack coastal
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATLANTIC 0.360 0.360 1 0 0 0 1
## 2 BERGEN 0.421 0.384 0 1 0 0 1
## 3 BURLINGTON 0.413 0.426 0 0 0 0 0
## 4 CAMDEN 0.394 0.433 0 0 1 0 1
This can be handy for quick and dirty operations but is less explicit than the `$` operator. Note also that R is one-indexed.
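A third option, not shown above, is logical (boolean) indexing, which reads almost like a query. A toy sketch with a made-up data frame:

```r
# stringsAsFactors = FALSE keeps county as character on older R versions
df <- data.frame(county = c("ATLANTIC", "BERGEN", "CAPE MAY"),
                 beach = c(1, 0, 1),
                 stringsAsFactors = FALSE)

df[df$beach == 1, ]      # rows where beach == 1
df$county[df$beach == 1] # just the county names: "ATLANTIC" "CAPE MAY"
which(df$beach == 1)     # the matching row indices: 1 3
```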
R has most univariate and bivariate summary statistics built in, so they can be accessed rather simply:
mean(sharks$wilson1912)
cor(sharks$wilson1912, sharks$wilson1916)
## [1] 0.4386176
## [1] 0.9121978
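The same pattern holds for other common summaries; a toy sketch with made-up numbers:

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(v)   # 5
median(v) # 4.5
var(v)    # 32/7, about 4.571 (sample variance, n - 1 denominator)
sd(v)     # square root of the above, about 2.138
range(v)  # 2 9
```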
#### The `summary()` function

A particularly useful function here is `summary()`, which can be applied across an entire data frame:
summary(sharks)
## county wilson1912 wilson1916 beach
## Length:21 Min. :0.3417 Min. :0.3601 Min. :0.0000
## Class :character 1st Qu.:0.3915 1st Qu.:0.4120 1st Qu.:0.0000
## Mode :character Median :0.4203 Median :0.4331 Median :0.0000
## Mean :0.4386 Mean :0.4475 Mean :0.1905
## 3rd Qu.:0.4635 3rd Qu.:0.4569 3rd Qu.:0.0000
## Max. :0.5770 Max. :0.6191 Max. :1.0000
## machine mayhew attack coastal
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :1.000
## Mean :0.1905 Mean :0.2857 Mean :0.09524 Mean :0.619
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.000
Summary statistics are helpful but are no substitute for a visual understanding of a dataset. To illustrate, consider Anscombe’s Quartet.
anscombe <- datasets::anscombe # anscombe should already be in your namespace
# but this makes it explicit
# note that :: pulls something from a library
# compare to, say, the . in np.zeros()
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
round(map_dbl(anscombe[,5:8], mean), 3)
round(map_dbl(anscombe[,5:8], sd), 3)
round(map2_dbl(anscombe[,1:4], anscombe[,5:8], cor), 3)
## y1 y2 y3 y4
## 7.501 7.501 7.500 7.501
## y1 y2 y3 y4
## 2.032 2.032 2.030 2.031
## x1 x2 x3 x4
## 0.816 0.816 0.816 0.817
These summary statistics are quite similar…
a1 <- qplot(anscombe$x1, anscombe$y1)
a2 <- qplot(anscombe$x2, anscombe$y2)
a3 <- qplot(anscombe$x3, anscombe$y3)
a4 <- qplot(anscombe$x4, anscombe$y4)
gridExtra::grid.arrange(a1, a2, a3, a4)
… but the underlying data are quite different. Data visualization is your friend, and one of R’s strengths.
#### `ggplot` versus “base R”

I'm deliberately omitting discussion of so-called “base R” plotting (although it is frequently useful) in favor of emphasizing `ggplot2`'s feature set. The analogy is slightly inapt, but think of `ggplot2` as playing a similar role relative to base R graphics as `seaborn` plays to `matplotlib`.
I provided some discussion of base R plotting in 2017’s tutorial, so, again, check it out here.
#### The `ggplot()` call

Although simple plots—scatter plots and histograms, mostly—can be generated with the `qplot` function, most of the useful visualization features require wrapping your mind around `ggplot` and its associated functions. The goal of `ggplot2` is to implement a consistent grammar of graphics, and to that end most visualizations will have the same core elements:

- the data to be visualized
- a mapping of variables to aesthetics (`aes`)
- one or more geometric objects (`geom`s) that draw the mapped data
This sounds intimidating at first, but is relatively straightforward in practice. We already have the data, so once we figure out our mappings and geoms, we can make some plots.
What about a histogram of Wilson’s 1912 vote share?
ggplot(data = sharks, mapping = aes(x = wilson1912)) +
geom_histogram()
Not so bad. Similarly to how, with `matplotlib`, we might build up a plot with multiple method calls, here we chain function calls with the `+` operator. So we can make the histogram slightly nicer:
ggplot(sharks, aes(x = wilson1912)) +
geom_histogram(binwidth = 0.01) +
theme_bw() +
labs(x = "Wilson 1912 Vote Share",
y = "Count",
title = "A Nicer Histogram")
#### `geom`s

We can also use `geom`s like `geom_point` or `geom_smooth` to plot a bivariate relationship, like that between Wilson's 1912 and 1916 vote shares. How consistent are election returns from cycle to cycle?
ggplot(sharks, aes(x = wilson1912, y = wilson1916)) +
geom_point() +
geom_smooth(method = "lm") +
theme_bw() +
labs(x = "Wilson 1912",
y = "Wilson 1916")
Here we are using two `geom`s in one plot; there's no limitation (except pragmatic ones) on the number you can use. So we used `geom_point` to draw a scatter plot and then `geom_smooth` to fit a straight line summarizing those data points (the additional parameter `method = "lm"` indicates to use a linear model instead of the potentially nonlinear method used by default).
#### `geom`s II

Another way to look at this is to go county-by-county and compare.
ggplot(sharks, aes(y = county, color = (beach == 1))) +
geom_point(aes(x = wilson1912)) + # solid point
geom_point(aes(x = wilson1916), shape = 1) + # hollow point
labs(x = "Vote Share",
y = "County",
color = "Beach County?") +
theme_minimal()
Aside from showing how easy it is to build up useful visualizations, this also starts to give us some substantive insight: all of the beach counties saw either no change or a decrease in Wilson vote share; no beach county saw a meaningful increase in Wilson support. That's interesting, at least, and some (weak) evidence for the claim that voters punished Wilson for the shark attacks.
What is a statistical model?
Most statistical modeling functions in R use some variant of formula syntax:
Y ~ X
Often, `X` will include multiple variables of interest, and potentially transformations of those variables or interactions between variables.
Y ~ X1 + X2 + X1:X2 + I(X1^2) # transformations are nested in I()
Y ~ X1*X2 + I(X1^2) # this expresses the same equation
Y ~ X1*X2 + I(X1^2) - 1 # as above, but drop the intercept term
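If you want to see exactly which columns a formula produces, `model.matrix` expands it into a design matrix; a toy sketch with made-up data:

```r
toy <- data.frame(X1 = c(1, 2, 3), X2 = c(0, 1, 0))

# X1*X2 expands to the main effects plus their interaction,
# and I(X1^2) enters as its own column
colnames(model.matrix(~ X1 * X2 + I(X1^2), data = toy))
# "(Intercept)" "X1" "X2" "I(X1^2)" "X1:X2"
```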
So, applying this to our example, we might want to regress Wilson’s 1916 vote share on his 1912 vote share and see how much of the variance is explained by a simple linear model.
m <- lm(wilson1916 ~ wilson1912, data = sharks)
summary(m)
##
## Call:
## lm(formula = wilson1916 ~ wilson1912, data = sharks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.048114 -0.019104 -0.004053 0.023390 0.045968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04908 0.04151 1.182 0.252
## wilson1912 0.90838 0.09361 9.704 8.52e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02807 on 19 degrees of freedom
## Multiple R-squared: 0.8321, Adjusted R-squared: 0.8233
## F-statistic: 94.17 on 1 and 19 DF, p-value: 8.522e-09
stargazer(m, header = F, font.size = 'scriptsize')
broom::tidy(m) %>%
ggplot(aes(x = term)) +
geom_point(aes(y = estimate)) +
geom_errorbar(aes(ymin = estimate - 1.96 * std.error,
ymax = estimate + 1.96 * std.error),
width = 0.1) +
geom_hline(aes(yintercept = 0),
linetype = 'dashed') +
coord_flip() +
theme_minimal()
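If you just want the numbers rather than a plot, base R accessors work on any fitted `lm`; note that `confint` gives t-based intervals, which are slightly wider in small samples than the `estimate ± 1.96 × std.error` approximation used above. A sketch on a made-up model:

```r
toy <- data.frame(x = 1:10,
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1))
m_toy <- lm(y ~ x, data = toy)

coef(m_toy)    # point estimates: intercept and slope
confint(m_toy) # 95% confidence intervals (t-distribution based)
fitted(m_toy)  # fitted values, one per observation
resid(m_toy)   # residuals
```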
So, with all this in hand, we can start doing the sort of research that gets our name in Vox.
Ask yourself:
ab_sharks <- sharks[-7,]
dim(ab_sharks)
ab_model <- lm(wilson1916 ~ wilson1912 + machine + beach, data = ab_sharks)
summary(ab_model)
## [1] 20 8
##
## Call:
## lm(formula = wilson1916 ~ wilson1912 + machine + beach, data = ab_sharks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.033250 -0.006989 0.000657 0.005740 0.029520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.046093 0.027895 1.652 0.11794
## wilson1912 0.945336 0.061527 15.365 5.33e-11 ***
## machine -0.056383 0.010939 -5.154 9.60e-05 ***
## beach -0.032288 0.009882 -3.268 0.00484 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01702 on 16 degrees of freedom
## Multiple R-squared: 0.9459, Adjusted R-squared: 0.9357
## F-statistic: 93.21 on 3 and 16 DF, p-value: 2.397e-10
Shark attacks were a big story on Political Science Twitter in summer 2018.
The RStudio website has some terrific cheat sheets that I encourage everyone to bookmark (especially the data wrangling one, which I have to reference every time I use `tidyr`):
If you’re doing large-scale work in R, especially involving package development, here are some useful sources on development and R internals:
Some good resources on structuring and approaching a data analysis project:
Katya Ognyanova, one of David's former postdocs, has a good introduction to network analysis in R with `igraph`.
library(sf)
nj <- read_sf("~/Downloads/shape/New_Jersey_Counties.shp")
nj <- nj %>% left_join(sharks, by = c("COUNTY" = "county"))
p1 <- ggplot() +
geom_sf(data = nj, aes(fill = wilson1912)) +
coord_sf(ndiscr = 0) +
scale_fill_continuous() +
labs(title = "wilson1912") +
theme(legend.position = "none")
p2 <- ggplot() +
geom_sf(data = nj, aes(fill = wilson1916)) +
coord_sf(ndiscr = 0) +
scale_fill_continuous() +
labs(title = "wilson1916") +
theme(legend.position = "none")
p3 <- ggplot() +
geom_sf(data = nj, aes(fill = wilson1916 - wilson1912)) +
coord_sf(ndiscr = 0) +
scale_fill_gradient2(low = "firebrick",
mid = "lightgray",
high = "midnightblue") +
labs(title = "change") +
theme(legend.position = "none")
p4 <- ggplot() +
geom_sf(data = nj, aes(fill = as.factor(attack))) +
coord_sf(ndiscr = 0) +
scale_fill_discrete() +
labs(title = "attack") +
theme(legend.position = "none")
p5 <- ggplot() +
geom_sf(data = nj, aes(fill = as.factor(beach))) +
coord_sf(ndiscr = 0) +
scale_fill_discrete() +
labs(title = "beach") +
theme(legend.position = "none")
p6 <- ggplot() +
geom_sf(data = nj, aes(fill = as.factor(coastal))) +
coord_sf(ndiscr = 0) +
scale_fill_discrete() +
labs(title = "coastal") +
theme(legend.position = "none")
p7 <- ggplot() +
geom_sf(data = nj, aes(fill = as.factor(machine))) +
coord_sf(ndiscr = 0) +
scale_fill_discrete() +
labs(title = "machine") +
theme(legend.position = "none")
p8 <- ggplot() +
geom_sf(data = nj, aes(fill = as.factor(mayhew))) +
coord_sf(ndiscr = 0) +
scale_fill_discrete() +
labs(title = "mayhew") +
theme(legend.position = "none")
gridExtra::grid.arrange(p1,p2,p3,p4,
p5,p6,p7,p8,
ncol = 4)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
py_sharks = pd.read_csv("https://sdmccabe.github.io/r_tutorial_f19/resources/shark.csv")
py_ab_sharks = py_sharks.drop(py_sharks.index[6])
py_m = smf.ols("wilson1916 ~ wilson1912 + machine + beach", data = py_ab_sharks).fit()
print(py_m.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wilson1916 R-squared: 0.946
## Model: OLS Adj. R-squared: 0.936
## Method: Least Squares F-statistic: 93.21
## Date: Wed, 04 Sep 2019 Prob (F-statistic): 2.40e-10
## Time: 21:52:04 Log-Likelihood: 55.320
## No. Observations: 20 AIC: -102.6
## Df Residuals: 16 BIC: -98.66
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 0.0461 0.028 1.652 0.118 -0.013 0.105
## wilson1912 0.9453 0.062 15.365 0.000 0.815 1.076
## machine -0.0564 0.011 -5.154 0.000 -0.080 -0.033
## beach -0.0323 0.010 -3.268 0.005 -0.053 -0.011
## ==============================================================================
## Omnibus: 0.266 Durbin-Watson: 2.007
## Prob(Omnibus): 0.875 Jarque-Bera (JB): 0.013
## Skew: 0.046 Prob(JB): 0.994
## Kurtosis: 2.917 Cond. No. 19.9
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
fig, ax = plt.subplots()
py_sharks.hist('wilson1912', bins = np.arange(0.3, 0.6, 0.01), ax = ax)
## array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f6a26611780>],
## dtype=object)
plt.show()
fig2, ax2 = plt.subplots()
sns.regplot(py_sharks['wilson1912'], py_sharks['wilson1916'], ax = ax2)
plt.show()
fig3, ax3 = plt.subplots()
sns.pointplot(y=py_sharks['county'], x=py_sharks['wilson1912'],
hue = py_sharks['beach'], linestyles=' ', markers = 'o')
sns.pointplot(y=py_sharks['county'], x=py_sharks['wilson1916'],
hue = py_sharks['beach'], linestyles=' ', markers = 'x')
## <matplotlib.axes._subplots.AxesSubplot object at 0x7f69f79dd5f8>
plt.show()
I guess that made Python look simpler than R…