Week 11: Messy Data

This week we are going to do a short exercise on working with different kinds of data. The datasets we've used in our in-class exercises have mostly been pretty clean; we're going to use multiple representations of the same underlying data to see the different ways data can be represented while still conforming to a common file format.

The data this week is drawn from a well-known paper on the effect of shark attacks on the 1912 presidential election. The result---that Woodrow Wilson may have lost his home state of New Jersey because he was blamed for shark attacks that happened while he was governor of the state---has been controversial.

In [1]:
import csv
import json
from pprint import pprint as pretty_print

(1.1) We're going to be working with files a lot today. For this, we'll want a convenient way to peek at the contents of a file. Write a head() function that takes a filename and prints the first $n$ lines of the file.

In [2]:
def head(fname, n=5):
    """Print the first n lines of a file."""
    with open(fname, "r") as fin:
        for line in fin.readlines()[:n]:
            print(line.strip())

A quick digression on context managers. In previous weeks, we opened a file, did something with the contents of the file, and then had to remember to close the file. We can simplify this by using a context manager and the with keyword.

with open(filename, "r") as my_file:
    do_stuff_with_my_file()

Here we open a file (called filename) and assign it to a variable called my_file. This is equivalent to writing:

my_file = open(filename, "r")

The with keyword defines a new indented block - the context in which we are using the file. When we exit this block, either by returning to the previous level indentation, or by encountering an error, the context manager calls my_file.close() for us.

(1.2) Let's begin by peeking at sharks.csv. This is a pretty generic CSV.

In [3]:
head("bad_sharks/shark.csv")
county,wilson1912,wilson1916,beach,machine,mayhew,attack,coastal
"ATLANTIC, NJ",0.36045312881469727,0.36012208461761475,1,0,0,0,1
"BERGEN, NJ",0.42121848464012146,0.3835897445678711,0,1,0,0,1
"BURLINGTON, NJ",0.4130434989929199,0.4256618916988373,0,0,0,0,0
"CAMDEN, NJ",0.3939720094203949,0.4330543875694275,0,0,1,0,1

Parsing nightmares. You might notice that .strip().split() is going to not work here. Occasionally, there will be characters in the file that present a problem for parsing. For example, what if you have a CSV, but one of the values contains a comma? That would be troublesome. The common workaround for this is to wrap that variable in quotation marks. (If that field also tends to use quotation marks, then use single quotes. If it uses both, then, well, things get messy... thankfully, parsing csv files is something people have been doing with computer since the very beginning, and there is a package for that. You don't need to reinvent the wheel.

(1.3) Let's open this file into a list of dictionaries using csv.DictReader.

In [4]:
# Read in the .csv file (default)

with open("bad_sharks/shark.csv", "r") as fin:
    read = csv.DictReader(fin)
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

Why did this work? This file conforms to the default csv format. More explicitly:

In [5]:
# Read in the .csv file with the default parameters made explicit

with open("bad_sharks/shark.csv", "r") as fin:
    read = csv.DictReader(fin, quotechar='"', fieldnames=None, delimiter=',')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

(2.0) Now we're going to load non-default files!

(2.1) A common variant on comma-separated value files are tab-separated value files (TSVs). While spaces are ' ', tabs are '\t'.

In [6]:
head("bad_sharks/shark.tsv")
county	wilson1912	wilson1916	beach	machine	mayhew	attack	coastal
ATLANTIC, NJ	0.36045312881469727	0.36012208461761475	1	0	0	0	1
BERGEN, NJ	0.42121848464012146	0.3835897445678711	0	1	0	0	1
BURLINGTON, NJ	0.4130434989929199	0.4256618916988373	0	0	0	0	0
CAMDEN, NJ	0.3939720094203949	0.4330543875694275	0	0	1	0	1
In [7]:
# Read in the file using the default parameters. Describe what goes wrong in a comment.
In [8]:
# Read in the file correctly by specifying the delimiter correctly. 

with open("bad_sharks/shark.tsv", "r") as fin:
    read = csv.DictReader(fin, quotechar='"', fieldnames=None, delimiter='\t')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

(2.2) Going back to a .csv, what if the field was wrapped in single quotes instead?

In [9]:
head("bad_sharks/shark_nj_single_quote.csv")
county,wilson1912,wilson1916,beach,machine,mayhew,attack,coastal
'ATLANTIC, NJ',0.36045312881469727,0.36012208461761475,1,0,0,0,1
'BERGEN, NJ',0.42121848464012146,0.3835897445678711,0,1,0,0,1
'BURLINGTON, NJ',0.4130434989929199,0.4256618916988373,0,0,0,0,0
'CAMDEN, NJ',0.3939720094203949,0.4330543875694275,0,0,1,0,1
In [10]:
# Read in the file using the default parameters. Describe what goes wrong in a comment.

with open("bad_sharks/shark_nj_single_quote.csv", "r") as fin:
    read = csv.DictReader(fin, quotechar="'", fieldnames=None, delimiter=',')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])
In [11]:
# Read in the file correctly by specifying the quotechar correctly. 

with open("bad_sharks/shark_nj_single_quote.csv", "r") as fin:
    read = csv.DictReader(fin, quotechar="'", fieldnames=None, delimiter=',')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

(2.3) What if the commas are there, but the field isn't quoted at all?

In [12]:
head("bad_sharks/shark_nj_unquoted.csv")
county,wilson1912,wilson1916,beach,machine,mayhew,attack,coastal
ATLANTIC, NJ,0.36045312881469727,0.36012208461761475,1,0,0,0,1
BERGEN, NJ,0.42121848464012146,0.3835897445678711,0,1,0,0,1
BURLINGTON, NJ,0.4130434989929199,0.4256618916988373,0,0,0,0,0
CAMDEN, NJ,0.3939720094203949,0.4330543875694275,0,0,1,0,1
In [13]:
# Read in the file using the default parameters. Describe what goes wrong in a comment.

NOTE: Please don't spend time trying to implement a workaround to read it correctly. If you run into problems like this, get better data or go complain to your local data provider. (I've provided a solution at the bottom of the file.)

(2.4) It's not uncommon for a CSV to lack a header row.

In [14]:
head("bad_sharks/shark_no_header.csv")
"ATLANTIC, NJ",0.36045312881469727,0.36012208461761475,1,0,0,0,1
"BERGEN, NJ",0.42121848464012146,0.3835897445678711,0,1,0,0,1
"BURLINGTON, NJ",0.4130434989929199,0.4256618916988373,0,0,0,0,0
"CAMDEN, NJ",0.3939720094203949,0.4330543875694275,0,0,1,0,1
"CAPE MAY, NJ",0.4350515604019165,0.41938772797584534,1,0,0,0,1
In [15]:
# Read in the file using the default parameters. Describe what goes wrong in a comment.
In [16]:
# Read in the file correctly by specifying the fieldnames parameter. 

sharks_header = ['county', 'wilson1912', 'wilson1916', 'beach', 'machine', 'mayhew', 'attack', 'coastal']

with open("bad_sharks/shark_no_header.csv", "r") as fin:
    read = csv.DictReader(fin, quotechar='"', fieldnames=sharks_header, delimiter=',')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

(3.0) Another common format is JSON, as we saw a few weeks ago. When dealing with multiple "rows" of data in JSON, data is often structured one of two ways:

  1. as a list of key-value pairs, e.g., [{row 1}, {row 2}, {row 3}...]. Note the list-like brackets. There may be newlines separating the individual dictionaries, but the key here is the brackets.
  2. as a series of individual JSON objects, one per line. This form would not have the initial [ and would look something like
    {row 1}
    {row 2}
    {row 3}

(3.1) The first of these cases is what the json library is designed for.

In [17]:
head("bad_sharks/shark.json", n=11)
[
{
"county": "ATLANTIC, NJ",
"wilson1912": 0.3605,
"wilson1916": 0.3601,
"beach": 1,
"machine": 0,
"mayhew": 0,
"attack": 0,
"coastal": 1
},

You can use the following syntax inside of a with statement to load the file:

file_content = file_object.read()
df = json.loads(file_content)

In [18]:
# Read in the fully formatted .json file

with open("bad_sharks/shark.json", "r") as fin:
    df = json.load(fin)
pretty_print(df[0])
{'attack': 0,
 'beach': 1,
 'coastal': 1,
 'county': 'ATLANTIC, NJ',
 'machine': 0,
 'mayhew': 0,
 'wilson1912': 0.3605,
 'wilson1916': 0.3601}

(3.2) The second case is more common, but a bit more complicated.

In [19]:
head("bad_sharks/many_sharks.json")
{"county":"ATLANTIC, NJ","wilson1912":0.3605,"wilson1916":0.3601,"beach":1,"machine":0,"mayhew":0,"attack":0,"coastal":1}
{"county":"BERGEN, NJ","wilson1912":0.4212,"wilson1916":0.3836,"beach":0,"machine":1,"mayhew":0,"attack":0,"coastal":1}
{"county":"BURLINGTON, NJ","wilson1912":0.413,"wilson1916":0.4257,"beach":0,"machine":0,"mayhew":0,"attack":0,"coastal":0}
{"county":"CAMDEN, NJ","wilson1912":0.394,"wilson1916":0.4331,"beach":0,"machine":0,"mayhew":1,"attack":0,"coastal":1}
{"county":"CAPE MAY, NJ","wilson1912":0.4351,"wilson1916":0.4194,"beach":1,"machine":0,"mayhew":0,"attack":0,"coastal":1}

You will need to manually loop through the file and call json.loads() on each line.

In [20]:
# Read in the line-by-line formatted .json file

with open("bad_sharks/many_sharks.json", "r") as fin:
    df = []
    for line in fin.readlines():
        row = json.loads(line)
        df.append(row)
pretty_print(df[0])
{'attack': 0,
 'beach': 1,
 'coastal': 1,
 'county': 'ATLANTIC, NJ',
 'machine': 0,
 'mayhew': 0,
 'wilson1912': 0.3605,
 'wilson1916': 0.3601}

(4.0) Got extra time? Feel free to work on loading the data for your own project!

Appendix 1: Handling the unquoted-field issue

Hopefully you will not have to deal with data where fields should be quoted but are not. But if you do, you can perform a sort of surgery on the file, line by line, to get things properly quoted.

In [21]:
with open("bad_sharks/shark_nj_unquoted.csv", "r") as fin:
    ll = []
    
    # keep the header in its current form
    ll.append(fin.readline().strip())
    
    # for each line...
    for line in fin.readlines():
        
        # split on the delimiter
        line = line.strip().split(',')
        
        # merge the values that are being inappropriately split and quote-escape them
        line[0] = '"{},{}"'.format(line[0], line[1])
        
        # then remove the now-redundant value
        del line[1]
        
        # then put it all back together again
        line = ','.join(line)
        ll.append(line)
        
    # now we have a list of strings that can be fed into DictReader just like the original file
    read = csv.DictReader(ll, quotechar='"', fieldnames=None, delimiter=',')
    df = [line for line in read]
pretty_print(df[0])
OrderedDict([('county', 'ATLANTIC, NJ'),
             ('wilson1912', '0.36045312881469727'),
             ('wilson1916', '0.36012208461761475'),
             ('beach', '1'),
             ('machine', '0'),
             ('mayhew', '0'),
             ('attack', '0'),
             ('coastal', '1')])

Appendix 2: Doing all this in Pandas

Don't worry about this for now. Soon we'll learn about Pandas, and you might be interested in converting all this into a Pandas workflow.

In [22]:
import pandas as pd

All the code you've used so far will work fine with Pandas, because you can turn a list of dictionaries into a Pandas DataFrame really easily.

In [23]:
with open("bad_sharks/shark.csv", "r") as fin:
    read = csv.DictReader(fin)
    list_of_dicts = [line for line in read]
df = pd.DataFrame(list_of_dicts)
df
Out[23]:
county wilson1912 wilson1916 beach machine mayhew attack coastal
0 ATLANTIC, NJ 0.36045312881469727 0.36012208461761475 1 0 0 0 1
1 BERGEN, NJ 0.42121848464012146 0.3835897445678711 0 1 0 0 1
2 BURLINGTON, NJ 0.4130434989929199 0.4256618916988373 0 0 0 0 0
3 CAMDEN, NJ 0.3939720094203949 0.4330543875694275 0 0 1 0 1
4 CAPE MAY, NJ 0.4350515604019165 0.41938772797584534 1 0 0 0 1
5 CUMBERLAND, NJ 0.39153438806533813 0.44574469327926636 0 0 0 0 1
6 ESSEX, NJ 0.3417190611362457 0.38969072699546814 0 1 1 0 1
7 GLOUCESTER, NJ 0.4040084183216095 0.41201716661453247 0 0 0 0 1
8 HUDSON, NJ 0.5514705777168274 0.512269914150238 0 1 1 0 1
9 HUNTERDON, NJ 0.5443425178527832 0.5669371485710144 0 0 0 0 0
10 MERCER, NJ 0.3817427456378937 0.42769232392311096 0 0 1 0 0
11 MIDDLESEX, NJ 0.45501023530960083 0.4568527936935425 0 0 1 0 1
12 MONMOUTH, NJ 0.4959183633327484 0.47979798913002014 1 0 0 1 1
13 MORRIS, NJ 0.42033541202545166 0.4435318410396576 0 0 0 0 0
14 OCEAN, NJ 0.384458065032959 0.3801819980144501 1 0 0 1 1
15 PASSAIC, NJ 0.3881579041481018 0.415433406829834 0 0 1 0 0
16 SALEM, NJ 0.46351489424705505 0.45102041959762573 0 0 0 0 1
17 SOMERSET, NJ 0.43279021978378296 0.436804860830307 0 0 0 0 0
18 SUSSEX, NJ 0.5435004830360413 0.556690514087677 0 0 0 0 0
19 UNION, NJ 0.4117647111415863 0.3820459246635437 0 1 0 0 1
20 WARREN, NJ 0.5769633054733276 0.6191467046737671 0 0 0 0 0

But if you want to use Pandas's read_csv or read_json functions directly, the equivalent function calls are as follows:

In [24]:
df = pd.read_csv("bad_sharks/shark.csv")
df = pd.read_csv("bad_sharks/shark.tsv", sep='\t')
df = pd.read_csv("bad_sharks/shark_nj_single_quote.csv", quotechar="'")
df = pd.read_csv("bad_sharks/shark_no_header.csv", header=None, names=['county', 'wilson1912', 'wilson1916', 'beach', 'machine', 'mayhew', 'attack', 'coastal'])
df = pd.read_json("bad_sharks/shark.json")
df = pd.read_json("bad_sharks/many_sharks.json", lines=True)

Note that pd.read_csv() is going to have the same problem with improperly unquoted fields as csv.DictReader().

In [25]:
df = pd.read_csv("bad_sharks/shark_nj_unquoted.csv")
In [26]:
df
Out[26]:
county wilson1912 wilson1916 beach machine mayhew attack coastal
ATLANTIC NJ 0.360453 0.360122 1 0 0 0 1
BERGEN NJ 0.421218 0.383590 0 1 0 0 1
BURLINGTON NJ 0.413043 0.425662 0 0 0 0 0
CAMDEN NJ 0.393972 0.433054 0 0 1 0 1
CAPE MAY NJ 0.435052 0.419388 1 0 0 0 1
CUMBERLAND NJ 0.391534 0.445745 0 0 0 0 1
ESSEX NJ 0.341719 0.389691 0 1 1 0 1
GLOUCESTER NJ 0.404008 0.412017 0 0 0 0 1
HUDSON NJ 0.551471 0.512270 0 1 1 0 1
HUNTERDON NJ 0.544343 0.566937 0 0 0 0 0
MERCER NJ 0.381743 0.427692 0 0 1 0 0
MIDDLESEX NJ 0.455010 0.456853 0 0 1 0 1
MONMOUTH NJ 0.495918 0.479798 1 0 0 1 1
MORRIS NJ 0.420335 0.443532 0 0 0 0 0
OCEAN NJ 0.384458 0.380182 1 0 0 1 1
PASSAIC NJ 0.388158 0.415433 0 0 1 0 0
SALEM NJ 0.463515 0.451020 0 0 0 0 1
SOMERSET NJ 0.432790 0.436805 0 0 0 0 0
SUSSEX NJ 0.543500 0.556691 0 0 0 0 0
UNION NJ 0.411765 0.382046 0 1 0 0 1
WARREN NJ 0.576963 0.619147 0 0 0 0 0
In [ ]: