County choropleth map using Python and D3
I’ve been working on mapping revenue data from natural resource extraction on federal lands. I wanted to get a sense of the variation in the data, so I would know if it was feasible to present the data geographically. Assuming it was feasible to present the data geographically, I also wanted to know which scale I would need to use to display the data in such a way as to convey sufficient variation in the data.
Python Pandas for data analysis
I’ve been using the Pandas library for Python for just this sort of thing. Not only can one quickly derive statistics from the data — such as mean, median, and standard deviation — but Pandas also provides tools for manipulating, organizing, and exporting data, among other things.
The source dataset I used had multiple revenue line items for each county, since it included different types of revenue from extraction on federal land (such as bonuses, rent, and royalties). I manually eliminated some columns in Excel (because that was easier than using Pandas), but I summed up the rate
(revenue) column for each county so there was only one line item for each (I forked the choropleth Observable Notebook from D3 creator Mike Bostock, and lazily retained the column header rate
that he used for unemployment rates…it’s a prototype, after all).
That relatively simple Python code looks like this:
import pandas as pd
file = r'revenue-test.csv'
#Pandas drops leading 0 of FIPS id without this
df = pd.read_csv(file, dtype={'id': 'str'})
#Sums up rate column when multiple entries exist for each id (county, in this case)
df2 = df.groupby(['id', 'state', 'county'])['rate'].sum()
df2 = df2.reset_index()
#Reads file to a new file, omits index from Pandas, and retains header
df2.to_csv('revenue-test-merged.csv', index=False, header=True)
That gives us our revenue file, with each county’s revenue (rate
) summed up and listed as one line item. I renamed the merged file and published it so I could access it from the Observable Notebook.
I also wanted to get a sense of the distribution of the data itself, especially after I initially used a quantize scale for the choropleth and found that the scale could not adequately capture the distribution of values.
Among other problems, this scale had the effect of visually equating (for example) $88 million and $574 million at the top end, and $52 and $12 million at the low end (not to mention negative values on the low end). There was just too much variation in the data for this type of scale.
Using Pandas, I soon discovered the reason why:
Median: $30,928.34
df['rate'].median()
Mean: $5,903,835.33
df['rate'].mean()
Standard deviation: $39,062,215.48
df['rate'].std()
As you can see, the median is only $30,928, while the mean is $5,903,835. There are clearly outliers on the top end of the data pulling on the mean. We can see from the standard deviation — which is nearly $40 million — there is significant variation in the data.
As a result, I explored using a quantile scale instead to preserve the variation in the data.
Quantile choropleth
I settled on a quantile scale that interpolates the values along a range, again, forked from Mike Bostock’s work.
This scale preserves the variations in the data with more fidelity, especially at the lower end of the distribution.
It needs more work, but it serves as a good prototype for evaluating the data.