I collaborated on this tutorial with Ken Steif using data from the San Francisco Office of the Assessor-Recorder. You can find Ken's writeup of the tutorial on his site and download all of the code for the project on my GitHub.
You don't need me to tell you that R is an indispensable tool for wrangling and visualizing large, sophisticated datasets. The tidyverse packages in R contain unparalleled power to display complex trends with beautiful plots.
If you've read anything else on this website you've probably figured out that I am naturally drawn to spatial data. With a background in GIS, I have spent a lot of time the past year exploring R's capabilities for spatial analysis. It turns out they are extensive. The ability to incorporate mapping and spatial analysis into my R data visualization workflow has been a revelation.
I worked with Ken Steif (who shares my interest in spatial data) to put together this tutorial. We incorporate maps into a set of data visualizations to explore time-space trends in San Francisco's residential real estate market. We have made all of our code and data available so you can recreate this on your own.
Here's another thing you don't need me to tell you: housing is really expensive in San Francisco. Not only is it really expensive, but it is getting more expensive quickly. We wanted to visualize this trend and see how it has varied from neighborhood to neighborhood.
We used a dataset of 17,527 home sale transactions from 2009 to 2015. For the purposes of this tutorial we cleaned the data and joined each sale to a neighborhood.

To begin, open a new R script, set your working directory and run the following code. It runs through all of the necessary setup before we start making visualizations: setting global options, installing and loading packages, defining plot and map themes, establishing color ramps and reading in the data from GitHub.
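The setup step might look something like the sketch below. The data URL, column names, and theme details here are placeholders, not the authors' actual code; substitute the raw file path from the project's GitHub repo.

```r
# Global options: suppress scientific notation so price axes read naturally
options(scipen = 999)

# Install once if needed: install.packages(c("tidyverse", "ggmap"))
library(tidyverse)
library(ggmap)

# A simple reusable plot theme (illustrative only)
plotTheme <- function() {
  theme_minimal() +
    theme(plot.title   = element_text(face = "bold"),
          panel.grid.minor = element_blank())
}

# A color ramp for price maps (illustrative palette)
priceRamp <- c("#ffffb2", "#fd8d3c", "#bd0026")

# Hypothetical URL -- replace with the raw CSV path from the project repo
sales <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/master/sf_sales.csv")
```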
Now we're ready to start building plots. First we will look at a simple histogram of sale prices in San Francisco for the whole study period.
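A minimal version of that histogram, assuming the data frame is called `sales` and the price column is named `SalePrice` (both names are assumptions):

```r
library(ggplot2)

# Histogram of all sale prices, 2009-2015 (column name assumed)
ggplot(sales, aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "steelblue") +
  labs(title = "San Francisco home sale prices, 2009-2015",
       x = "Sale price ($)", y = "Count of sales")
```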
After we remove some outliers we will plot the distribution of sale prices by year using violin plots.
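One way to sketch this step, trimming everything above the 99th percentile as an illustrative outlier rule (the actual cutoff, and the `SalePrice`/`SaleYr` column names, are assumptions):

```r
library(dplyr)
library(ggplot2)

# Drop extreme outliers above the 99th percentile (illustrative threshold)
sales_trimmed <- sales %>%
  filter(SalePrice <= quantile(SalePrice, 0.99, na.rm = TRUE))

# One violin per year shows how the whole distribution shifts upward
ggplot(sales_trimmed, aes(x = factor(SaleYr), y = SalePrice)) +
  geom_violin(fill = "steelblue") +
  labs(x = "Year of sale", y = "Sale price ($)")
```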
OK, we can see that prices have been increasing every year. Now we want to map them. Before we map any home sales we define and download a basemap using the ggmap package, which can pull map tiles from several different sources. For this project we chose one from Stamen.
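A sketch of the basemap download, using approximate coordinates for San Francisco's bounding box. `get_stamenmap()` was ggmap's function for Stamen tiles at the time (Stamen tiles have since moved to Stadia, which newer ggmap versions require a key for):

```r
library(ggmap)

# Approximate bounding box for San Francisco
sf_bbox <- c(left = -122.52, bottom = 37.70, right = -122.35, top = 37.82)

# "toner-lite" is a muted style that lets data layers stand out
basemap <- get_stamenmap(bbox = sf_bbox, zoom = 12, maptype = "toner-lite")
ggmap(basemap)
</imports>
```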
Let's plot some home sale points. We will use small multiple maps to plot sale prices by year.
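The small multiples can be built by layering points over the basemap and faceting by year. The `long`, `lat`, `SalePrice`, and `SaleYr` column names are assumptions:

```r
library(ggmap)
library(ggplot2)

# One panel per year; color encodes sale price (column names assumed)
ggmap(basemap) +
  geom_point(data = sales_trimmed,
             aes(x = long, y = lat, color = SalePrice),
             size = 0.5, alpha = 0.6) +
  scale_color_gradient(low = "yellow", high = "red") +
  facet_wrap(~ SaleYr, ncol = 4)
```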
You can see how the highest priced home sales have gradually expanded from clustering primarily around downtown in 2009 to most of the city in 2015. To get a more granular perspective we will look at the trend in just one neighborhood: The Mission District.
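Zooming into one neighborhood is just a filter on the same data; the `Neighborhood` column name and the "Mission" label are assumptions about how the joined data is coded:

```r
library(dplyr)
library(ggmap)
library(ggplot2)

# Subset to the Mission District (neighborhood label assumed)
mission_sales <- filter(sales_trimmed, Neighborhood == "Mission")

# Same faceted point map, now for a single neighborhood
ggmap(basemap) +
  geom_point(data = mission_sales,
             aes(x = long, y = lat, color = SalePrice),
             size = 1, alpha = 0.7) +
  scale_color_gradient(low = "yellow", high = "red") +
  facet_wrap(~ SaleYr, ncol = 4)
```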
It looks like there have been some significant changes in this neighborhood in the last few years. These point maps are interesting, but we really want to look at every neighborhood in the city. To do this we will need to do some data wrangling to summarize each of these points by neighborhood.
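The summarizing step is a standard dplyr group-and-summarize; a sketch, with column names assumed as before:

```r
library(dplyr)

# Median sale price for each neighborhood in each year
nhood_summary <- sales_trimmed %>%
  group_by(Neighborhood, SaleYr) %>%
  summarize(medianPrice = median(SalePrice, na.rm = TRUE),
            .groups = "drop")
```

The result is one row per neighborhood-year, which is the tidy shape the polygon maps and time series plots below both want.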
Now we have a tidy dataset that we can use to plot neighborhood polygons. Let's map neighborhood median home price over time.
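A sketch of the choropleth, assuming the neighborhood boundaries have already been converted to a plottable data frame (here called `nhoods_df`, e.g. via `ggplot2::fortify()` or `broom::tidy()`, with a `Neighborhood` column to join on — all hypothetical names):

```r
library(dplyr)
library(ggplot2)

# Attach the yearly medians to the polygon coordinates
nhoods_joined <- left_join(nhoods_df, nhood_summary, by = "Neighborhood")

# One filled map per year
ggplot(nhoods_joined) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = medianPrice),
               color = "white") +
  coord_quickmap() +
  facet_wrap(~ SaleYr)
```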
Then percentage change over this time period:
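Percent change can be computed by spreading the first and last years into columns; a sketch using tidyr's `pivot_wider()` (the 2009/2015 endpoints come from the study period, the column names are assumptions):

```r
library(dplyr)
library(tidyr)

# Percent change in median price, 2009 to 2015, per neighborhood
pct_change <- nhood_summary %>%
  filter(SaleYr %in% c(2009, 2015)) %>%
  pivot_wider(names_from = SaleYr, values_from = medianPrice,
              names_prefix = "yr") %>%
  mutate(pctChange = (yr2015 - yr2009) / yr2009 * 100)
```

Joining `pct_change` to the polygons and filling by `pctChange` produces the map, exactly as with the median-price choropleth above.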
We can start to see some dynamics emerging. The high-end areas around The Marina, Russian Hill and Nob Hill did not change as dramatically as the southern neighborhoods. Now we will examine trends in the highest-appreciating neighborhoods by building some time series plots.
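One way to sketch the time series step: pick the top appreciators from the percent-change table, then draw one line per neighborhood (`slice_max()` requires dplyr 1.0+; the choice of eight neighborhoods is arbitrary):

```r
library(dplyr)
library(ggplot2)

# The eight fastest-appreciating neighborhoods (count is illustrative)
top_nhoods <- pct_change %>%
  slice_max(pctChange, n = 8) %>%
  pull(Neighborhood)

ggplot(filter(nhood_summary, Neighborhood %in% top_nhoods),
       aes(x = SaleYr, y = medianPrice, color = Neighborhood)) +
  geom_line() +
  labs(x = "Year", y = "Median sale price ($)")
```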
This is interesting but we would like to add some geographical context. Let's create a locator map and arrange it on the page with our time series plot.
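Arranging two ggplot objects side by side is typically done with gridExtra; in this sketch, `locatorMap` (a small map highlighting the featured neighborhoods) and `timeSeriesPlot` (the line chart above) are hypothetical objects standing in for the plots built earlier:

```r
library(gridExtra)

# Place the locator map beside the time series, giving the chart more room
grid.arrange(locatorMap, timeSeriesPlot, ncol = 2, widths = c(1, 2))
```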
Let's try to make one more connection within these data. Let's plot percent change in sale price as a function of initial, 2009 prices.
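A sketch of that scatterplot, reusing the hypothetical `pct_change` table with its `yr2009` starting-price column; the linear trend line is an illustrative choice:

```r
library(ggplot2)

# Appreciation as a function of where prices started in 2009
ggplot(pct_change, aes(x = yr2009, y = pctChange)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Median sale price, 2009 ($)",
       y = "Percent change in median price, 2009-2015")
```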
For the most part, lower priced neighborhoods in 2009 appreciated most rapidly over the course of the study period.