Predicting gentrification using longitudinal census data

I collaborated on this report with Ken Steif (primary author) and Michael Fichman for Urban Spatial.

It was supported by Allan Mallach at the Center for Community Progress. Download the full report here.

The study has been featured in Next City and CityLab.


Machine learning models have been used sparingly in community development despite their ubiquity in the private real estate industry. With this project, we demonstrate how predictive modeling can be used in service of community development goals. The report addresses neighborhood change (gentrification and decline) in US cities, which often occurs too rapidly for policymakers and community groups to respond. A new paradigm of data-driven intelligence about neighborhood change could make it possible for these stakeholders to be proactive rather than reactive. The Urban Institute has called for exactly such systems, saying that "through these data systems, local leaders could adopt interventions that secure inclusion in dynamic neighborhoods."

This study uses longitudinal census data to forecast future changes in home values. Armed with accurate predictions about future real estate trends, policymakers and advocates can allocate resources accordingly. These predictions could inform the spatial distribution of low-income housing, historic preservation, or economic development efforts.


Data Visualization




For this project, we used a sample of 29 "Legacy Cities", defined as those that have (at least in parts) suffered from job loss, population decline, or vacancy as a result of deindustrialization. These 29 cities share enough characteristics to be modeled together but are diverse enough to ensure that the resulting model generalizes to cities with a wide range of urban experiences.

We built our models with Census Tract-level data from the Neighborhood Change Database. We were able to use tract-level median home value data from 1990, 2000, and 2010 to forecast home values in 2020. Though most real estate models make predictions at the property or parcel-level, the prevalence of Census data in the field of Community Development makes this resolution appropriate for the study.



The modeling process was underpinned by the theory of Endogenous Gentrification, which asserts that low-income neighborhoods adjacent to wealthy ones are most likely to gentrify. This occurs because people make real estate decisions based on proximity to amenities. This theory implies that we can gain significant predictive power about future neighborhood investment and disinvestment by mining the space-time trends in home values. We built endogenous variables that account for the spatial relationships among the home values within a tract and those within the tracts in its vicinity. For example, spatial lag variables account for the dynamics of a tract's neighbors. We also used spatial statistical methods such as Local Moran's I to build variables that measure how strongly values cluster within a discrete space. These features were foundational to the model, providing immense predictive power.
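The two endogenous features described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the study's actual code: the four-tract adjacency matrix and home values are hypothetical, and the report's variables were built from full tract geographies.

```python
import numpy as np

def row_standardize(W):
    """Row-standardize a binary adjacency matrix so each row sums to 1."""
    rs = W.sum(axis=1, keepdims=True)
    rs[rs == 0] = 1  # isolated tracts keep a zero lag
    return W / rs

def spatial_lag(y, W):
    """Spatial lag: the weighted average of each tract's neighbors' values."""
    return row_standardize(W) @ y

def local_morans_i(y, W):
    """Local Moran's I per tract: z_i * lag(z)_i / m2, where z is the
    deviation from the mean and m2 is the average squared deviation."""
    z = y - y.mean()
    m2 = (z ** 2).sum() / len(z)
    return (z / m2) * spatial_lag(z, W)

# Toy example: four tracts in a line (1-2, 2-3, 3-4 adjacent),
# with hypothetical median home values.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
values = np.array([90_000.0, 110_000.0, 240_000.0, 260_000.0])

print(spatial_lag(values, W))     # each tract's neighborhood average
print(local_morans_i(values, W))  # positive values flag local clustering
```

Here all four local Moran's I values are positive: the two cheap tracts cluster together, as do the two expensive ones, which is exactly the kind of spatial structure these features feed to the model.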

We supplemented our endogenous variables with additional Census and geographic data from each city. We gained additional insight by accounting for phenomena such as income levels, vacancy rates, education levels, and the racial makeup of each tract. We accounted for the location of each tract within its city by measuring its proximity to the city's downtown (as approximated by City Hall). We ultimately developed our strongest predictions by pairing the best combination of these features with a series of prediction algorithms.
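The downtown-proximity feature can be computed from a tract centroid and the City Hall coordinates with a standard haversine distance. A minimal sketch, using Philadelphia's City Hall and a made-up tract centroid as stand-ins:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

city_hall = (39.9526, -75.1635)       # Philadelphia City Hall
tract_centroid = (39.9900, -75.1500)  # hypothetical tract centroid

print(haversine_km(*tract_centroid, *city_hall))  # distance-to-downtown feature
```

In practice this would be computed for every tract centroid against its own city's City Hall, yielding one proximity column per tract.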



In the process of model selection, we tested Random Forest and Gradient Boosting Machine (GBM) models, as well as a simple OLS model for comparison. We achieved relatively similar results with the GBM and Random Forest models but ultimately settled on an ensemble method that stacked predictions from the two model types. This ensemble did not improve dramatically on the Random Forest's overall predictive accuracy, but it did prove more consistent over a series of random cross-validation samples.
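A stacked ensemble of this kind can be expressed with scikit-learn's `StackingRegressor`. This is a generic sketch, not the study's pipeline: the feature matrix below is random synthetic data standing in for the tract-level variables, and the hyperparameters are illustrative defaults.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for tract features (prior values, spatial lag, income, ...)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.6, 0.3, 0.2, -0.1]) + rng.normal(scale=0.1, size=500)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("gbm", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),  # learns how to blend the two models
    cv=5,  # base predictions come from out-of-fold data to avoid leakage
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```

The final linear layer is fit on out-of-fold predictions from the two base learners, which is what tends to make a stack more stable across cross-validation samples than either model alone.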

We evaluated our models by training them on data from 1990 and 2000, predicting for 2010, and validating against observed 2010 data. Ultimately, we achieved a respectable average error of approximately 14% overall and a median error of roughly 8% at the city level. The inclusion of outlier neighborhoods led to some under-prediction in the larger cities within the dataset, and these instances hurt global predictive accuracy. However, error was randomly dispersed among the 29 cities, indicating that the model was likely not overfit.
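The mean and median percentage-error metrics above can be computed as follows. The four observed/predicted pairs here are invented for illustration, so the printed numbers will not match the report's 14% and 8% figures.

```python
import numpy as np

def pct_errors(observed, predicted):
    """Absolute percentage error per tract."""
    return np.abs(predicted - observed) / observed

# Hypothetical observed vs. predicted 2010 tract median home values
observed  = np.array([100_000, 150_000, 200_000, 250_000], dtype=float)
predicted = np.array([112_000, 140_000, 214_000, 242_000], dtype=float)

errs = pct_errors(observed, predicted)
print(f"mean APE:   {errs.mean():.1%}")
print(f"median APE: {np.median(errs):.1%}")
```

The mean is pulled upward by outlier tracts while the median is not, which is why reporting both, as the study does, gives a fuller picture of model accuracy.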

Finally, we input data from 2010 into our ensemble model to forecast tract-level home values in 2020. As I mentioned above, these predictions could be valuable resources for planners and policymakers when trying to manage dynamic urban conditions.