New Jersey County Zillow Analysis

Walk-through of a real estate time-series modeling project

Justin Giovatto
Jun 21, 2021

Overview

This article walks through the process of analyzing a time series dataset and building SARIMA models to forecast future values. Specifically, the project uses New Jersey real estate data from Zillow and follows the CRISP-DM process. We first analyze the data at the state and county level to understand its past performance over the given time period. The data is then cleaned and prepared for the modeling phase. Next, SARIMA models are fit to the data and the results are analyzed. The main evaluation metrics for this project are forecasted return on investment (ROI) and root mean squared error (RMSE). Finally, once the model results have been analyzed, business recommendations are presented.

Business Objective/Methods

Hypothetical Situation: A real estate investment firm wants to know the top three New Jersey counties to invest in based on return on investment (ROI). The firm also wants to see past ROI dating back to 1996, so that it can compare historical ROI with forecasted ROI. Future ROI will be predicted with SARIMA models. This model type was chosen because it accounts for the seasonal nature of the real estate market, as well as trends and past values, when making its forecasts. The data will be broken out by county for analysis. A stepwise fit will then be used to find the model orders that minimize AIC, with the aim of selecting the best-fitting model for each county. The SARIMA models will then be fitted and used to forecast five years into the future, and each model will be evaluated by its RMSE. Finally, the top three counties will be recommended based on these methods, along with the main counties to avoid and the county with the lowest initial investment.

Data Description

The data used is a Zillow dataset ranging from April 1st, 1996, through April 1st, 2018. The dataset can be found at https://www.zillow.com/research/data/. It contains home value information throughout the United States and consists of 14,723 rows and 272 columns. For the purposes of this project, the data will be filtered to include only New Jersey. The New Jersey data will then be separated into all 21 New Jersey counties so that each county can be modeled for forecasted ROI. The data will also be grouped by month to give a mean home value for each month. Before beginning, we first need to convert the dates into datetime objects to prepare the data for time series modeling. The data also needs to be converted from wide to long format to make it easier to work with and interpret, as seen below.
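The preparation steps above (filter to New Jersey, reshape from wide to long, convert dates, and average by county and month) might look something like this in pandas. The tiny table and the column names `RegionName`, `State`, and `CountyName` are assumptions made for illustration, as is the helper name `to_county_monthly`.

```python
import pandas as pd

# Tiny stand-in for the wide table: one row per region, one column per month.
# Column names and values are assumed for illustration only.
wide = pd.DataFrame({
    "RegionName": ["07030", "07302", "08079", "10001"],
    "State": ["NJ", "NJ", "NJ", "NY"],
    "CountyName": ["Hudson", "Hudson", "Salem", "New York"],
    "1996-04": [300000.0, 200000.0, 100000.0, 500000.0],
    "1996-05": [303000.0, 201000.0, 100500.0, 505000.0],
})


def to_county_monthly(df, state="NJ"):
    """Filter to one state, melt wide to long, and average by county and month."""
    id_cols = ["RegionName", "State", "CountyName"]
    in_state = df[df["State"] == state]
    long = in_state.melt(id_vars=id_cols, var_name="date", value_name="value")
    long["date"] = pd.to_datetime(long["date"], format="%Y-%m")
    # Mean home value per county per month: rows are dates, columns are counties.
    return long.groupby(["date", "CountyName"])["value"].mean().unstack("CountyName")


county_monthly = to_county_monthly(wide)
print(county_monthly)
```

The result has a datetime index and one column per county, which is a convenient shape for fitting one model per county.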

Exploratory Analysis

Looking through the data, we can see that there appear to be no missing values. To get a better feel for the data, we will first plot the overall mean home value for the entire state, as seen below.

Looking at this graph, we can see that overall New Jersey home values increased sharply from 1996 through about 2006 before dropping significantly from 2008 to around 2012. This is likely due to the housing crisis and recession that began in 2008. Since around 2012, home values appear to have increased steadily across the state as the economy recovered. The data also does not appear to be stationary, as it follows a general upward trend. To check this, we will run an augmented Dickey-Fuller (ADF) test, as seen below. Because the p-value from this test is over 0.05%, at 2.55%, we fail to reject the null hypothesis that the data is not stationary.

To explore the data further and begin the past ROI analysis, we will create a function that returns a dictionary of each county's ROI from 1996 through 2018. Once we have this dictionary, we will convert it into a data frame and sort it from highest to lowest. Doing this, we can see that the top three counties by ROI over this time period were Hudson County at 315%, Cape May County at 261%, and Monmouth County at 176%. Going forward, it will be interesting to see whether these counties have less room to grow after such sharp increases and have begun to level off. It will also be interesting to check whether the lowest performing counties turn out to have higher ROI going forward, since they have more room to grow.
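One way such a past-ROI function could be written is shown below, taking ROI as (final value - initial value) / initial value for each county. The function name `past_roi`, the county labels, and the values are hypothetical and used only to show the shape of the calculation.

```python
import pandas as pd


def past_roi(county_monthly):
    """Return {county: ROI} from the first to the last observation in each column."""
    return {
        county: (series.iloc[-1] - series.iloc[0]) / series.iloc[0]
        for county, series in county_monthly.items()
    }


# Hypothetical values for illustration only (not the Zillow figures).
county_monthly = pd.DataFrame(
    {
        "County A": [100000.0, 150000.0, 250000.0],
        "County B": [200000.0, 210000.0, 300000.0],
    },
    index=pd.to_datetime(["1996-04-01", "2007-04-01", "2018-04-01"]),
)

# Convert the dictionary to a data frame sorted from highest to lowest ROI.
roi_table = (
    pd.Series(past_roi(county_monthly), name="roi")
    .sort_values(ascending=False)
    .to_frame()
)
print(roi_table)
```

An ROI of 1.5 here corresponds to a 150% increase over the period, the same convention as the percentages quoted above.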

SARIMA Model Results

After running our SARIMA models, we can see that the New Jersey county with the highest forecasted ROI is Salem County, with a predicted increase of 121% over the next 5 years. The average initial cost for the county in 2018 is just over 150,000 dollars, while the predicted average value in 2023 is just under 350,000 dollars. On the test data, the model's RMSE is 6,334 dollars, meaning its predictions of average home value are typically off by about that amount. The model's 95% confidence interval, bounded by the upper and lower limits, is shown in the graph below.

The county with the 2nd highest forecasted ROI is Sussex, with a predicted increase of 64% over the next 5 years. The average initial cost for the county in 2018 is just under 250,000 dollars, while the predicted average value in 2023 is just under 400,000 dollars. On the test data, the model's RMSE is 1,738 dollars, meaning its predictions of average home value are typically off by about that amount. The model's 95% confidence interval, bounded by the upper and lower limits, is shown in the graph below.

The county with the 3rd highest forecasted ROI is Passaic, with a predicted increase of 55% over the next 5 years. The average initial cost for the county in 2018 is about 300,000 dollars, while the predicted average value in 2023 is about 475,000 dollars. On the test data, the model's RMSE is 9,767 dollars, meaning its predictions of average home value are typically off by about that amount. The model's 95% confidence interval, bounded by the upper and lower limits, is shown in the graph below.
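The two evaluation metrics quoted for each county reduce to short formulas: RMSE compares held-out test values with the model's predictions, and forecasted ROI compares the forecast value with the latest observed value. A minimal sketch follows, with hypothetical numbers and helper names (`rmse`, `forecast_roi`) that are not taken from the project.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between held-out values and model predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def forecast_roi(current_value, forecast_value):
    """Fractional return implied by a forecast relative to the latest observed value."""
    return (forecast_value - current_value) / current_value


# Hypothetical numbers for illustration only (not the article's data).
test_actual = [250000, 252000, 255000]
test_predicted = [249000, 254000, 253000]

print(rmse(test_actual, test_predicted))  # error in the same units as the data (dollars)
print(forecast_roi(200000, 300000))       # 0.5, i.e. a 50% forecasted increase
```

Because RMSE is expressed in dollars, it is easiest to interpret relative to the county's price level: the same RMSE is a larger relative error for a lower-priced county.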

Business Recommendations

Overall business recommendations are as follows:

1. Invest in Salem, Sussex, and Passaic County for the highest forecasted ROI.

2. The best investment appears to be Salem County, which combines the lowest initial cost (just over 150,000 dollars) with the highest forecasted ROI of 121%, almost double the next highest forecast.

3. Avoid the past top ROI counties: Hudson, Cape May, and Monmouth. Despite their large past ROI, these counties may have leveled off in value and likely will not be good investment options going forward.

Future Work

Future work on this project could include:

1. Add another model type, such as an RNN, in order to compare its forecasts with the SARIMA models used here and check for similarities and differences between the models.

2. Go deeper into the analysis by modeling the top counties by city in order to determine the top ROI cities within each county, thereby providing even more targeted recommendations on where to invest.

3. Take into account another evaluation metric along with ROI such as risk in order to provide more detailed investment recommendations.
