Predicting King County Home Prices with Linear Regression
An in-depth walk-through of a linear regression model project
Overview
This article will walk through the process of building linear regression models to predict home prices in King County, Washington. King County is the most populous county in the state of Washington, as well as the 12th most populous county in the United States. It is home to Seattle, the state's most populous city, with two-thirds of the county population living in the Seattle suburbs. The county's total estimated population is 2,274,315 residents according to the 2020 US census.
The process for this analysis will follow the Cross-Industry Standard Process for Data Mining (CRISP-DM). The first steps will be to understand the business objective and the dataset. Next, the data will be prepared for the modeling phase. Several models will then be built and evaluated in an iterative process to optimize accuracy. For the purposes of this project, the models will be evaluated using Root Mean Squared Error (RMSE), coefficient values, and R-squared as the main metrics.
Business Objective/Methods
Hypothetical Situation: A King County real estate agency is looking to provide accurate home value estimates to homeowners who may be interested in listing their property. To accomplish this, the agency will need to know which factors matter most in predicting home value. To determine these factors, several multiple linear regression models will be created to examine which independent variables are most helpful in predicting the dependent variable, home price. The regression models will provide homeowners with relevant information about the likely price of their home should they be interested in listing with the agency.
Data Description
The data used for this project is the King County House Sales dataset, found in kc_house_data.csv. This dataset contains 21,597 rows and 21 columns of information about home sales in King County. Several columns contain large amounts of missing data and will be dropped for this analysis.
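As a rough sketch, the load-and-clean step might look like the following (assuming the CSV sits in the working directory; the specific columns dropped are illustrative, not a prescription):

```python
import pandas as pd

# Load the King County House Sales dataset
df = pd.read_csv('kc_house_data.csv')

# Report missingness per column, worst first
print(df.isna().sum().sort_values(ascending=False).head())

# Drop the sparsest columns; the names below are examples — in practice,
# drop whichever columns the missingness report flags
df = df.drop(columns=['waterfront', 'view', 'yr_renovated'], errors='ignore')
```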


Exploratory Analysis
The first step in the exploratory analysis will be to plot a histogram of each feature to check whether its values are roughly normally distributed; heavily skewed features tend to produce non-normal model residuals.
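Continuing the sketch above, pandas can produce the full grid of histograms in one call:

```python
import matplotlib.pyplot as plt

# One histogram per numeric column to eyeball each distribution
df.hist(figsize=(16, 12), bins=50)
plt.tight_layout()
plt.show()
```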

Looking at the distributions, we can see that many features are not normally distributed. This is largely due to outliers in the data, which will need to be removed in order to give the models more normally distributed residuals. The data also contains several categorical features, including 'floors', 'condition', and 'grade'. These will need to be converted into dummy variables before continuing to the modeling phase.
Next, we will need to check for multicollinearity between the features.

In order to optimize our models, highly correlated features will need to be kept out of the same model to prevent multicollinearity.
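One way to surface these pairs, sketched here with an illustrative 0.75 cutoff, is to scan the upper triangle of the correlation matrix:

```python
import numpy as np

# Absolute pairwise correlations for the numeric features
corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# 0.75 is an illustrative threshold, not a hard rule
high_corr = upper.stack().loc[lambda s: s > 0.75].sort_values(ascending=False)
print(high_corr)
```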
Model Training/Evaluation
Model A Training
For our first model, we will run a basic "vanilla" model to get a feel for how our data performs without removing any outliers or transforming the data in any way. This model's independent variables (our "X") will be 'bedrooms', 'bathrooms', 'sqft_living', and 'grade', with 'price' as the dependent variable (our "Y"). Before running the model, it is important to split the data into training and testing sets: we will train the model on the training set and then measure its performance on the testing set. The data will be split into 80% train and 20% test, with a random state of 42.
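A minimal sketch of this setup, reusing the cleaned DataFrame from earlier, might be:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

features = ['bedrooms', 'bathrooms', 'sqft_living', 'grade']
X = df[features]
y = df['price']

# 80/20 train/test split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit an OLS model on the training set (constant added for the intercept)
model_a = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model_a.summary())
```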
Model A Evaluation

After running Model A in statsmodels OLS, we can see that the p-values of all the coefficients are effectively 0, meaning there is likely a relationship between each feature and the dependent variable (price). Model A produces an R-squared value of .546, meaning that about 55% of the variance in price is explained by the features. Model A also produces a high skew of 3.045, indicating that the residuals are not symmetrically distributed, and a high kurtosis of 30.845, indicating heavy tails and the presence of outliers. The ranked features according to the coefficients for Model A are as follows: 1. Grade, 2. Bedrooms, 3. Bathrooms, 4. Square Foot Living. Model A's test RMSE of 248,719 means the model's predictions are off by roughly $248,719 on average. Going forward we will look to significantly improve on this number.
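The test RMSE quoted above can be computed from the fitted model in the previous sketch along these lines:

```python
import numpy as np

# Predict on the held-out test set and compute RMSE by hand
y_pred = model_a.predict(sm.add_constant(X_test))
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f'Test RMSE: {rmse:,.0f}')
```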

Model B Training
For Model B we will run a model similar to Model A, except this time we will scale our data and add dummy variables for the categorical features. Putting all the features on the same scale makes their coefficients directly comparable. For example, 'sqft_living' contains values in the thousands, while features like 'bedrooms' and 'bathrooms' contain much smaller values, which distorts the coefficient comparison seen in Model A. Categorical features will also need to be converted to dummy variables, since a feature such as 'zipcode' does not have a continuous range of values but instead assigns each home to a category.
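A sketch of this preprocessing, with illustrative column lists, might look like:

```python
from sklearn.preprocessing import StandardScaler

# One-hot encode categorical columns; drop_first avoids the dummy variable trap
categoricals = ['floors', 'condition', 'grade', 'zipcode']
df_b = pd.get_dummies(df, columns=categoricals, drop_first=True)

# Standardize the continuous features so coefficients are comparable.
# In a stricter pipeline the scaler would be fit on the training split only
# to avoid leakage into the test set.
continuous = ['bedrooms', 'bathrooms', 'sqft_living', 'yr_built']
scaler = StandardScaler()
df_b[continuous] = scaler.fit_transform(df_b[continuous])
```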
Model B Evaluation

While Model B produces a better R-squared value of .668 than Model A, some of the zip code dummy variables have high p-values, meaning those variables are not statistically significant and will need to be removed. Model B also produces a higher skew than Model A at 4.983, indicating that the residuals are more heavily skewed and the data will need to be transformed. Model B's kurtosis of 69.100 is also much higher, indicating more outliers than Model A; this is likely driven in part by the insignificant dummy variables that need to be removed.
The ranked features according to the coefficients for Model B are as follows: 1. Zip Code, 2. Square Foot Living, 3. Bathrooms, 4. Bedrooms, 5. Year Built. Model B's test RMSE of 215,931 is a modest improvement over Model A, with predictions off by roughly $215,931 on average. Going forward we will look to further improve on this number, mainly by removing the outliers from our dataset.
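A sketch of how those insignificant dummies could be filtered out, assuming model_b is the fitted statsmodels results object and X_train_b its design matrix (both hypothetical names for this illustration):

```python
# Drop predictors whose p-value exceeds 0.05; the intercept term, if flagged,
# is ignored since add_constant reintroduces it at fit time
insignificant = model_b.pvalues.loc[lambda p: p > 0.05].index
print(f'Dropping {len(insignificant)} features:', list(insignificant))
X_train_b = X_train_b.drop(columns=insignificant, errors='ignore')
```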
Model C Training

For Model C we will remove outliers from "Bathrooms", "Bedrooms", and our dependent variable "Price". We will do this by dropping any values more than three standard deviations from the mean. We will also log transform "Square Foot Lot" in order to get a more normal distribution. This should significantly improve the model's performance. We will also test several different feature combinations from Models A and B to try to improve performance further.
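A sketch of the three-standard-deviation filter and the log transform, applied to a copy of the cleaned data:

```python
import numpy as np

df_c = df.copy()

# Drop rows more than three standard deviations from the mean for each
# chosen column; the bounds are recomputed after each successive filter
for col in ['bedrooms', 'bathrooms', 'price']:
    mu, sigma = df_c[col].mean(), df_c[col].std()
    df_c = df_c[(df_c[col] > mu - 3 * sigma) & (df_c[col] < mu + 3 * sigma)]

# Log-transform the lot size to pull in its long right tail
df_c['sqft_lot'] = np.log(df_c['sqft_lot'])
```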
Model C Evaluation

Model C produces a higher R-squared than both Models A and B at .76. The skew and kurtosis have both decreased significantly, indicating more normally distributed residuals thanks to the outlier removal and log transformation. One zip code still has a high p-value, likely due to the presence of an outlier. The ranked features according to the coefficients for Model C are as follows: 1. Zip Code, 2. Square Foot Above, 3. Square Foot Lot, and 4. Bathrooms. Model C's test RMSE of 135,141 is significantly improved compared to both Models A and B, with predictions off by roughly $135,141 on average. This is a good example of how much removing outliers can improve linear regression performance. Going forward we will look to improve further by removing more outliers and engineering new features for our final model.
Model D Training
In creating our final model, Model D, we will further remove outliers from the "Square Foot Living", "Square Foot Lot", and "Square Foot Living 15" columns. We will also engineer a new feature consisting of the sum of "Square Foot Living" and the average square footage of the 15 nearest neighbors. This should better capture home price, since larger homes tend to be grouped in neighborhoods with other large homes. Finally, we will create a feature measuring each home's distance from Seattle, as we would expect home values to increase the closer they are to the city.
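A sketch of both engineered features, assuming the cleaned data carries the dataset's 'lat', 'long', and 'sqft_living15' columns (the downtown Seattle coordinates below are an assumption for illustration):

```python
import numpy as np

df_d = df_c.copy()

# Combined living-space feature: the home's own square footage plus the
# average square footage of its 15 nearest neighbors
df_d['sum_sqft_living'] = df_d['sqft_living'] + df_d['sqft_living15']

# Approximate distance from each home to downtown Seattle via the
# haversine formula, in miles (Earth radius ~3,959 mi)
seattle_lat, seattle_long = np.radians(47.6062), np.radians(-122.3321)
lat, long = np.radians(df_d['lat']), np.radians(df_d['long'])
a = (np.sin((lat - seattle_lat) / 2) ** 2
     + np.cos(lat) * np.cos(seattle_lat)
     * np.sin((long - seattle_long) / 2) ** 2)
df_d['dist_from_seattle'] = 2 * 3959 * np.arcsin(np.sqrt(a))
```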


Model D Evaluation

Model D produces a slightly higher R-squared than Model C at .805. Model D also has significantly lower skew and kurtosis than Models A and B, while being fairly similar to Model C. All of the p-values in Model D are effectively 0, indicating that all of the features are statistically significant. The ranked features according to the coefficients for Model D are as follows: 1. Zip Code, 2. Sum Square Foot Living, 3. Grade, 4. Year Built, 5. Bathrooms, and 6. Distance from Seattle. Model D's test RMSE of 111,021 improves on Model C, with predictions off by roughly $111,021 on average.
Conclusion

Overall, Zip Code appears to be the most important factor in determining home value; it was the top-ranked feature in Models B, C, and D. Square Foot Living, Square Foot Above, and Sum Square Foot Living tend to be the second most important features, producing the second-ranked coefficients in Models B, C, and D. These findings held as the models became more accurate and the data more normally distributed. Bathrooms and Grade also appear to be fairly significant predictors of price across the models. One interesting finding is that in Model A, Square Foot Living had the smallest coefficient; this is likely because the features were not scaled in that model, whereas in Models B, C, and D all features were scaled before fitting, making their coefficients comparable.
For future work, to further improve the model we will likely need to remove additional outliers and try different feature combinations. Another approach would be to engineer more features from the dataset. Finally, we could test the model on more recent King County home sales data to see whether its performance holds up on new, comparable data.