
House Prices: Making Predictions through linear regression

In this post I will explore the Ames Housing dataset, compiled by Dean De Cock for use in data science education. The dataset is hosted on Kaggle.com, along with exploratory and data-modelling kernels, at this link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

Distribution of sale prices for the Ames dataset

This dataset is a compilation of data on home sales in Ames, Iowa between 2006 and 2010. It includes the sale price, as well as descriptive information on the homes, such as location, size, number of bedrooms, and condition.

In this post we will:

  1. Explore the data
  2. Answer some questions regarding the data
  3. Model the data

In addition we want to answer the following real estate questions:

  1. What features are most important in determining the price of a home?
  2. Is there a particular time of the year when a home sells for more money, or less?
  3. Can we accurately predict the price a home will sell for using descriptive data about the home, with a simple model?

Data exploration and visualization

Our target variable is the sale price (SalePrice); let's take a closer look at it:

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
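A summary like the one above comes from pandas' `describe()`. As a minimal sketch (the real data would be loaded with `pd.read_csv("train.csv")` from the Kaggle download; the five values below are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the SalePrice column of the Kaggle training
# data; in practice: pd.read_csv("train.csv")["SalePrice"].
sale_price = pd.Series([34900, 129975, 163000, 214000, 755000], name="SalePrice")

# describe() produces the count/mean/std/quantile summary shown above.
summary = sale_price.describe()
print(summary)
```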

So what are the most important features in determining the price of a home? Let's start by exploring the relationships between our target variable, SalePrice, and our features:

Above are scatter plots between our features and the target variable. Going through the list of features, we can visually identify some clear relationships, and some that are clearly linear.

Clear relationships:

- LotFrontage
- LotArea
- MasVnrArea
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- 2ndFlrSF
- GarageYrBlt
- GarageArea

Linear relationships:

- YearBuilt
- YearRemodAdd
- TotalBsmtSF
- 1stFlrSF
- GrLivArea
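Scatter plots like the ones described above can be produced feature by feature with matplotlib. A minimal sketch on a tiny synthetic frame (column names follow the Ames dataset, but the values are invented):

```python
import os
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small synthetic stand-in for the Ames data; column names match the
# real dataset but the values here are made up for illustration.
df = pd.DataFrame({
    "GrLivArea": [850, 1200, 1710, 2200, 2600],
    "TotalBsmtSF": [600, 880, 1000, 1400, 1600],
    "SalePrice": [90000, 140000, 208500, 260000, 320000],
})

# One scatter panel per feature against the target variable.
features = ["GrLivArea", "TotalBsmtSF"]
fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["SalePrice"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("SalePrice")
fig.tight_layout()
fig.savefig("scatter_features.png")
```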

We can explore these relationship quantitatively using a correlation matrix:

Correlation matrix with all our features and target variable. Lighter colours correspond to highly correlated features.

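The correlation matrix itself is one call in pandas; on the real data the resulting matrix is what gets rendered as the heatmap above (e.g. with `seaborn.heatmap`). A sketch on synthetic values:

```python
import pandas as pd

# Tiny numeric frame with Ames-style column names (values are synthetic).
df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [850, 1200, 1710, 2200, 2600],
    "SalePrice": [90000, 140000, 208500, 260000, 320000],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Sorting one column surfaces the features most correlated with the target.
print(corr["SalePrice"].sort_values(ascending=False))
```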

Some clear correlations with our target variable:

- OverallQual
- GrLivArea
- GarageCars
- GarageArea
- TotalBsmtSF
- 1stFlrSF

We also see correlations within our features:

- GarageYrBlt, YearBuilt (intuitively, garages are usually built at the same time as the home)
- TotRmsAbvGrd, GrLivArea (more rooms means more living area)

Now we can answer question #1:
What features are most important in determining the price of a home?

Our answer is:

  1. OverallQual
  2. GrLivArea
  3. GarageCars
  4. GarageArea
  5. TotalBsmtSF
  6. 1stFlrSF

Let's look at question #2:
Is there a particular time of the year when a home sells for more money, or less?

First we will look at when these homes are sold throughout the year:

We can clearly see that more homes are sold during the summer months of June and July. June has roughly five times as many sales as January or December.

We can also see that the number of homes sold, year to year, is fairly steady with no increasing or decreasing trends. The 2010 data is not complete and therefore does not represent an entire year of sales.
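The monthly sales counts above come from grouping on the dataset's `MoSold` column. A sketch with a handful of synthetic records (the real data has one row per sale):

```python
import pandas as pd

# Synthetic MoSold/YrSold records standing in for the Ames sale dates.
sales = pd.DataFrame({
    "MoSold": [1, 6, 6, 6, 7, 12],
    "YrSold": [2007, 2007, 2008, 2008, 2009, 2009],
})

# Count sales per calendar month to see the seasonal pattern.
by_month = sales["MoSold"].value_counts().sort_index()
print(by_month)
```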

However, does that mean there is a price difference between buying a home in the summer versus the winter months?

Here we can see the range of house prices year to year and month to month. There is no increasing or decreasing trend in the year-to-year comparison, so we can feel comfortable directly comparing the months from any given year.

In the month-to-month comparison, the range of home sale prices does not change considerably. We can therefore conclude there is no significant difference in the sale price of a home throughout the year.
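One simple way to check this numerically is to compare a robust centre, such as the median sale price, across months. A sketch on synthetic values (the real comparison would use the full SalePrice column grouped by MoSold):

```python
import pandas as pd

# Synthetic sale records; two sales per month for three months.
sales = pd.DataFrame({
    "MoSold": [1, 1, 6, 6, 12, 12],
    "SalePrice": [150000, 170000, 155000, 175000, 148000, 172000],
})

# Median sale price per month; similar centres across months support the
# conclusion that timing has little effect on price.
medians = sales.groupby("MoSold")["SalePrice"].median()
print(medians)
```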

Question 3: Can we accurately predict the price a home will sell for using descriptive data about the home, with a simple model?

Let's try fitting some simple linear models to this data, after handling null values and one-hot encoding our categorical variables.

We will try scikit-learn's Ridge regression with various alpha values and see whether the root mean squared error (RMSE) is in an acceptable range for making predictions.
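A minimal sketch of this modelling step, on synthetic data rather than the real preprocessed Kaggle features (the column names and the data-generating process below are invented for illustration; the alpha value follows the post):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Ames features: two numeric columns and one
# categorical column, with a made-up linear relationship to log price.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3000, n),
    "OverallQual": rng.integers(1, 10, n),
    "Neighborhood": rng.choice(["A", "B", "C"], n),
})
log_price = (
    11 + 0.0004 * df["GrLivArea"] + 0.05 * df["OverallQual"]
    + rng.normal(0, 0.1, n)
)

# One-hot encode the categorical column, as in the post.
X = pd.get_dummies(df, columns=["Neighborhood"])
X_tr, X_te, y_tr, y_te = train_test_split(X, log_price, random_state=0)

# Ridge regression with alpha = 0.5, the value the post settles on.
model = Ridge(alpha=0.5).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"RMSE: {rmse:.3f}")
```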

We can see good performance from a simple regression model. Using Ridge, alpha = 0.5 gives optimum performance. Running our model against the test set, we compute an RMSE of 0.11, which is very good for a simple regression model.

For further details, please see the GitHub repo of my work using the Ames dataset: https://github.com/neilmistry/hpart
