In this post I will explore the Ames Housing dataset, which was compiled by Dean De Cock for use in data science education. You can find the dataset hosted on Kaggle, along with exploratory and data-modelling kernels, at this link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

This dataset is a compilation of data on home sales in Ames, Iowa from 2006 through 2010. It includes the sale price as well as descriptive information on each home, such as its location, size, number of bedrooms, and condition.
In this post we will:
- Explore the data
- Answer some questions regarding the data
- Model the data
In addition, we want to answer the following real estate questions:
- What features are most important to determining the price of a home?
- Is there a particular time of year when a home sells for more money, or for less?
- Can we accurately predict the price a home will sell for using descriptive data about the home and a simple model?
Data exploration and visualization
Our target variable is the sale price (SalePrice); let's take a closer look at it:
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
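These numbers come from pandas' describe(). A minimal sketch of producing them, assuming the Kaggle training file has been downloaded locally as train.csv (an assumed filename):

```python
# Load the Kaggle training set and summarise the target variable.
import pandas as pd

df = pd.read_csv("train.csv")        # assumed local path to the Kaggle train file
print(df["SalePrice"].describe())    # count, mean, std, min, quartiles, max
```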
So what are the most important features in determining the price of a home? Let's start by exploring the relationships between our target variable SalePrice and our features:
Above are scatter plots of our features against the target variable. Going through the list of features, we can visually identify some clear relationships and some clearly linear relationships (a sketch of the plotting code follows the lists below).
Clear relationships:
- LotFrontage
- LotArea
- MasVnrArea
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- 2ndFlrSF
- GarageYrBlt
- GarageArea
Linear relationships:
- YearBuilt
- YearRemodAdd
- TotalBsmtSF
- 1stFlrSF
- GrLivArea
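The scatter plots above can be reproduced with a rough sketch like the following, assuming df is the training DataFrame loaded earlier (the exact figure layout in the original plots may differ):

```python
# Plot each numeric feature against SalePrice in a grid of scatter plots.
import matplotlib.pyplot as plt

numeric_cols = df.select_dtypes(include="number").columns.drop(["SalePrice", "Id"])
n_rows = (len(numeric_cols) + 3) // 4
fig, axes = plt.subplots(n_rows, 4, figsize=(16, 4 * n_rows))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.scatter(df[col], df["SalePrice"], s=5, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("SalePrice")
plt.tight_layout()
plt.show()
```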
We can explore these relationships quantitatively using a correlation matrix:

Visually, the lighter-coloured cells indicate more strongly correlated pairs of variables.
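A heatmap like the one described can be generated with seaborn; this is a minimal sketch assuming df from above, and the original figure may have been produced differently:

```python
# Compute pairwise correlations between numeric features and plot them as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis")   # lighter cells = stronger positive correlation
plt.show()
```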
Some clear correlations with our target variable:
- OverallQual
- GrLivArea
- GarageCars
- GarageArea
- TotalBsmtSF
- 1stFlrSF
We also see correlations within our features:
- GarageYrBlt, YearBuilt (intuitively, we know that garages are usually built at the same time as the home)
- TotRmsAbvGrd, GrLivArea (more rooms = more living area)
Now we can answer question #1:
What features are most important to determining the price of a home?
Our answer is:
- OverallQual
- GrLivArea
- GarageCars
- GarageArea
- TotalBsmtSF
- 1stFlrSF
Let's look at question #2:
Is there a particular time of year when a home sells for more money, or for less?
First we will look at when these homes are sold throughout the year:
We can clearly see that more homes are sold during the summer months, such as June and July. June has roughly five times as many sales as January or December.
We can also see that the number of homes sold year to year is fairly steady, with no increasing or decreasing trend. The 2010 data is incomplete and therefore does not represent an entire year of sales.
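A sketch of the sales-count plots, assuming df from above; MoSold and YrSold are the month and year of sale in the Ames data:

```python
# Count the number of sales per month and per year.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
df["MoSold"].value_counts().sort_index().plot(kind="bar", ax=ax1, title="Sales by month")
df["YrSold"].value_counts().sort_index().plot(kind="bar", ax=ax2, title="Sales by year")
plt.tight_layout()
plt.show()
```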
However, does that mean there is a price difference between buying a home in the summer versus the winter months?
Here we can see the range of house prices year to year and month to month. There is no increasing or decreasing trend in our year to year comparison, so we can feel comfortable directly comparing the months from any given year.
In our month to month comparison we can see the range of home sale prices does not change considerably. Therefore we can conclude there is no significant difference in the sale price of a home throughout the year.
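The year-to-year and month-to-month price comparisons can be sketched as boxplots, again assuming df from above:

```python
# Compare the distribution of sale prices across years and across months.
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(x="YrSold", y="SalePrice", data=df, ax=ax1)   # year-to-year price range
sns.boxplot(x="MoSold", y="SalePrice", data=df, ax=ax2)   # month-to-month price range
plt.tight_layout()
plt.show()
```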
Question 3: Can we accurately predict the price a home will sell for using descriptive data about the home and a simple model?
Let's try fitting some simple linear models to this data, after handling null values and one-hot encoding our categorical variables.
We will try scikit-learn's Ridge regression with various alpha values and see whether the root mean squared error (RMSE) is in an acceptable range for making predictions.
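A minimal, illustrative sketch of this workflow, assuming df from above. The log-transformed target and the simple median/one-hot preprocessing here are my assumptions (a log-scale target is consistent with an RMSE around 0.11); the actual pipeline in the repo may differ:

```python
# Fit Ridge regression at several alpha values and report RMSE on a held-out split.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

y = np.log1p(df["SalePrice"])                 # assumption: model the log of SalePrice
X = df.drop(columns=["SalePrice", "Id"])
X = X.fillna(X.median(numeric_only=True))     # simple null handling for numeric columns
X = pd.get_dummies(X)                         # one-hot encode categorical variables

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for alpha in [0.1, 0.5, 1.0, 5.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"alpha={alpha}: RMSE = {rmse:.3f}")
```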

We can see that we get good performance from a simple regression model. Using Ridge, we can set alpha = 0.5 for the best results. Running the model against the test set gives an RMSE of 0.11, which is very good for a simple regression model.
For further details, please see the GitHub repo of my work using the Ames dataset: https://github.com/neilmistry/hpart