Predicting Boston Housing Prices

Model Evaluation & Validation

John Kinstler

Project files at https://github.com/m00nd00r/Boston-Housing

The dataset can be found here, which is provided by the UCI Machine Learning Repository. This project evaluates several machine learning algorithms for their performance in predicting housing prices for data taken from the Boston Area.

Install

This project requires Python 2.7 and the following Python libraries installed:

You will also need to have software installed to run and execute a Jupyter Notebook

If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included. Make sure that you select the Python 2.7 installer and not the Python 3.x installer.

Getting Started

This project evaluates the performance and predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston, Massachusetts. A model trained on this data that is seen as a good fit could then be used to make certain predictions about a home — in particular, its monetary value.

The dataset for this project originates from the UCI Machine Learning Repository. The Boston housing data was collected in 1978 and each of the 506 entries represent aggregated data about 14 features for homes from various suburbs in Boston, Massachusetts. For the purposes of this project, the following preprocessing steps have been made to the dataset:

  • 16 data points have an 'MEDV' value of 50.0. These data points likely contain missing or censored values and have been removed.
  • 1 data point has an 'RM' value of 8.78. This data point can be considered an outlier and has been removed.
  • The features 'RM', 'LSTAT', 'PTRATIO', and 'MEDV' are essential. The remaining non-relevant features have been excluded.
  • The feature 'MEDV' has been multiplicatively scaled to account for 35 years of market inflation.

Run the code cell below to load the Boston housing dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs
    
# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
    
# Success
print "Boston housing dataset has {} data points with {} variables each.".format(*data.shape)
Boston housing dataset has 489 data points with 4 variables each.

Data Exploration

In this first section of the project, I will make a cursory investigation of the Boston housing data and provide a few observations.

Since the main goal of this project is to construct a working model which has the capability of predicting the value of houses, we will need to separate the dataset into features and the target variable. The features, 'RM', 'LSTAT', and 'PTRATIO', give us quantitative information about each data point. The target variable, 'MEDV', will be the variable we seek to predict. These are stored in features and prices, respectively. Implementation: Calculate Statistics

In the code cell below, I calculate the minimum, maximum, mean, median, and standard deviation of 'MEDV', which is stored in prices and store each calculation in their respective variable.

In [2]:
# Minimum housing value in the dataset
minimum_price = np.min(prices)

# Maximum housing value in the dataset
maximum_price = np.max(prices)

# Mean house value of the dataset
mean_price = np.mean(prices)

# Median house value of the dataset
median_price = np.median(prices)

# Standard deviation of housing values of the dataset
std_price = np.std(prices)

# Show the calculated statistics
print "Boston Housing dataset statistics:\n"
print "Minimum house price:", minimum_price
print "Maximum house price:", maximum_price
print "Mean house price: {0:.3f}".format(mean_price)
print "Median house price:", median_price
print "Standard deviation of house price: {0:.3f}".format(std_price)
Boston Housing dataset statistics:

Minimum house price: 105000.0
Maximum house price: 1024800.0
Mean house price: 454342.945
Median house price: 438900.0
Standard deviation of house price: 165171.132

Feature Analysis

I am using three features from the Boston housing dataset: 'RM', 'LSTAT', and 'PTRATIO'. For each data point (neighborhood):

  • 'RM' is the average number of rooms among homes in the neighborhood.
  • 'LSTAT' is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
  • 'PTRATIO' is the ratio of students to teachers in primary and secondary schools in the neighborhood.

With a rudimentary knowledge of housing market, I would guess that an increase in 'RM' would lead to an increase in 'MEDV'. Likewise, an increase in 'LSTAT' would lead to a decrease in 'MEDV'. Finally, an increase in 'PTRATIO' would likely see an increase in 'MEDV'.

Developing a Model

Implementation: Define a Performance Metric

It is difficult to measure the quality of a given model without quantifying its performance over training and testing. This is typically done using some type of performance metric, whether it is through calculating some type of error, the goodness of fit, or some other useful measurement. For this project, I will be calculating the coefficient of determination, R2, to quantify the model's performance. The coefficient of determination for a model is a useful statistic in regression analysis, as it often describes how "good" that model is at making predictions.

The values for R2 range from 0 to 1, which captures the percentage of squared correlation between the predicted and actual values of the target variable. A model with an R2 of 0 is no better than a model that always predicts the mean of the target variable, whereas a model with an R2 of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what percentage of the target variable, using this model, can be explained by the features. A model can be given a negative R2 as well, which indicates that the model is arbitrarily worse than one that always predicts the mean of the target variable.

For the performance_metric function in the code cell below, I will implement the following:

  • Use r2_score from sklearn.metrics to perform a performance calculation between y_true and y_predict.
  • Assign the performance score to the score variable.
In [3]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    # TODO: Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    
    # Return the score
    return score

Goodness of Fit

Let's assume that a dataset contains five data points and a model made the following predictions for the target variable:

True Value Prediction
3.0 2.5
-0.5 0.0
2.0 2.1
7.0 7.8
4.2 5.3

Would this model have successfully captured the variation of the target variable? I would guess yes, because the predicted values appear to be pretty close to the true values.

The code cell below uses the performance_metric function and calculates this model's coefficient of determination.

In [4]:
# Calculate the performance of this model
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print "Model has a coefficient of determination, R^2, of {:.3f}.".format(score)
Model has a coefficient of determination, R^2, of 0.923.

As you can 0.923 is pretty close to 1.0, therefore, these predicted values come from a model that would be a good predicter of the actual values.

Implementation: Shuffle and Split the Data

The next implementation requires the Boston housing dataset be split into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.

For the code cell below, I will do the following:

  • Use train_test_split from sklearn.cross_validation to shuffle and split the features and prices data into training and testing sets.
  • Split the data into 80% training and 20% testing.
  • Set the random_state for train_test_split. This ensures results are consistent.
  • Assign the train and testing splits to X_train, X_test, y_train, and y_test.
In [5]:
from sklearn.model_selection import train_test_split

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size = 0.2, random_state = 0)

# Success
print "Training and testing split was successful."
Training and testing split was successful.

I'm splitting the data so that I can evaluate whether my learner is doing a good job at predicting the outcome. To do this I need to save off a portion of the data that the learner hasn't been modelled to yet. This test data will be used to evaluate the performance of the learner that was just created to determine if any adjustments need to be made to improve performance.

This is especially important in regards to preventing over-fitting the learner to the data. This happens when an learner does too good a job at fitting to a dataset; so good that it does poorly when provided with data that falls outside the statistics of the data that it trained on. In this case, new data outside the statistics of the training data will yield much higher prediction errors.

To avoid this I'll split the datasets to compare the training error to the testing error so that I can find the best balance between minimizing the difference between the errors as well as minimizing the overall value of the errors.

Analyze Model Performance

In this third section of the project, I'll take a look at a Decision Tree Regressor's learning and testing performances on various subsets of training data. Additionally, I'll investigate one particular algorithm with an increasing max_depth parameter on the full training set to observe how model complexity affects performance. Graphing the model's performance based on varying criteria can be beneficial in the analysis process, such as visualizing behavior that may not have been apparent from the results alone.

Learning Curves

The following code cell produces four graphs for a decision tree model with different maximum depths. Each graph visualizes the learning curves of the model for both training and testing as the size of the training set is increased. Note that the shaded region of a learning curve denotes the uncertainty of that curve (measured as the standard deviation). The model is scored on both the training and testing sets using R2, the coefficient of determination.

In [6]:
# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(X_train, y_train)

Learning the Data

Each graph above shows the number of training points used along the x-axis with the score for the model with that number of training points on the y-axis using the R2 coefficient of determination metric to evaluate it's performance.

In the first graph we see that the training and testing sets for a Decision Tree with a depth of 1 branch converge to an R2 of somewhere between 0.4 and 0.5. We also see that the training and testing models start converge around 50 data points for the training set. They get closest around 300 data points with a score of around 0.45. Because they converged so quickly, this tells us that there the model has high bias in that there isn't enough complexity in it to account for all of variation in the data, thus yielding a low score.

The graph to it's right shows the training and testing sets for a Decision tree with a maximum depth of 3 branches. Here we can see that the training and testing sets converge a bit more slowly, getting closest again near 300. However, the score is much higher at around 0.8. Also beyond this number of training points, we don't see much improvement. This appears to be a pretty good model in that there's enough complexity in the model to fit the data pretty closely, but not so much complexity that it can't account for large enough percentage of the data to yield a very score. This looks to be our best model so far.

Looking at the lower left graph for a Decision Tree with 6 branches we can see that the training data has a relatively high score near 0.9, but that the testing data seems to reach a limit at around 0.8. That lack of convergence indicates a high variance in the model in that it's too complex to account of data outside it's training data, yielding much higher discrepancies in it's training versus testing. This means it won't predict very well on new data.

We can see that the high variance problem only gets worth with more branches in the final graph on the lower right.

So far, my best guess for the Decision Tree Regressor is to use 3 branches for an optimal model.

Complexity Curves

The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph produces two complexity curves — one for training and one for validation. Similar to the learning curves, the shaded regions of both the complexity curves denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the performance_metric function.

In [7]:
vs.ModelComplexity(X_train, y_train)

In the above graph we can see a validation of the Decision Tree Regressor with 4 branches as the optimal max_depth for this data set. The training and testing score curves start diverge right at a maximum depth of 4 branches.

We can also see that at lower max depths there is lower performance and convergence between the train and test sets indicating high bias. However, as the depth approaches 3 we see the performance in increasing score start slowing. After 4, with increasing complexity in the model we see the test scores start to decrease and get further away from the scores for the train scores. Beyond 4 branches we see the problem with high variance from too much model complexity where the model is over-fitting the training data and underperforming on the testing data.

Evaluating Model Performance

In this final section of the project, I will construct a model and make a prediction on the client's feature set using an optimized model from fit_model.

For many of the models that are typically used for learning different datasets they have parameters that are internal to the models themselves that aren't directly learnt within the estimators from the datasets. Some of these models can have several hyper-parameters all of which need to be independently set to affect the performance of the model.

We'll need a methodical approach to tuning these hyper-parameters to get the best possible performance out of the learning model.

For this methodology I will use GridSearchCV along with k-folds cross-validation to scan across various ranges of hyper-parameters with varying training and testing set splits to find the best hyper-parameter settings to get the best results.

When evaluating different settings (“hyperparameters”) for estimators, such as the max_depth for the Decision Tree Regressor, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

  • A model is trained using k-1 of the folds as training data;
  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute the performance measure R2).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data.

In [8]:
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a 
        decision tree regressor trained on the input data [X, y]. """
    
    # Create cross-validation sets from the training data
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0)

    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    max_depth = np.arange(1,11)
    params = {'max_depth': max_depth}

    # Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric)

    # Create the grid search object
    grid = GridSearchCV(regressor, param_grid = params, cv = cv, scoring = scoring_fnc)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_

Making Predictions

Once a model has been trained on a given set of data, it can now be used to make predictions on new sets of input data. In the case of a decision tree regressor, the model has learned what the best questions to ask about the input data are, and can respond with a prediction for the target variable. We can use these predictions to gain information about data where the value of the target variable is unknown — such as data the model was not trained on.

Optimal Model

In [9]:
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)

# Produce the value for 'max_depth'
print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth'])
Parameter 'max_depth' is 4 for the optimal model.

Predicting Selling Prices

Imagine that you were a real estate agent in the Boston area looking to use this model to help price homes owned by your clients that they wish to sell. You have collected the following information from three of your clients:

Feature Client 1 Client 2 Client 3
Total number of rooms in home 5 rooms 4 rooms 8 rooms
Neighborhood poverty level (as %) 17% 32% 3%
Student-teacher ratio of nearby schools 15-to-1 22-to-1 12-to-1

What price would you recommend each client sell his/her home at? Do these prices seem reasonable given the values for the respective features?

In [10]:
pred = reg.predict(X_test)
score = performance_metric(pred,y_test)
print "R^2 score for test data set: {:.4f}".format(score)
R^2 score for test data set: 0.7446
In [11]:
# Produce a matrix for client data
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

# Show predictions
pred = reg.predict(client_data)
for i, price in enumerate(pred):
    #print performance_metric(pred[i])
    print "Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price)
Predicted selling price for Client 1's home: $391,183.33
Predicted selling price for Client 2's home: $189,123.53
Predicted selling price for Client 3's home: $942,666.67

From the test data above, we can see that the house with the most number of rooms, the least amount of surrounding poverty and the lowest student-teacher ratio should have the highest price, which it does.

Conversely, the house with the least rooms, highest neighborhood poverty and highest student-teacher ration should have the lowest price, again confirmed.

The above results match our training data intuitions quite well, and confirms that relatively speaking we can expect useful results from this predictor.

Sensitivity

An optimal model is not necessarily a robust model. Sometimes, a model is either too complex or too simple to sufficiently generalize to new data. Sometimes, a model could use a learning algorithm that is not appropriate for the structure of the data given. Other times, the data itself could be too noisy or contain too few samples to allow a model to adequately capture the target variable — i.e., the model is underfitted. The code cell below will run the fit_model function ten times with different training and testing sets to see how the prediction for a specific client changes with the data it's trained on.

In [12]:
vs.PredictTrials(features, prices, fit_model, client_data)
Trial 1: $391,183.33
Trial 2: $419,700.00
Trial 3: $415,800.00
Trial 4: $420,622.22
Trial 5: $418,377.27
Trial 6: $411,931.58
Trial 7: $399,663.16
Trial 8: $407,232.00
Trial 9: $351,577.61
Trial 10: $413,700.00

Range in prices for client 1: $69,044.61

Trial 1: $189,123.53
Trial 2: $287,100.00
Trial 3: $236,478.26
Trial 4: $235,122.22
Trial 5: $227,460.00
Trial 6: $235,620.00
Trial 7: $238,132.08
Trial 8: $229,200.00
Trial 9: $243,857.89
Trial 10: $228,385.71

Range in prices for client 2: $97,976.47

Trial 1: $942,666.67
Trial 2: $927,500.00
Trial 3: $888,720.00
Trial 4: $896,280.00
Trial 5: $854,700.00
Trial 6: $922,740.00
Trial 7: $896,962.50
Trial 8: $979,300.00
Trial 9: $904,718.18
Trial 10: $941,220.00

Range in prices for client 3: $124,600.00

Applicability

  • How relevant today is data that was collected from 1978?
  • Are the features present in the data sufficient to describe a home?
  • Is the model robust enough to make consistent predictions?
  • Would data collected in an urban city like Boston be applicable in a rural city?

Questions like these are highly relevant to doing this type of analysis. Housing markets are extremely flexible and subject to quite a lot of variation over time that may not reflect underlying inflation in the broader economy, which is the assumption that was used to change the MEDVprices to reflect today's housing prices in the Boston area. So it's very likely that this data wouldn't be very robust to today's housing market in the Boston area.

Addionally, the are almost certainly other features of a house that should intuitively be more predictive of a house's value like square footage, number of bathrooms, proximity to waterfronts, number of levels, etc.

However, given those limitations, this does appear to give us a good process by which we can find good predictions with a bit more data that has more features that was more recently collected.

Finally, housing markets vary in the US quite a lot from region to region. So we should expect our model to only be predictive in the region that we modelled on.