# Predicting the height of a user

## Introduction to linear regression

In this Jupyter notebook we start with a simple problem: predicting a person's height from their weight, age, and sex. We will work through:

- A simple linear model
- A linear model with non-linear interactions
- A random forest
- Grid search to find the best parameters
In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import style
style.use('fivethirtyeight')


Import the data, inspect the existing fields, and analyze the dataframe.

In [2]:
data = pd.read_csv('dataset/Howell1.csv', sep=';')

In [3]:
data.head()

Out[3]:
```
    height     weight   age  male
0  151.765  47.825606  63.0     1
1  139.700  36.485807  63.0     0
2  136.525  31.864838  65.0     0
3  156.845  53.041915  41.0     1
4  145.415  41.276872  51.0     0
```

## Visualize the data

In [4]:
data.plot(x='age', y='height', kind='scatter')

Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x110f60e10>

## Fit a Regression Line

In [5]:
import seaborn as sns
sns.lmplot(x='age',y='height',data=data,fit_reg=True)


Out[5]:
<seaborn.axisgrid.FacetGrid at 0x110ff4e10>

As the plot shows, a person's height increases rapidly from age 0 to about 20 and then stabilizes. Over that first phase a linear model fits well, but beyond age 20 the relationship between age and height is no longer linear.

A single linear model therefore won't perform well on the full dataset. If we restrict the data to ages 0 to 20, however, a linear model does a much better job of fitting.

## Fit a linear model for age < 20

In [6]:
sns.lmplot(x='age',y='height',data=data[data.age < 20],fit_reg=True)

Out[6]:
<seaborn.axisgrid.FacetGrid at 0x114fbd898>

## Simple Linear Model

In [7]:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

In [8]:
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as y where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)

fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))

MEAN Squared Error : 89.62222301721019. (Lower the better)


As we can see above, the error is quite high: the predictions are off by a fair amount.

Let's help the linear model by adding more features.

- Based on the plot, a natural new feature is an indicator for age < 20
In [9]:
data['age_less_than_20'] = (data.age<20).astype(int)

In [10]:
data.head()

Out[10]:
```
    height     weight   age  male  age_less_than_20
0  151.765  47.825606  63.0     1                 0
1  139.700  36.485807  63.0     0                 0
2  136.525  31.864838  65.0     0                 0
3  156.845  53.041915  41.0     1                 0
4  145.415  41.276872  51.0     0                 0
```

Now let's fit the model again with this new feature.

In [11]:
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as y where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)

fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))

MEAN Squared Error : 85.68689844692364. (Lower the better)


The error decreased, but not by much. Let's go ahead and add the squares of age and weight as features.

In [12]:
data['squared_age'] = data['age'] ** 2

In [13]:
data['squared_weight'] = data['weight'] ** 2

In [14]:
data.head()

Out[14]:
```
    height     weight   age  male  age_less_than_20  squared_age  squared_weight
0  151.765  47.825606  63.0     1                 0       3969.0     2287.288637
1  139.700  36.485807  63.0     0                 0       3969.0     1331.214076
2  136.525  31.864838  65.0     0                 0       4225.0     1015.367901
3  156.845  53.041915  41.0     1                 0       1681.0     2813.444694
4  145.415  41.276872  51.0     0                 0       2601.0     1703.780162
```
In [15]:
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as y where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)

fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))

MEAN Squared Error : 28.509278178481857. (Lower the better)


Great! We reduced the error from about 89 to 28.5 by adding higher-order features. One thing to note is that model complexity increases as we add more of them.
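Squaring columns by hand scales poorly as the feature set grows. As a side note, scikit-learn's `PolynomialFeatures` can generate the same higher-order terms (plus interaction terms) in one step; a minimal sketch with illustrative values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two example rows of [age, weight] (values illustrative)
X = np.array([[63.0, 47.8],
              [41.0, 53.0]])

# degree=2 expands each row to: age, weight, age^2, age*weight, weight^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)   # (2, 5)
print(X_poly[0, 2])   # 3969.0, i.e. 63.0 ** 2
```

This produces the interaction term `age * weight` as well, which the manual approach above skipped.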

## Random Forest

We can use a random forest regressor, which can fit the data without needing any higher-order features.

In [16]:
data = pd.read_csv('dataset/Howell1.csv', sep=';')

In [17]:
data.head()

Out[17]:
```
    height     weight   age  male
0  151.765  47.825606  63.0     1
1  139.700  36.485807  63.0     0
2  136.525  31.864838  65.0     0
3  156.845  53.041915  41.0     1
4  145.415  41.276872  51.0     0
```
In [18]:
from sklearn.ensemble import RandomForestRegressor
lr = RandomForestRegressor()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as y where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)

fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))

MEAN Squared Error : 24.180449739870596. (Lower the better)


This is the lowest mean squared error so far, without using any higher-order features.

In [19]:
params = {
    'n_estimators': [3, 5, 10, 20, 50],
    'max_depth': [3, 5, 7, 9],
    'min_samples_leaf': [1, 2, 3, 4, 5]
}


Use `GridSearchCV` to find the best parameters.

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, KFold


In [21]:
grid = GridSearchCV(estimator=RandomForestRegressor(), param_grid=params, cv=5, verbose=1)

In [22]:
grid.fit(train, target)

Fitting 5 folds for each of 100 candidates, totalling 500 fits

[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   10.4s finished

Out[22]:
GridSearchCV(cv=5, error_score='raise',
estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'n_estimators': [3, 5, 10, 20, 50], 'min_samples_leaf': [1, 2, 3, 4, 5], 'max_depth': [3, 5, 7, 9]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)
In [23]:
grid.best_estimator_

Out[23]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=5, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
In [24]:
grid.best_params_

Out[24]:
{'max_depth': 5, 'min_samples_leaf': 5, 'n_estimators': 50}
In [25]:
grid.best_score_

Out[25]:
0.9633401674741805
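Because `best_params_` is a plain dictionary, it can be unpacked straight into the constructor to refit with exactly the winning combination. A self-contained sketch on synthetic data (the notebook's `train`/`target` aren't reproduced here, so the data below is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.uniform(0, 80, size=(200, 1))                  # stand-in for the age column
y = 50 + 5 * np.sqrt(X[:, 0]) + rng.normal(0, 2, 200)  # height-like target

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={'max_depth': [3, 5], 'n_estimators': [10, 20]},
                    cv=3)
grid.fit(X, y)

# unpack the winning parameters directly into a fresh regressor
best_rf = RandomForestRegressor(random_state=0, **grid.best_params_)
best_rf.fit(X, y)
```

Alternatively, since `refit=True` by default, `grid.best_estimator_` is already a fitted model with those parameters.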

Refit the random forest with parameters informed by the grid search above (note the values below differ slightly from `best_params_`).

In [26]:
lr = RandomForestRegressor(n_estimators=20, min_samples_leaf=2, max_depth=5)
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as y where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)

fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))

MEAN Squared Error : 19.702636480588406. (Lower the better)


With the random forest we get the lowest mean squared error, about 19.7 (in squared centimetres). Taking the square root, sqrt(19.7) ≈ 4.4, so on average the predictions land within roughly 4.4 cm of the true height.
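To make the unit conversion explicit: the root mean squared error is the square root of the MSE, so it is in the same units as height. Using the final score printed above:

```python
import numpy as np

mse = 19.702636480588406  # final cross-validated MSE from the cell above
rmse = np.sqrt(mse)
print(round(rmse, 2))     # typical prediction error, in cm
```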