## Introduction to linear regression

In this Jupyter notebook we start with a very simple problem: predicting a person's height from their weight, age and sex. We will work through:

- Simple linear model
- Linear model with non-linear interactions
- Random forest
- Grid search to find the best parameters

```
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import style
style.use('fivethirtyeight')
```

Import the data, look at the available fields, and inspect the dataframe.

```
data = pd.read_csv('dataset/Howell1.csv', sep=';')
```

```
data.head()
```
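
Beyond `head()`, a couple of standard pandas calls give a quick picture of the fields. The sketch below uses a tiny hand-made frame (`demo`, with placeholder values that are not rows from the Howell data) so that it runs on its own; with the real file one would call the same methods on `data`.

```python
import pandas as pd

# Placeholder rows for illustration only; not taken from Howell1.csv.
demo = pd.DataFrame({
    'height': [150.0, 120.5, 165.2],
    'weight': [45.0, 22.3, 58.1],
    'age': [30.0, 8.0, 41.0],
})

print(demo.dtypes)             # column types
missing = demo.isna().sum()    # number of missing values per column
summary = demo.describe()      # count, mean, std, min, quartiles, max
print(missing)
print(summary)
```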

## Visualize the data

```
data.plot(x='age', y='height', kind='scatter')
```

## Fit Regression Line

```
import seaborn as sns
sns.lmplot(x='age',y='height',data=data,fit_reg=True)
```

As the plot shows, height increases from age 0 to about 20 and then tends to stabilize. During the growth phase a linear model fits easily, but after that the data shows no linear relationship between age and height.

A linear model would therefore not perform well on the whole dataset. If we consider only the ages from 0 to 20, a linear model does a much better job of fitting the data.

## Fit a linear model for age < 20

```
sns.lmplot(x='age',y='height',data=data[data.age < 20],fit_reg=True)
```
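
To put a number on the visual impression from the two plots, one option is to compare the R² of a straight-line fit on all ages with the fit restricted to age < 20. The sketch below does this on synthetic data that only mimics the "grow, then plateau" shape described above; the constants (65, 4.5, the noise level) and the helper `r_squared` are assumptions made for illustration, not values estimated from the Howell data.

```python
import numpy as np

# Synthetic stand-in for the growth pattern (NOT the Howell data):
# height rises linearly until age 20, then stays flat, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(0, 80, size=500)
height = 65 + 4.5 * np.minimum(age, 20) + rng.normal(0, 5, size=500)

def r_squared(x, y):
    """R^2 of a least-squares straight line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

young = age < 20
r2_all = r_squared(age, height)
r2_young = r_squared(age[young], height[young])
print(f"R^2, all ages: {r2_all:.2f}")
print(f"R^2, age < 20: {r2_young:.2f}")
```

On data of this shape the restricted fit scores clearly higher, which is the same effect the `lmplot` calls show graphically.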

## Simple Linear model

```
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
```

```
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)
fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))
```

As we can see above, the error is high: the predictions are off by quite a bit.
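
For reference, the mean squared error is simply the average of the squared differences between measured and predicted values. A minimal sketch with made-up numbers (the helper `mse` and the three sample values are illustrative, not notebook output):

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up heights in cm, for illustration only.
measured = [150.0, 160.0, 170.0]
predicted_demo = [155.0, 150.0, 170.0]

# Squared errors are 25, 100 and 0, so the mean is 125 / 3.
demo_error = mse(measured, predicted_demo)
print(demo_error)
```

Because the errors are squared, the result is in squared units (cm² here), and a few large misses dominate the total.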

Let's try to help the linear model by adding more features.

- As the data suggests, a natural first candidate is an indicator feature for age < 20.

```
data['age_less_than_20'] = (data.age<20).astype(int)
```

```
data.head()
```
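
An indicator column on its own only lets the linear model shift its intercept for the under-20 group; the slope with respect to age stays the same for everyone. One possible extension, not used in this notebook, is an interaction term `age * indicator`, which allows a separate slope for that group. The sketch below compares the three variants on synthetic data shaped like the growth curve; the data, the constants and the helper `fit_mse` are assumptions for illustration, and the errors are in-sample rather than cross-validated.

```python
import numpy as np

# Synthetic "grow, then plateau" data (NOT the Howell data).
rng = np.random.default_rng(0)
age = rng.uniform(0, 80, size=500)
height = 65 + 4.5 * np.minimum(age, 20) + rng.normal(0, 5, size=500)
under_20 = (age < 20).astype(float)

def fit_mse(*columns):
    """In-sample MSE of an ordinary least-squares fit on the given columns."""
    X = np.column_stack([np.ones_like(age), *columns])  # intercept + features
    coef, *_ = np.linalg.lstsq(X, height, rcond=None)
    return np.mean((height - X @ coef) ** 2)

mse_age = fit_mse(age)                                    # age only
mse_flag = fit_mse(age, under_20)                         # + indicator
mse_interaction = fit_mse(age, under_20, age * under_20)  # + interaction
print(mse_age, mse_flag, mse_interaction)
```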

Now let's fit the model again with this new feature.

```
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)
fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))
```

We can see that the error decreased, but not by much. Let's go ahead and add two more features: the square of the age and the square of the weight.

```
data['squared_age'] = data['age'] ** 2
```

```
data['squared_weight'] = data['weight'] ** 2
```

```
data.head()
```

```
lr = LinearRegression()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, train, target, cv=10)
fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))
```

Great! We were able to reduce the error from 89 to 28.5 by adding higher-order features. One thing to note is that model complexity increases as we add more higher-order features.
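
The squared columns above were built by hand. scikit-learn's `PolynomialFeatures` can generate such terms automatically, and counting its output columns gives a feel for how quickly complexity grows with the degree. This is a sketch on two made-up rows of (age, weight); note that, unlike the notebook, `PolynomialFeatures` also adds cross terms such as age × weight.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up rows of (age, weight); placeholder values only.
X_demo = np.array([[10.0, 30.0],
                   [40.0, 60.0]])

# Count the columns produced for each polynomial degree.
n_features = {}
for degree in (1, 2, 3):
    expander = PolynomialFeatures(degree=degree, include_bias=False)
    n_features[degree] = expander.fit_transform(X_demo).shape[1]

print(n_features)
```

With only two inputs the growth is modest, but with more input columns the number of generated terms rises quickly, which is one reason to add higher-order features selectively.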

## Random forest

We can use a random forest regressor, which can fit the data without needing any higher-order features.

```
data = pd.read_csv('dataset/Howell1.csv', sep=';')
```

```
data.head()
```

```
from sklearn.ensemble import RandomForestRegressor
# Named `rf` rather than `lr`: this is a random forest, not a linear regression.
rf = RandomForestRegressor()
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(rf, train, target, cv=10)
fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))
```

This is the lowest mean squared error so far, and it was obtained without any higher-order features.

```
params = {
'n_estimators': [3, 5, 10, 20, 50],
'max_depth': [3, 5, 7, 9],
'min_samples_leaf' : [1, 2, 3, 4, 5]
}
```

Use `GridSearchCV` to find the best parameters.

```
# The sklearn.grid_search module was removed in scikit-learn 0.20;
# GridSearchCV now lives in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold
```

```
grid = GridSearchCV(estimator=RandomForestRegressor(), param_grid=params, cv=5, verbose=1)
```

```
grid.fit(train, target)
```

```
grid.best_estimator_
```

```
grid.best_params_
```

```
grid.best_score_
```
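
One caveat when reading `best_score_`: when no `scoring` argument is given, `GridSearchCV` uses the estimator's own `score` method, which for regressors is R², so the number is not directly comparable to the mean squared errors printed elsewhere in this notebook. A sketch of a search scored by MSE instead, run on synthetic data with a deliberately small grid (`X_demo`, `y_demo` and `small_grid` are placeholders, not the notebook's data or grid):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic "grow, then plateau" data (NOT the Howell data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 80, size=(200, 1))
y_demo = 65 + 4.5 * np.minimum(X_demo[:, 0], 20) + rng.normal(0, 5, size=200)

# Small grid so the sketch runs quickly.
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=small_grid,
    cv=5,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes, hence the sign
)
search.fit(X_demo, y_demo)

best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(search.best_params_, best_mse)
```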

Use the best params from the grid search above

```
# Hyperparameters copied by hand from the grid search run above.
rf = RandomForestRegressor(n_estimators=20, min_samples_leaf=2, max_depth=5)
train = data.loc[:, data.columns != 'height']
target = data.height
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(rf, train, target, cv=10)
fig, ax = plt.subplots()
ax.scatter(target, predicted, edgecolors=(0, 0, 0))
ax.plot([target.min(), target.max()], [target.min(), target.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
error = mean_squared_error(target, predicted)
print("MEAN Squared Error : {}. (Lower the better)".format(error))
```

With the random forest we get the lowest mean squared error, about 20 cm². Taking the square root gives a root mean squared error of sqrt(20) ≈ 4.47 cm, so the predictions are typically within about 4.5 cm of the true height.
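
The same conversion can be applied to the other errors quoted in this notebook to express each in centimetres. A small sketch; the dictionary labels are descriptive placeholders, and the values are the rounded figures mentioned in the text rather than freshly computed results.

```python
import math

# Mean squared errors quoted in the text above, in cm^2 (rounded).
reported_mse = {
    'linear model before squared features': 89.0,
    'linear model after squared features': 28.5,
    'random forest (tuned)': 20.0,
}

# RMSE is the square root of MSE and is back in the units of the target (cm).
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f} cm")
```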