Table of Contents

Data Description

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Task: Estimate the sale price from the given features.

Notes for Linear Regression

General Modelling Tips

# plain sklearn (default joblib backend) is slower
from sklearn.model_selection import GridSearchCV

# note: GridSearchCV takes no random_state argument; for regression use a
# metric such as 'r2' or 'neg_mean_squared_error' instead of 'accuracy'
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1,
                           scoring='r2', verbose=2)

grid_search.fit(Xtrain, ytrain)


# using the dask backend is faster (requires a running dask.distributed cluster)
from dask.distributed import Client
import joblib

client = Client()  # start a local Dask cluster

with joblib.parallel_backend('dask'):
    grid_search.fit(Xtrain, ytrain)

Imports

Useful Scripts

Load the data

Simple Linear Regression

A simple linear model has only one feature and one target. Here our target is price. The correlation plot shows that sqft_living is the feature most strongly correlated with price, so I will build a simple linear regression of price on sqft_living.

$$ h_{\theta}(X)=\theta_{0}+\theta_{1} x $$
theta_0 = lr.intercept_   # intercept
theta_1 = lr.coef_[0]     # slope for sqft_living (coef_ is a 1-element array)
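
A minimal sketch of the workflow the subsections below walk through (train-test split, modelling, model weights, prediction, evaluation), assuming df is the King County DataFrame loaded earlier:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# df is assumed to be the loaded King County data
X = df[['sqft_living']]   # single feature
y = df['price']           # target

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LinearRegression().fit(Xtrain, ytrain)

theta_0 = lr.intercept_   # theta_0
theta_1 = lr.coef_[0]     # theta_1

ypred = lr.predict(Xtest)
print('R^2 :', r2_score(ytest, ypred))
print('RMSE:', mean_squared_error(ytest, ypred) ** 0.5)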

Train-test split

Modelling

Model weights

Prediction

Evaluation

Prediction visualization

Feature Selection

Multiple Linear Regression

When we have more than one feature to estimate the target, it is called multiple linear regression. The equation of the model is given below:

$$ h_{\theta}(X)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n} $$
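
A hedged sketch of fitting this with sklearn (the feature list is an assumed subset of the King County columns; df and the split follow the earlier setup):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'sqft_lot']  # assumed subset
X, y = df[features], df['price']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

mlr = LinearRegression().fit(Xtrain, ytrain)
print(mlr.intercept_)                  # theta_0
print(dict(zip(features, mlr.coef_)))  # theta_1 ... theta_n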

For multiple linear regression we also have the adjusted R-squared, which accounts for the additional number of features used.

$$ \overline{R^{2}}=R^{2}-\frac{k-1}{n-k}\left(1-R^{2}\right) $$

where n is the number of observations and k is the number of parameters (including the intercept).
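
A small sketch of this formula in code (r2 would come from sklearn's r2_score on the test set):

def adjusted_r2(r2, n, k):
    """Adjusted R^2: n = number of observations, k = number of parameters (incl. intercept)."""
    return r2 - (k - 1) / (n - k) * (1 - r2)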

Multiple Linear Regression - Some processed features

Multiple Linear Regression - Many processed features

All raw features + age_binned + age_renovated_binned

Multiple Linear Regression - Ridge Regularization L2

Popular regularization methods: Ridge (L2), Lasso (L1), and Elastic Net (a combination of L1 and L2).

Ridge regression uses L2 regularization: adding an L2 penalty on the coefficients to the residual sum of squares gives

$$ RSS_{RIDGE}=\sum_{i=1}^{m}\left(h_{\theta}\left(x_{i}\right)-y_{i}\right)^{2}+\alpha \sum_{j=1}^{n} \theta_{j}^{2} $$
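
A hedged sketch with sklearn's Ridge (alpha sets the penalty strength; scaling the features first is assumed here because the L2 penalty is not scale-invariant):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(Xtrain, ytrain)
print(ridge.score(Xtest, ytest))  # R^2 on the test split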

Multiple Linear Regression - Lasso Regularization L1

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).

The optimization objective for Lasso is:

$$ \frac{1}{2\,n_{\text{samples}}}\|y-Xw\|_{2}^{2}+\alpha\|w\|_{1} $$
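
A hedged sketch with sklearn's Lasso (scaled features assumed again; the L1 penalty drives some coefficients exactly to zero, which acts as implicit feature selection):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10_000))
lasso.fit(Xtrain, ytrain)
coefs = lasso.named_steps['lasso'].coef_
print((coefs != 0).sum(), 'features kept out of', coefs.size)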

Polynomial Regression
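
The variants below all follow the same pattern; a hedged sketch of the degree-2 setup with sklearn, where PolynomialFeatures expands the inputs and LinearRegression fits on the expanded design matrix:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly2 = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
poly2.fit(Xtrain, ytrain)
print(poly2.score(Xtest, ytest))  # R^2 on the test split

For the regularized variants, swap LinearRegression() for Ridge(alpha=...) or Lasso(alpha=...).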

Polynomial Regression - deg = 2 few raw features

Polynomial Regression - deg = 3 few raw features

Polynomial Regression - deg = 2 all raw features

Polynomial Regression - deg = 3 all raw features

Polynomial Regression - deg = 2 many processed features

Polynomial Regression - deg = 2 many processed features Ridge alpha = 1

Polynomial Regression - deg = 2 many processed features Ridge alpha = 50_000

Polynomial Regression - deg = 2 many processed features Lasso alpha = 1

Polynomial Regression - deg = 2 many processed features Lasso alpha = 50_000

Linear Model LassoLarsCV

Notes

The object solves the same problem as the LassoCV object. However, unlike the LassoCV, it finds the relevant alpha values by itself. In general, because of this property, it will be more stable. However, it is more fragile to heavily multicollinear datasets.

It is more efficient than the LassoCV if only a small number of features are selected compared to the total number, for instance if there are very few samples compared to the number of features.
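
A hedged usage sketch (LassoLarsCV chooses alpha by cross-validation along the LARS path):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsCV

llcv = make_pipeline(StandardScaler(), LassoLarsCV(cv=5))
llcv.fit(Xtrain, ytrain)
print(llcv.named_steps['lassolarscv'].alpha_)  # alpha selected by CV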

Summary

Feature Importance for Lasso Regression

Best Model So Far

Best model: Polynomial Regression deg=2, all features, unprocessed, no regularization

Transform Target Regressor
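
A hedged sketch of sklearn's TransformedTargetRegressor, which fits the regressor on a transformed target and inverts the transform at prediction time; the log1p/expm1 pair is an assumption that suits right-skewed prices:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log1p, inverse_func=np.expm1)
ttr.fit(Xtrain, ytrain)
print(ttr.score(Xtest, ytest))  # R^2 in the original price scale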