Data Description

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Dependent variable: 1 (price)
Features: 19 home features
Id: 1 house ID
Task: Estimate the price from the given features.

Business Problem

Task: Predict house prices from the King County (Seattle) house price data.
Metric: RMSE
Tools: Use the Python module PyCaret for regression.
Question: What is the price of a new house?
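
Since RMSE is the chosen metric, a minimal sketch of how it is computed may help. The function name `rmse` and the sample prices are illustrative assumptions, not values from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices (USD), for illustration only.
actual_prices = [300000.0, 450000.0, 520000.0]
predicted_prices = [310000.0, 440000.0, 500000.0]
example_rmse = rmse(actual_prices, predicted_prices)
```

Because the errors are squared before averaging, one badly mispriced house raises RMSE more than several small misses of the same total size.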

Introduction to PyCaret

PyCaret is a high-level Python module that solves a machine learning problem in very few lines of code. It is useful for projects with very tight time constraints.

It has modules such as anomaly, classification, clustering, datasets, nlp, preprocess and regression.

Some resources for pycaret:

Imports

Useful Functions

Load the data

Train test split

PyCaret Setup

setup(data, target, train_size=0.7, sampling=True, sample_estimator=None,
      categorical_features=None, categorical_imputation='constant',
      ordinal_features=None, high_cardinality_features=None,
      high_cardinality_method='frequency', numeric_features=None,
      numeric_imputation='mean', date_features=None, ignore_features=None,
      normalize=False, normalize_method='zscore', transformation=False,
      transformation_method='yeo-johnson', handle_unknown_categorical=True,
      unknown_categorical_method='least_frequent', pca=False,
      pca_method='linear', pca_components=None, ignore_low_variance=False,
      combine_rare_levels=False, rare_level_threshold=0.1,
      bin_numeric_features=None, remove_outliers=False,
      outliers_threshold=0.05, remove_multicollinearity=False,
      multicollinearity_threshold=0.9, create_clusters=False, cluster_iter=20,
      polynomial_features=False, polynomial_degree=2,
      trigonometry_features=False, polynomial_threshold=0.1,
      group_features=None, group_names=None, feature_selection=False,
      feature_selection_threshold=0.8, feature_interaction=False,
      feature_ratio=False, interaction_threshold=0.01, session_id=None,
      silent=False, profile=False)

sampling: bool, default = True
When the sample size exceeds 25,000 samples, PyCaret builds a base estimator at various sample sizes from the original dataset. For regression, this returns a performance plot of R2 values at various sample levels, which helps in deciding the preferred sample size for modeling.
The desired sample size must then be entered for training and validation in the PyCaret environment. When the sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.

Comparing All Models

compare_models(blacklist = None, fold = 10, round = 4,
               sort = 'R2', turbo = True)

sort: string, default = 'R2'
The scoring measure specified is used for sorting the average score grid.
Other options are 'MAE', 'MSE', 'RMSE', 'RMSLE' and 'MAPE'.
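
To make the role of the sort argument concrete, here is a small sketch of ranking a score grid by a chosen metric. The grid values and the helper `rank_models` are hypothetical and are not PyCaret's implementation; the point is only that R2 is ranked from high to low while the error metrics are ranked from low to high.

```python
# Hypothetical cross-validated scores; the numbers are invented for illustration.
score_grid = {
    "lr": {"MAE": 126000.0, "RMSE": 201000.0, "R2": 0.70},
    "rf": {"MAE": 71000.0, "RMSE": 132000.0, "R2": 0.87},
    "knn": {"MAE": 69000.0, "RMSE": 170000.0, "R2": 0.78},
}

# For R2 a larger value is better; for the error metrics a smaller value is better.
HIGHER_IS_BETTER = {"R2"}

def rank_models(grid, sort="R2"):
    """Return model names ordered from best to worst on the chosen metric."""
    reverse = sort in HIGHER_IS_BETTER
    return sorted(grid, key=lambda name: grid[name][sort], reverse=reverse)

ranking_by_r2 = rank_models(score_grid, sort="R2")
ranking_by_mae = rank_models(score_grid, sort="MAE")
```

In this toy grid the two rankings disagree at the top, which is why the metric used for sorting should match the metric the project is judged on (RMSE here).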

Create Models

Estimator                     Abbreviated String     Original Implementation 
---------                     ------------------     -----------------------
Linear Regression             'lr'                   linear_model.LinearRegression
Lasso Regression              'lasso'                linear_model.Lasso
Ridge Regression              'ridge'                linear_model.Ridge
Elastic Net                   'en'                   linear_model.ElasticNet
Least Angle Regression        'lar'                  linear_model.Lars
Lasso Least Angle Regression  'llar'                 linear_model.LassoLars
Orthogonal Matching Pursuit   'omp'                  linear_model.OrthogonalMatchingPursuit
Bayesian Ridge                'br'                   linear_model.BayesianRidge
Automatic Relevance Determ.   'ard'                  linear_model.ARDRegression
Passive Aggressive Regressor  'par'                  linear_model.PassiveAggressiveRegressor
Random Sample Consensus       'ransac'               linear_model.RANSACRegressor
TheilSen Regressor            'tr'                   linear_model.TheilSenRegressor
Huber Regressor               'huber'                linear_model.HuberRegressor 
Kernel Ridge                  'kr'                   kernel_ridge.KernelRidge
Support Vector Machine        'svm'                  svm.SVR
K Neighbors Regressor         'knn'                  neighbors.KNeighborsRegressor 
Decision Tree                 'dt'                   tree.DecisionTreeRegressor
Random Forest                 'rf'                   ensemble.RandomForestRegressor
Extra Trees Regressor         'et'                   ensemble.ExtraTreesRegressor
AdaBoost Regressor            'ada'                  ensemble.AdaBoostRegressor
Gradient Boosting             'gbr'                  ensemble.GradientBoostingRegressor 
Multi-Layer Perceptron        'mlp'                  neural_network.MLPRegressor
Extreme Gradient Boosting     'xgboost'              xgboost.readthedocs.io
Light Gradient Boosting       'lightgbm'             github.com/microsoft/LightGBM
CatBoost Regressor            'catboost'             https://catboost.ai

Hyperparameter Tuning

tune_model(estimator=None, fold=10, round=4, n_iter=10, optimize='r2', ensemble=False, method=None, verbose=True)

n_iter: integer, default = 10

optimize: string, default = 'r2'
          options: 'mae', 'mse'.
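
The sketch below illustrates what n_iter controls, assuming the tuner draws n_iter random candidates from a predefined hyperparameter grid and keeps the best one. The grid, the helper `random_search`, and the toy error function are all hypothetical stand-ins; a real run would score each candidate by cross-validation.

```python
import itertools
import random

def random_search(grid, score_fn, n_iter=10, seed=0):
    """Evaluate n_iter randomly chosen grid points and return the best one.

    score_fn returns an error to minimise (for example a cross-validated MAE).
    """
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*(grid[n] for n in names))]
    rng = random.Random(seed)
    # Never ask for more candidates than the grid contains.
    sampled = rng.sample(candidates, min(n_iter, len(candidates)))
    best = min(sampled, key=score_fn)
    return best, score_fn(best), len(sampled)

# Hypothetical grid and a toy error surface standing in for cross-validation.
param_grid = {"max_depth": [2, 4, 6, 8, 10], "learning_rate": [0.01, 0.05, 0.1, 0.2]}

def toy_error(params):
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

best_params, best_error, n_evaluated = random_search(param_grid, toy_error, n_iter=10)
```

A larger n_iter covers more of the grid and can only match or improve the best score found, at the cost of proportionally more training time.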

Model Evaluation for Train data

plot_model(estimator, plot='residuals')

Name                         Abbreviated String
---------                    ------------------
Residuals Plot               'residuals'
Prediction Error Plot        'error'
Cooks Distance Plot          'cooks'
Recursive Feat. Selection    'rfe'
Validation Curve             'vc'
Manifold Learning            'manifold'
Feature Importance           'feature'
Model Hyperparameter         'parameter'

Ensemble Modelling

ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, verbose=True)

method: 'Bagging' or 'Boosting', default = 'Bagging'

Bagging
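
As a rough illustration of the idea behind bagging, the sketch below fits a simple base model on several bootstrap resamples of the training rows and averages the predictions. The one-feature least-squares base model, the helper names, and the toy data are assumptions made for brevity; this is not how PyCaret wraps the estimator internally.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # all x equal in this resample: fall back to a constant model
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def bagged_predict(xs, ys, x_new, n_estimators=10, seed=0):
    """Average the predictions of base models fit on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        intercept, slope = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(intercept + slope * x_new)
    return sum(predictions) / len(predictions)

# Toy data: living area (sqft) vs price (USD), invented for illustration.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 300000, 400000, 500000, 600000]
bagged_price = bagged_predict(sqft, price, x_new=1800)
```

Here n_estimators plays the same role as the argument of that name in ensemble_model: it sets how many resampled base models are averaged.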

Boosting

Blending

blend_models(estimator_list='All', fold=10, round=4, turbo=True, verbose=True)
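
For regression, blending can be thought of as averaging the predictions of several trained models. The sketch below assumes equal weights; the helper `blend_predictions` and the numbers are hypothetical.

```python
def blend_predictions(per_model_predictions):
    """Average the predictions of several models, row by row, with equal weights."""
    columns = list(per_model_predictions.values())
    if not columns or len({len(c) for c in columns}) != 1:
        raise ValueError("need at least one model and equal-length prediction lists")
    return [sum(row) / len(row) for row in zip(*columns)]

# Hypothetical predictions (USD) from three base models for the same two houses.
base_predictions = {
    "lr": [300000.0, 480000.0],
    "rf": [330000.0, 510000.0],
    "knn": [315000.0, 495000.0],
}
blended = blend_predictions(base_predictions)
```

Averaging tends to help most when the individual models make different kinds of errors, so that over- and under-predictions partly cancel.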

Stacking

Stacking is another popular ensembling technique, but it is implemented less often because of practical difficulties. It combines multiple models via a meta-model: several models are trained to predict the outcome, and a meta-model then uses their predictions as input, along with the original features.

Which method and models to use in stacking depends on the statistical properties of the dataset. Experimenting with different models and methods is the best way to find the configuration that works best. However, as a general rule of thumb, models with strong yet diverse performance tend to improve results when used in stacking. One way to measure diversity is the correlation of predictions between models. You can analyze this using the plot parameter.
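
The diversity check described above can be sketched by computing the pairwise Pearson correlation between the predictions of candidate base models. The predictions below are invented, and the helper `pearson` is an illustrative assumption rather than what the plot parameter does internally.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical hold-out predictions (USD) from three base models.
predictions = {
    "rf": [310000, 455000, 498000, 702000],
    "lightgbm": [305000, 460000, 505000, 690000],
    "knn": [350000, 400000, 560000, 610000],
}

# Correlation for every unordered pair of models.
names = list(predictions)
correlations = {
    (m1, m2): pearson(predictions[m1], predictions[m2])
    for i, m1 in enumerate(names) for m2 in names[i + 1:]
}
```

In this toy example the two tree-based models are almost perfectly correlated, so pairing one of them with the less correlated 'knn' model would add more diversity to a stack than using both.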

stack_models(estimator_list, meta_model=None, fold=10, round=4, restack=True, plot=False, finalize=False, verbose=True)

Model Interpretation

Finalize Model for Deployment

Model Persistence

Model Predictions

Model Evaluation for Test Data

Time taken