Data Description

In this project, we predict the probability that an auto insurance policy holder files a claim. This is a binary classification problem.

We have 595,212 records and 59 columns: the id, the target, and 57 features (including the pre-calculated calc features).

binary features: suffix _bin
categorical features: suffix _cat
continuous or ordinal features: groups ind, reg, car, calc
missing values: encoded as -1 (see the sketch below)

Full forms
ind = individual
reg = registration
car = car
calc = calculated

The target column signifies whether or not a claim was filed for that policy holder.
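Because missing values are coded as -1 rather than NaN, they are easy to tally once the training frame df is loaded below. A minimal sketch:

# count the -1 placeholders (this dataset's missing-value code) per column
n_missing = (df == -1).sum()
print(n_missing[n_missing > 0].sort_values(ascending=False))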

Evaluation Metric

From the Lorenz-curve diagram on Wikipedia, the Gini coefficient is G = A / (A + B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve. The Gini index varies between 0 and 1. In the classic income picture there are only two groups: rich and poor.

x-axis = number of people (cumulative share)
y-axis = total income (cumulative share)

0 = complete equality of income
1 = complete inequality of income


For this competition:
0 = random guessing
1 = maximum score (note that 2*1 - 1 = 1 when the maximum AUC of 1 is reached)

If we calculate Gini as gini = 2*auc - 1, it has range (-1, 1). For AUC:

random binary classifier AUC = 0.5
perfect binary classifier AUC = 1

If the AUC is less than 0.5, simply invert the predictions (0 <==> 1); the ROC AUC score will then lie between 0.5 and 1.0.
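A minimal sketch of these relationships on made-up data (assuming sklearn's roc_auc_score; y_true and y_prob are toy arrays, not the competition data):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)     # toy binary labels
y_prob = rng.rand(1000)              # toy predicted probabilities

auc = roc_auc_score(y_true, y_prob)  # lies in (0, 1)
gini = 2 * auc - 1                   # lies in (-1, 1)

# flipping the scores mirrors the AUC around 0.5, so a model with
# AUC < 0.5 can always be inverted into one with AUC > 0.5
print(np.isclose(roc_auc_score(y_true, 1 - y_prob), 1 - auc))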

Imports

In [ ]:
import os
import time
import gc
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint

%matplotlib inline
time_start_notebook = time.time()
SEED=100
print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('seaborn', '0.10.1'), ('matplotlib', '3.2.2')]
In [ ]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
In [ ]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
In [ ]:
# Google colab
In [ ]:
%%capture
# capture will not print in notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:

    # extra modules
    !pip install rgf_python # regularized greedy forest
    !pip install catboost

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.
In [ ]:
from catboost import CatBoostClassifier

# Regularized Greedy Forest
from rgf.sklearn import RGFClassifier     # https://github.com/fukatani/rgf_python

Useful Functions

In [ ]:
df_eval = pd.DataFrame({'Model': [],
                        'Description': [],
                        'Accuracy': [],
                        'Precision': [],
                        'Recall': [],
                        'F1': [],
                        'AUC': [],
                        'NormalizedGini': []})

Load the data

In [ ]:
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/'
    'Porto_seguro_safe_driver_prediction/train.csv.zip?raw=true',compression='zip')
print(df.shape)


# for a faster runtime, subsample the data:
# df = df.sample(frac=0.01, random_state=SEED)
df.head()
(595212, 59)
Out[ ]:
[df.head() output: first 5 rows × 59 columns — id, target, and the ps_* features]
In [ ]:
"""
Comment about file size:
The data is large, it has 595k records and 59 features.

ps = porto seguro
_bin = binary feature
_cat = categorical feature


continuous or ordinal: ind, reg, car, calc

""";
In [ ]:
target = 'target'

Data Processing

In [ ]:
# all features except the target
cols_all = df.columns.drop(target).to_list()

# categorical features (excluding any later-created count features)
cols_cat = [c for c in cols_all if ('cat' in c and 'count' not in c)]

# numeric features: exclude categorical and calc features
cols_num = [c for c in cols_all if ('cat' not in c and 'calc' not in c)]

print(cols_num)

# one-hot encode the categorical features
df = pd.get_dummies(df, columns=cols_cat, drop_first=True)
['id', 'ps_ind_01', 'ps_ind_03', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']
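
As a toy illustration (a made-up column, not part of the dataset) of what get_dummies with drop_first=True does to a _cat column — the -1 missing code simply becomes the dropped baseline category:

toy = pd.DataFrame({'ps_x_cat': [-1, 0, 1, 2, 1]})
print(pd.get_dummies(toy, columns=['ps_x_cat'], drop_first=True))
# columns: ps_x_cat_0, ps_x_cat_1, ps_x_cat_2 (-1 is the dropped baseline)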

Train-test Split with Stratification

In [ ]:
from sklearn.model_selection import train_test_split

df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1),df[target],
    test_size=0.2,random_state=SEED, stratify=df[target])

# backup and delete id
cols_drop = ['id']
train_id = df_Xtrain[cols_drop]
test_id = df_Xtest[cols_drop]
df_Xtrain = df_Xtrain.drop(cols_drop,axis=1)
df_Xtest = df_Xtest.drop(cols_drop,axis=1)

Xtrain = df_Xtrain.to_numpy()
ytrain = ser_ytrain.to_numpy().ravel()

Xtest = df_Xtest.to_numpy()
ytest = ser_ytest.to_numpy().ravel()

# sanity check: a finite numeric sum means no NaNs and no strings
print(Xtrain.sum().sum())
43512482.37562419

Training Data

In [ ]:
pd.set_option('display.max_columns',250)
df_Xtrain.head()
Out[ ]:
[df_Xtrain.head() output: first 5 rows × 213 columns after one-hot encoding]
In [ ]:
# df_Xtrain.columns  # make sure there is no id or index column
In [ ]:
Xtr = Xtrain
Xtx = Xtest
ytr = ytrain
ytx = ytest

print(Xtr.shape, Xtx.shape)
(476169, 213) (119043, 213)
In [ ]:
ser_ytest.value_counts(normalize=True)
Out[ ]:
0    0.963551
1    0.036449
Name: target, dtype: float64
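Only about 3.6% of policy holders filed a claim, so the classes are heavily imbalanced. This is why the split above is stratified on the target, and why AUC / normalized Gini rather than accuracy is used for evaluation.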
In [ ]:
# gini scoring function from the kernel at:
# https://www.kaggle.com/tezdhar/faster-gini-calculation
def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]  # actuals ordered by predicted score
    a_c = a_s.cumsum()              # cumulative count of positives
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    # normalize by the Gini of a perfect ranking
    return ginic(a, p) / ginic(a, a)
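
As a quick sanity check (a toy sketch, not part of the original notebook): for binary labels the normalized Gini should agree with 2*AUC - 1 from sklearn.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(SEED)            # SEED = 100, set above
y_toy = rng.randint(0, 2, 10000)             # toy binary labels
p_toy = rng.rand(10000)                      # toy predicted scores

print(gini_normalizedc(y_toy, p_toy))        # fast normalized Gini
print(2 * roc_auc_score(y_toy, p_toy) - 1)   # same value up to float error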

Drop the calc Features

In [ ]:
# remove calc features (widely reported to add little signal in this dataset)
cols_use = [c for c in df_Xtrain.columns if (not c.startswith('ps_calc_'))]

df_Xtrain = df_Xtrain[cols_use]
df_Xtest = df_Xtest[cols_use]
In [ ]:
class Ensemble:
    def __init__(self, n_splits, stacker, base_models, model_names):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models
        self.model_names = model_names

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T) # test


        skf = StratifiedKFold(n_splits=self.n_splits,
                            shuffle=True, random_state=SEED)

        folds = list(skf.split(X, y)) # we need to make list

        # stack outputs (ncolumns = len of models)
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))

        model_names = self.model_names
        time_start = time.time()
        for i, clf in enumerate(self.base_models):

            print('Model: ', model_names[i])

            # init test output for this model
            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print (f"  Fold {j+1}")
                clf.fit(X_train, y_train)
#                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
#                print("    cv AUC: %.5f" % (cross_score.mean()))
                y_prob = clf.predict_proba(X_holdout)[:,1]                

                S_train[test_idx, i] = y_prob
                S_test_i[:, j] = clf.predict_proba(T)[:,1]

                # time taken
                time_taken = time.time() - time_start
                h,m = divmod(time_taken,60*60)
                print('  Time taken : {:.0f} hr '\
                    '{:.0f} min {:.0f} secs\n'.format(h, *divmod(m,60)))
    
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        print("Stacker AUC: %.5f" % (results.mean()))

        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:,1]
        return res
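
In short, fit_predict builds the stacking features: for each base model, the out-of-fold predictions fill that model's column of S_train, the per-fold test predictions are averaged into S_test, and the stacker (logistic regression here) is then fit on S_train and applied to S_test to produce the final blended probabilities.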

Parameters

In [ ]:
# LightGBM params
lgb_params = {}
lgb_params['learning_rate'] = 0.02
lgb_params['n_estimators'] = 650
lgb_params['max_bin'] = 10
lgb_params['subsample'] = 0.8
lgb_params['subsample_freq'] = 10
lgb_params['colsample_bytree'] = 0.8   
lgb_params['min_child_samples'] = 500
lgb_params['seed'] = SEED


lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3   
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = SEED


lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = SEED


# RandomForest params
#rf_params = {}
#rf_params['n_estimators'] = 200
#rf_params['max_depth'] = 6
#rf_params['min_samples_split'] = 70
#rf_params['min_samples_leaf'] = 30
#rf_params['random_state'] = SEED


# ExtraTrees params
#et_params = {}
#et_params['n_estimators'] = 155
#et_params['max_features'] = 0.3
#et_params['max_depth'] = 6
#et_params['min_samples_split'] = 40
#et_params['min_samples_leaf'] = 18
#et_params['random_state'] = SEED

# XGBoost params
#xgb_params = {}
#xgb_params['objective'] = 'binary:logistic'
#xgb_params['learning_rate'] = 0.04
#xgb_params['n_estimators'] = 490
#xgb_params['max_depth'] = 4
#xgb_params['subsample'] = 0.9
#xgb_params['colsample_bytree'] = 0.9  
#xgb_params['min_child_weight'] = 10
#xgb_params['random_state'] = SEED


# CatBoost params
#cat_params = {}
#cat_params['iterations'] = 900
#cat_params['depth'] = 8
#cat_params['rsm'] = 0.95
#cat_params['learning_rate'] = 0.03
#cat_params['l2_leaf_reg'] = 3.5  
#cat_params['border_count'] = 8
#cat_params['gradient_iterations'] = 4
#cat_params['random_state'] = SEED


# Regularized Greedy Forest params
#rgf_params = {}
#rgf_params['max_leaf'] = 2000
#rgf_params['learning_rate'] = 0.5
#rgf_params['algorithm'] = "RGF_Sib"
#rgf_params['test_interval'] = 100
#rgf_params['min_samples_leaf'] = 3 
#rgf_params['reg_depth'] = 1.0
#rgf_params['l2'] = 0.5  
#rgf_params['sl2'] = 0.005

Models

In [ ]:
lgb_model = LGBMClassifier(**lgb_params)

lgb_model2 = LGBMClassifier(**lgb_params2)

lgb_model3 = LGBMClassifier(**lgb_params3)

#rf_model = RandomForestClassifier(**rf_params)

#et_model = ExtraTreesClassifier(**et_params)
        
#xgb_model = XGBClassifier(**xgb_params)

#cat_model = CatBoostClassifier(**cat_params)

#rgf_model = RGFClassifier(**rgf_params) 

#gb_model = GradientBoostingClassifier(max_depth=5)

#ada_model = AdaBoostClassifier()

log_model = LogisticRegression()

Stacking

In [ ]:
model_names = ['lgb1', 'lgb2', 'lgb3']
base_models = [lgb_model, lgb_model2, lgb_model3]

stack = Ensemble(n_splits=3,
                 stacker=log_model,
                 base_models=base_models,
                 model_names=model_names)
In [ ]:
yprobs = stack.fit_predict(df_Xtrain, ser_ytrain, df_Xtest)
score = gini_normalizedc(ser_ytest.to_numpy(), yprobs)
print('normalized gini score ', score)
Model:  lgb1
  Fold 1
  Time taken : 0 hr 0 min 50 secs

  Fold 2
  Time taken : 0 hr 1 min 41 secs

  Fold 3
  Time taken : 0 hr 2 min 33 secs

Model:  lgb2
  Fold 1
  Time taken : 0 hr 3 min 28 secs

  Fold 2
  Time taken : 0 hr 4 min 22 secs

  Fold 3
  Time taken : 0 hr 5 min 17 secs

Model:  lgb3
  Fold 1
  Time taken : 0 hr 6 min 13 secs

  Fold 2
  Time taken : 0 hr 7 min 8 secs

  Fold 3
  Time taken : 0 hr 8 min 4 secs

Stacker AUC: 0.63924
normalized gini score  0.29426379136619024
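Note that the stacker's CV AUC of 0.63924 corresponds to a normalized Gini of 2*0.63924 - 1 ≈ 0.278, broadly consistent with the holdout Gini of 0.294 above.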
In [ ]:
df_sub = pd.DataFrame({'id': test_id.to_numpy().ravel(),
                       'target': yprobs})

df_sub.head()
Out[ ]:
id target
0 1392758 0.033308
1 1273917 0.042591
2 28224 0.027074
3 228253 0.033086
4 641382 0.030880
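
To turn this frame into a Kaggle submission file, one would typically write it to csv (a minimal sketch; the filename is arbitrary):

df_sub.to_csv('submission.csv', index=False)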

Time Taken

In [ ]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))