Data Description

In this project, we predict the probability that an auto insurance policy holder files a claim. This is a binary classification problem.

We have 595,212 records and 59 columns: the id, the target, and 57 features (including the pre-calculated calc features).

binary features: suffix _bin
categorical features: suffix _cat
continuous or ordinal features: groups ind, reg, car, calc
missing values: encoded as -1 (see the sketch below)

Full forms
ind = individual
reg = registration
car = car
calc = calculated

The target column signifies whether or not a claim was filed for that policy holder.
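Because missing values are coded as -1 rather than NaN, they are easy to tally once the training frame df is loaded below. A minimal sketch:

# count the -1 placeholders (this dataset's missing-value code) per column
n_missing = (df == -1).sum()
print(n_missing[n_missing > 0].sort_values(ascending=False))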

Evaluation Metric

From the Lorenz-curve diagram on Wikipedia, the Gini coefficient is G = A / (A + B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve. The Gini index varies between 0 and 1. In the classic income picture there are only two groups: rich and poor.

x-axis = number of people (cumulative share)
y-axis = total income (cumulative share)

0 = complete equality of income
1 = complete inequality of income


For this competition:
0 = random guessing
1 = maximum score (note that 2*1 - 1 = 1 when the maximum AUC of 1 is reached)

If we calculate Gini as gini = 2*auc - 1, it has range (-1, 1). For AUC:

random binary classifier AUC = 0.5
perfect binary classifier AUC = 1

If the AUC is less than 0.5, simply invert the predictions (0 <==> 1); the ROC AUC score will then lie between 0.5 and 1.0.
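A minimal sketch of these relationships on made-up data (assuming sklearn's roc_auc_score; y_true and y_prob are toy arrays, not the competition data):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)     # toy binary labels
y_prob = rng.rand(1000)              # toy predicted probabilities

auc = roc_auc_score(y_true, y_prob)  # lies in (0, 1)
gini = 2 * auc - 1                   # lies in (-1, 1)

# flipping the scores mirrors the AUC around 0.5, so a model with
# AUC < 0.5 can always be inverted into one with AUC > 0.5
print(np.isclose(roc_auc_score(y_true, 1 - y_prob), 1 - auc))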

Imports

In [ ]:
import os
import time
import gc
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint

%matplotlib inline
time_start_notebook = time.time()
SEED=100
print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('seaborn', '0.10.1'), ('matplotlib', '3.2.2')]
In [ ]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
In [ ]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
In [ ]:
# Google colab
In [ ]:
%%capture
# capture will not print in notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:

    # extra modules
    !pip install rgf_python # regularized greedy forest
    !pip install catboost

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.
In [ ]:
from catboost import CatBoostClassifier

# Regularized Greedy Forest
from rgf.sklearn import RGFClassifier     # https://github.com/fukatani/rgf_python

Useful Functions

In [ ]:
df_eval = pd.DataFrame({'Model': [],
                        'Description': [],
                        'Accuracy': [],
                        'Precision': [],
                        'Recall': [],
                        'F1': [],
                        'AUC': [],
                        'NormalizedGini': []})

Load the data

In [ ]:
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/'
    'Porto_seguro_safe_driver_prediction/train.csv.zip?raw=true',compression='zip')
print(df.shape)


# for a faster runtime, subsample the data:
# df = df.sample(frac=0.01, random_state=SEED)
df.head()
(595212, 59)
Out[ ]:
[df.head() output: first 5 rows × 59 columns — id, target, and the ps_* features]
In [ ]:
"""
Comment about file size:
The data is large, it has 595k records and 59 features.

ps = porto seguro
_bin = binary feature
_cat = categorical feature


continuous or ordinal: ind, reg, car, calc

""";
In [ ]:
target = 'target'

Data Processing

In [ ]:
# all features except the target
cols_all = df.columns.drop(target).to_list()

# categorical features (excluding any later-created count features)
cols_cat = [c for c in cols_all if ('cat' in c and 'count' not in c)]

# numeric features: exclude categorical and calc features
cols_num = [c for c in cols_all if ('cat' not in c and 'calc' not in c)]

print(cols_num)

# one-hot encode the categorical features
df = pd.get_dummies(df, columns=cols_cat, drop_first=True)
['id', 'ps_ind_01', 'ps_ind_03', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']
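
As a toy illustration (a made-up column, not part of the dataset) of what get_dummies with drop_first=True does to a _cat column — the -1 missing code simply becomes the dropped baseline category:

toy = pd.DataFrame({'ps_x_cat': [-1, 0, 1, 2, 1]})
print(pd.get_dummies(toy, columns=['ps_x_cat'], drop_first=True))
# columns: ps_x_cat_0, ps_x_cat_1, ps_x_cat_2 (-1 is the dropped baseline)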

Train-test Split with Stratification

In [ ]:
from sklearn.model_selection import train_test_split

df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1),df[target],
    test_size=0.2,random_state=SEED, stratify=df[target])

# backup and delete id
cols_drop = ['id']
train_id = df_Xtrain[cols_drop]
test_id = df_Xtest[cols_drop]
df_Xtrain = df_Xtrain.drop(cols_drop,axis=1)
df_Xtest = df_Xtest.drop(cols_drop,axis=1)

Xtrain = df_Xtrain.to_numpy()
ytrain = ser_ytrain.to_numpy().ravel()

Xtest = df_Xtest.to_numpy()
ytest = ser_ytest.to_numpy().ravel()

# sanity check: a finite numeric sum means no NaNs and no strings
print(Xtrain.sum().sum())
43512482.37562419

Training Data

In [ ]:
pd.set_option('display.max_columns',250)
df_Xtrain.head()
Out[ ]:
[df_Xtrain.head() output: first 5 rows × 213 columns after one-hot encoding]
In [ ]:
# df_Xtrain.columns  # make sure there is no id or index column
In [ ]:
Xtr = Xtrain
Xtx = Xtest
ytr = ytrain
ytx = ytest

print(Xtr.shape, Xtx.shape)
(476169, 213) (119043, 213)
In [ ]:
ser_ytest.value_counts(normalize=True)
Out[ ]:
0    0.963551
1    0.036449
Name: target, dtype: float64
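Only about 3.6% of policy holders filed a claim, so the classes are heavily imbalanced. This is why the split above is stratified on the target, and why AUC / normalized Gini rather than accuracy is used for evaluation.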
In [ ]:
# gini scoring function from the kernel at:
# https://www.kaggle.com/tezdhar/faster-gini-calculation
def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]  # actuals ordered by predicted score
    a_c = a_s.cumsum()              # cumulative count of positives
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    # normalize by the Gini of a perfect ranking
    return ginic(a, p) / ginic(a, a)
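
As a quick sanity check (a toy sketch, not part of the original notebook): for binary labels the normalized Gini should agree with 2*AUC - 1 from sklearn.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(SEED)            # SEED = 100, set above
y_toy = rng.randint(0, 2, 10000)             # toy binary labels
p_toy = rng.rand(10000)                      # toy predicted scores

print(gini_normalizedc(y_toy, p_toy))        # fast normalized Gini
print(2 * roc_auc_score(y_toy, p_toy) - 1)   # same value up to float error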

Drop the calc Features

In [ ]:
# remove calc features (widely reported to add little signal in this dataset)
cols_use = [c for c in df_Xtrain.columns if (not c.startswith('ps_calc_'))]

df_Xtrain = df_Xtrain[cols_use]
df_Xtest = df_Xtest[cols_use]
In [ ]:
class Ensemble:
    def __init__(self, n_splits, stacker, base_models, model_names):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models
        self.model_names = model_names

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T) # test


        skf = StratifiedKFold(n_splits=self.n_splits,
                            shuffle=True, random_state=SEED)

        folds = list(skf.split(X, y)) # we need to make list

        # stack outputs (ncolumns = len of models)
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))

        model_names = self.model_names
        time_start = time.time()
        for i, clf in enumerate(self.base_models):

            print('Model: ', model_names[i])

            # init test output for this model
            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print (f"  Fold {j+1}")
                clf.fit(X_train, y_train)
#                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
#                print("    cv AUC: %.5f" % (cross_score.mean()))
                y_prob = clf.predict_proba(X_holdout)[:,1]                

                S_train[test_idx, i] = y_prob
                S_test_i[:, j] = clf.predict_proba(T)[:,1]

                # time taken
                time_taken = time.time() - time_start
                h,m = divmod(time_taken,60*60)
                print('  Time taken : {:.0f} hr '\
                    '{:.0f} min {:.0f} secs\n'.format(h, *divmod(m,60)))
    
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        print("Stacker AUC: %.5f" % (results.mean()))

        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:,1]
        return res
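
In short, fit_predict builds the stacking features: for each base model, the out-of-fold predictions fill that model's column of S_train, the per-fold test predictions are averaged into S_test, and the stacker (logistic regression here) is then fit on S_train and applied to S_test to produce the final blended probabilities.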

Parameters

In [ ]:
# LightGBM params
lgb_params = {}
lgb_params['learning_rate'] = 0.02
lgb_params['n_estimators'] = 650
lgb_params['max_bin'] = 10
lgb_params['subsample'] = 0.8
lgb_params['subsample_freq'] = 10
lgb_params['colsample_bytree'] = 0.8   
lgb_params['min_child_samples'] = 500
lgb_params['seed'] = SEED


lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3   
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = SEED


lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = SEED


# RandomForest params
#rf_params = {}
#rf_params['n_estimators'] = 200
#rf_params['max_depth'] = 6
#rf_params['min_samples_split'] = 70
#rf_params['min_samples_leaf'] = 30
#rf_params['random_state'] = SEED


# ExtraTrees params
#et_params = {}
#et_params['n_estimators'] = 155
#et_params['max_features'] = 0.3
#et_params['max_depth'] = 6
#et_params['min_samples_split'] = 40
#et_params['min_samples_leaf'] = 18
#et_params['random_state'] = SEED

# XGBoost params
#xgb_params = {}
#xgb_params['objective'] = 'binary:logistic'
#xgb_params['learning_rate'] = 0.04
#xgb_params['n_estimators'] = 490
#xgb_params['max_depth'] = 4
#xgb_params['subsample'] = 0.9
#xgb_params['colsample_bytree'] = 0.9  
#xgb_params['min_child_weight'] = 10
#xgb_params['random_state'] = SEED


# CatBoost params
#cat_params = {}
#cat_params['iterations'] = 900
#cat_params['depth'] = 8
#cat_params['rsm'] = 0.95
#cat_params['learning_rate'] = 0.03
#cat_params['l2_leaf_reg'] = 3.5  
#cat_params['border_count'] = 8
#cat_params['gradient_iterations'] = 4
#cat_params['random_state'] = SEED


# Regularized Greedy Forest params
#rgf_params = {}
#rgf_params['max_leaf'] = 2000
#rgf_params['learning_rate'] = 0.5
#rgf_params['algorithm'] = "RGF_Sib"
#rgf_params['test_interval'] = 100
#rgf_params['min_samples_leaf'] = 3 
#rgf_params['reg_depth'] = 1.0
#rgf_params['l2'] = 0.5  
#rgf_params['sl2'] = 0.005

Models

In [ ]:
lgb_model = LGBMClassifier(**lgb_params)

lgb_model2 = LGBMClassifier(**lgb_params2)

lgb_model3 = LGBMClassifier(**lgb_params3)

#rf_model = RandomForestClassifier(**rf_params)

#et_model = ExtraTreesClassifier(**et_params)
        
#xgb_model = XGBClassifier(**xgb_params)

#cat_model = CatBoostClassifier(**cat_params)

#rgf_model = RGFClassifier(**rgf_params) 

#gb_model = GradientBoostingClassifier(max_depth=5)

#ada_model = AdaBoostClassifier()

log_model = LogisticRegression()

Stacking

In [ ]:
model_names = ['lgb1', 'lgb2', 'lgb3']
base_models = [lgb_model, lgb_model2, lgb_model3]

stack = Ensemble(n_splits=3,
                 stacker=log_model,
                 base_models=base_models,
                 model_names=model_names)
In [ ]:
yprobs = stack.fit_predict(df_Xtrain, ser_ytrain, df_Xtest)
score = gini_normalizedc(ser_ytest.to_numpy(), yprobs)
print('normalized gini score ', score)
Model:  lgb1
  Fold 1
  Time taken : 0 hr 0 min 50 secs

  Fold 2
  Time taken : 0 hr 1 min 41 secs

  Fold 3
  Time taken : 0 hr 2 min 33 secs

Model:  lgb2
  Fold 1
  Time taken : 0 hr 3 min 28 secs

  Fold 2
  Time taken : 0 hr 4 min 22 secs

  Fold 3
  Time taken : 0 hr 5 min 17 secs

Model:  lgb3
  Fold 1
  Time taken : 0 hr 6 min 13 secs

  Fold 2
  Time taken : 0 hr 7 min 8 secs

  Fold 3
  Time taken : 0 hr 8 min 4 secs

Stacker AUC: 0.63924
normalized gini score  0.29426379136619024
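Note that the stacker's CV AUC of 0.63924 corresponds to a normalized Gini of 2*0.63924 - 1 ≈ 0.278, broadly consistent with the holdout Gini of 0.294 above.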
In [ ]:
df_sub = pd.DataFrame({'id': test_id.to_numpy().ravel(),
                       'target': yprobs})

df_sub.head()
Out[ ]:
id target
0 1392758 0.033308
1 1273917 0.042591
2 28224 0.027074
3 228253 0.033086
4 641382 0.030880
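
To turn this frame into a Kaggle submission file, one would typically write it to csv (a minimal sketch; the filename is arbitrary):

df_sub.to_csv('submission.csv', index=False)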

Time Taken

In [ ]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))