Data Description

In this project, we will predict the probability that an auto insurance policy holder files a claim. This is a binary classification problem.

We have more than half a million records and 59 features (including already calculated features).

binary features: _bin
categorical features: _cat
continuous or ordinal features: ind, reg, car, calc
missing values: -1

ind = individual
reg = registration
car = car
calc = calculated
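
Because missing values are stored as the sentinel -1 rather than as NaN, ordinary pandas missing-value checks will not find them. The sketch below shows one way to measure and convert them. The toy frame and its values are made up for illustration; only the column-name pattern follows the dataset.

```python
import pandas as pd

# Toy frame with made-up values (hypothetical data, not taken from the dataset)
toy = pd.DataFrame({
    'ps_ind_02_cat': [2, -1, 1, 4],
    'ps_reg_03': [0.71, -1.0, 0.58, -1.0],
    'ps_car_11': [2, 3, 1, 3],
})

# Fraction of rows per column where the sentinel -1 marks a missing value
missing_frac = (toy == -1).mean()

# Optional: turn the sentinel into NaN so that isna(), fillna() etc. see it
toy_nan = toy.mask(toy == -1)

print(missing_frac)
print(toy_nan.isna().sum())
```

Tree-based models can often work with the -1 sentinel directly, so converting to NaN is a choice rather than a requirement.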

The target column signifies whether or not a claim was filed for that policy holder.

Evaluation Metric

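From the Lorenz curve graph on Wikipedia, the Gini index is G = A / (A+B), where A is the area between the line of equality and the Lorenz curve and B is the area under the Lorenz curve. The Gini index varies between 0 and 1. Here we have only binary options: rich and poor.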

x-axis= number of people (cumulative sum)
y-axis = total income (cumulative sum)

0 = complete equality of richness
1 = complete inequality of richness
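
As a rough illustration of G = A / (A+B), the sketch below builds the Lorenz curve from a list of incomes and integrates it with the trapezoid rule. The function name `gini_coefficient` and the income values are assumptions made for this example.

```python
import numpy as np

def gini_coefficient(incomes):
    """Gini coefficient G = A / (A + B) from the Lorenz curve (trapezoid rule)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    # Lorenz curve: cumulative share of income vs cumulative share of people
    lorenz = np.concatenate([[0.0], np.cumsum(x) / x.sum()])
    # B = area under the Lorenz curve; A + B = 0.5 (area under the equality line)
    area_b = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0) / n
    return (0.5 - area_b) / 0.5

equal = gini_coefficient([10, 10, 10, 10])    # everyone has the same income
unequal = gini_coefficient([0, 0, 0, 100])    # one person holds everything
print(equal, unequal)
```

With a finite population of n people, the most unequal case gives (n - 1) / n rather than exactly 1; the value approaches 1 as n grows.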

This competition
0 = random guessing
1 = maximum score (also remember 2*1-1 = 1 when maximum auc is 1).

If we calculate Gini as gini = 2*auc - 1, it ranges from -1 to 1. For AUC:

worst useful binary classifier (random guessing) AUC = 0.5
perfect binary classifier AUC = 1

If the AUC is below 0.5, simply invert the labels (0 <==> 1) to get a ROC AUC score between 0.5 and 1.0.
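
The label inversion can be checked on a tiny example. This is a minimal sketch in plain Python: `auc_score` is a helper written for this illustration (it counts correctly ranked positive/negative pairs), and the labels and scores are made up.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied pairs count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.6, 0.7, 0.4, 0.65]          # made-up scores that mostly rank the wrong way

auc_bad = auc_score(y_true, scores)
y_flipped = [1 - y for y in y_true]     # invert 0 <==> 1
auc_flipped = auc_score(y_flipped, scores)

gini_bad = 2 * auc_bad - 1
gini_flipped = 2 * auc_flipped - 1
print(auc_bad, auc_flipped, gini_bad, gini_flipped)
```

Inverting the labels turns an AUC of a into 1 - a, so the Gini score simply changes sign.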


In [ ]:
import os
import time
import gc
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint

%matplotlib inline
time_start_notebook = time.time()
print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('seaborn', '0.10.1'), ('matplotlib', '3.2.2')]
In [ ]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
In [ ]:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
In [ ]:
# Google colab
In [ ]:
# capture will not print in notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    # extra modules
    !pip install rgf_python # regularized greedy forest
    !pip install catboost

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.
In [ ]:
from catboost import CatBoostClassifier

# Regularized Greedy Forest
from rgf.sklearn import RGFClassifier     #

Useful Functions

In [ ]:
df_eval = pd.DataFrame({'Model': [],
                        'NormalizedGini': []})

Load the data

In [ ]:
df = pd.read_csv(''

# faster runtime
# df = df.sample(frac=0.01,random_state=SEED)
(595212, 59)
Out[ ]:
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
0 7 0 2 2 5 1 0 0 1 0 0 0 0 0 0 0 11 0 1 0 0.7 0.2 0.718070 10 1 -1 0 1 4 1 0 0 1 12 2 0.400000 0.883679 0.370810 3.605551 0.6 0.5 0.2 3 1 10 1 10 1 5 9 1 5 8 0 1 1 0 0 1
1 9 0 1 1 7 0 0 0 0 1 0 0 0 0 0 0 3 0 0 1 0.8 0.4 0.766078 11 1 -1 0 -1 11 1 1 2 1 19 3 0.316228 0.618817 0.388716 2.449490 0.3 0.1 0.3 2 1 9 5 8 1 7 3 1 1 9 0 1 1 0 1 0
2 13 0 5 4 9 1 0 0 0 1 0 0 0 0 0 0 12 1 0 0 0.0 0.0 -1.000000 7 1 -1 0 -1 14 1 1 2 1 60 1 0.316228 0.641586 0.347275 3.316625 0.5 0.7 0.1 2 2 9 1 8 2 7 4 2 7 7 0 1 1 0 1 0
3 16 0 0 1 2 0 0 1 0 0 0 0 0 0 0 0 8 1 0 0 0.9 0.2 0.580948 7 1 0 0 1 11 1 1 3 1 104 1 0.374166 0.542949 0.294958 2.000000 0.6 0.9 0.1 2 4 7 1 8 4 2 2 2 4 9 0 0 0 0 0 0
4 17 0 0 2 0 1 0 1 0 0 0 0 0 0 0 0 9 1 0 0 0.7 0.6 0.840759 11 1 -1 0 -1 14 1 1 2 1 82 3 0.316070 0.565832 0.365103 2.000000 0.4 0.6 0.0 2 2 6 3 10 2 12 3 1 1 3 0 0 0 1 1 0
In [ ]:
Comment about file size:
The data is large: it has 595k records and 59 features.

ps = porto seguro
_bin = binary feature
_cat = categorical feature

continuous or ordinal: ind, reg, car, calc
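
The naming convention above is regular enough to parse mechanically. The following sketch shows one way to do so; the helper names `feature_kind` and `feature_group` are assumptions for this example and are not used elsewhere in the notebook.

```python
def feature_kind(col):
    """Classify a column by its name suffix."""
    if col.endswith('_bin'):
        return 'binary'
    if col.endswith('_cat'):
        return 'categorical'
    return 'continuous_or_ordinal'

def feature_group(col):
    """Return the group tag (ind, reg, car, calc) from a name like ps_car_01_cat."""
    return col.split('_')[1]

cols = ['ps_ind_06_bin', 'ps_car_01_cat', 'ps_reg_03', 'ps_calc_15_bin']
summary = {c: (feature_group(c), feature_kind(c)) for c in cols}

for name, (group, kind) in summary.items():
    print(f'{name:16s} group={group:5s} kind={kind}')
```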

In [ ]:
target = 'target'

Data Processing

In [ ]:
# all features except target
cols_all= df.columns.drop(target).to_list() 

# categorical features except later created count
cols_cat = [c for c in cols_all if ('cat' in c and 'count' not in c)]

# we exclude calc features in numeric features
cols_num = [c for c in cols_all if ('cat' not in c and 'calc' not in c)]


# ohe
df = pd.get_dummies(df,columns=cols_cat,drop_first=True)
['id', 'ps_ind_01', 'ps_ind_03', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']

Train-test Split with Stratify

In [ ]:
from sklearn.model_selection import train_test_split

df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1), df[target],
    test_size=0.2,random_state=SEED, stratify=df[target])

# backup and delete id
cols_drop = ['id']
train_id = df_Xtrain[cols_drop]
test_id = df_Xtest[cols_drop]
df_Xtrain = df_Xtrain.drop(cols_drop,axis=1)
df_Xtest = df_Xtest.drop(cols_drop,axis=1)

Xtrain = df_Xtrain.to_numpy()
ytrain = ser_ytrain.to_numpy().ravel()

Xtest = df_Xtest.to_numpy()
ytest = ser_ytest.to_numpy().ravel()

# make sure no nans and no strings

Training Data

In [ ]:
Out[ ]:
ps_ind_01 ps_ind_03 ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03 ps_car_11 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin ps_ind_02_cat_1 ps_ind_02_cat_2 ps_ind_02_cat_3 ps_ind_02_cat_4 ps_ind_04_cat_0 ps_ind_04_cat_1 ps_ind_05_cat_0 ps_ind_05_cat_1 ps_ind_05_cat_2 ps_ind_05_cat_3 ps_ind_05_cat_4 ps_ind_05_cat_5 ps_ind_05_cat_6 ps_car_01_cat_0 ps_car_01_cat_1 ps_car_01_cat_2 ps_car_01_cat_3 ps_car_01_cat_4 ps_car_01_cat_5 ps_car_01_cat_6 ps_car_01_cat_7 ps_car_01_cat_8 ps_car_01_cat_9 ps_car_01_cat_10 ps_car_01_cat_11 ps_car_02_cat_0 ps_car_02_cat_1 ps_car_03_cat_0 ps_car_03_cat_1 ps_car_04_cat_1 ps_car_04_cat_2 ps_car_04_cat_3 ps_car_04_cat_4 ps_car_04_cat_5 ps_car_04_cat_6 ps_car_04_cat_7 ps_car_04_cat_8 ps_car_04_cat_9 ps_car_05_cat_0 ps_car_05_cat_1 ps_car_06_cat_1 ps_car_06_cat_2 ps_car_06_cat_3 ps_car_06_cat_4 ps_car_06_cat_5 ps_car_06_cat_6 ps_car_06_cat_7 ps_car_06_cat_8 ps_car_06_cat_9 ps_car_06_cat_10 ps_car_06_cat_11 ps_car_06_cat_12 ps_car_06_cat_13 ps_car_06_cat_14 ps_car_06_cat_15 ps_car_06_cat_16 ps_car_06_cat_17 ps_car_07_cat_0 ps_car_07_cat_1 ps_car_08_cat_1 ps_car_09_cat_0 ps_car_09_cat_1 ps_car_09_cat_2 ps_car_09_cat_3 ps_car_09_cat_4 ps_car_10_cat_1 ps_car_10_cat_2 ps_car_11_cat_2 ps_car_11_cat_3 ps_car_11_cat_4 ps_car_11_cat_5 ps_car_11_cat_6 ps_car_11_cat_7 ps_car_11_cat_8 ps_car_11_cat_9 ps_car_11_cat_10 ps_car_11_cat_11 ps_car_11_cat_12 ps_car_11_cat_13 ps_car_11_cat_14 ps_car_11_cat_15 ps_car_11_cat_16 ps_car_11_cat_17 ps_car_11_cat_18 ps_car_11_cat_19 ps_car_11_cat_20 ps_car_11_cat_21 ps_car_11_cat_22 ps_car_11_cat_23 ps_car_11_cat_24 ps_car_11_cat_25 
ps_car_11_cat_26 ps_car_11_cat_27 ps_car_11_cat_28 ps_car_11_cat_29 ps_car_11_cat_30 ps_car_11_cat_31 ps_car_11_cat_32 ps_car_11_cat_33 ps_car_11_cat_34 ps_car_11_cat_35 ps_car_11_cat_36 ps_car_11_cat_37 ps_car_11_cat_38 ps_car_11_cat_39 ps_car_11_cat_40 ps_car_11_cat_41 ps_car_11_cat_42 ps_car_11_cat_43 ps_car_11_cat_44 ps_car_11_cat_45 ps_car_11_cat_46 ps_car_11_cat_47 ps_car_11_cat_48 ps_car_11_cat_49 ps_car_11_cat_50 ps_car_11_cat_51 ps_car_11_cat_52 ps_car_11_cat_53 ps_car_11_cat_54 ps_car_11_cat_55 ps_car_11_cat_56 ps_car_11_cat_57 ps_car_11_cat_58 ps_car_11_cat_59 ps_car_11_cat_60 ps_car_11_cat_61 ps_car_11_cat_62 ps_car_11_cat_63 ps_car_11_cat_64 ps_car_11_cat_65 ps_car_11_cat_66 ps_car_11_cat_67 ps_car_11_cat_68 ps_car_11_cat_69 ps_car_11_cat_70 ps_car_11_cat_71 ps_car_11_cat_72 ps_car_11_cat_73 ps_car_11_cat_74 ps_car_11_cat_75 ps_car_11_cat_76 ps_car_11_cat_77 ps_car_11_cat_78 ps_car_11_cat_79 ps_car_11_cat_80 ps_car_11_cat_81 ps_car_11_cat_82 ps_car_11_cat_83 ps_car_11_cat_84 ps_car_11_cat_85 ps_car_11_cat_86 ps_car_11_cat_87 ps_car_11_cat_88 ps_car_11_cat_89 ps_car_11_cat_90 ps_car_11_cat_91 ps_car_11_cat_92 ps_car_11_cat_93 ps_car_11_cat_94 ps_car_11_cat_95 ps_car_11_cat_96 ps_car_11_cat_97 ps_car_11_cat_98 ps_car_11_cat_99 ps_car_11_cat_100 ps_car_11_cat_101 ps_car_11_cat_102 ps_car_11_cat_103 ps_car_11_cat_104
422636 0 6 1 0 0 0 0 0 0 0 0 12 1 0 0 0.9 0.2 0.422788 3 0.316228 0.704575 0.368511 3.316625 0.6 0.6 0.9 4 1 6 4 11 2 5 8 4 7 5 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
374646 1 5 0 0 1 0 0 0 0 0 0 3 0 0 1 0.6 0.5 0.844837 2 0.316228 0.709149 0.368782 3.605551 0.2 0.2 0.2 2 3 7 1 10 3 14 8 2 6 7 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
380900 5 4 0 0 1 0 0 0 0 0 0 7 1 0 0 0.8 0.3 1.114114 2 0.374166 0.837845 0.401746 3.605551 0.2 0.3 0.4 2 3 8 4 11 2 9 10 2 4 13 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
318036 5 8 0 0 0 1 0 0 0 0 0 6 1 0 0 0.4 0.6 0.841130 3 0.447214 0.817862 0.424617 3.000000 0.5 0.6 0.3 1 4 8 7 7 5 15 4 2 1 10 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
7042 0 3 1 0 0 0 0 0 0 0 0 0 1 0 0 0.6 0.4 0.809707 3 0.446990 0.859379 0.451110 2.828427 0.7 0.7 0.8 3 3 8 3 9 3 4 8 2 2 7 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [ ]:
# df_Xtrain.columns # make sure there are no id and index
In [ ]:
Xtr = Xtrain
Xtx = Xtest
ytr = ytrain
ytx = ytest

print(Xtr.shape, Xtx.shape)
(476169, 213) (119043, 213)
In [ ]:
Out[ ]:
0    0.963551
1    0.036449
Name: target, dtype: float64
In [ ]:
#gini scoring function from kernel at: 
def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n
def gini_normalizedc(a, p):
    return ginic(a, p) / ginic(a, a)
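
As a sanity check, the normalized Gini from these functions should agree with 2*auc - 1 when the predictions contain no ties. The sketch below repeats the two functions so that it runs on its own; `pairwise_auc`, the seed, and the toy labels are assumptions made for this check.

```python
import numpy as np

def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]
    a_c = a_s.cumsum()
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    return ginic(a, p) / ginic(a, a)

def pairwise_auc(y, p):
    """AUC as the share of (positive, negative) pairs ranked correctly (no ties assumed)."""
    pos, neg = p[y == 1], p[y == 0]
    return np.mean(pos[:, None] > neg[None, :])

rng = np.random.default_rng(0)          # arbitrary seed for the toy data
y = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
p = rng.random(y.size)                  # made-up scores, distinct with probability 1

gini = gini_normalizedc(y, p)
auc = pairwise_auc(y, p)
print(gini, 2 * auc - 1)
```

When predictions are tied, `np.argsort` orders the tied rows arbitrarily, so the two quantities can differ slightly in that case.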

Data processing

In [ ]:
# remove calc features
cols_use = [c for c in df_Xtrain.columns if (not c.startswith('ps_calc_'))]

df_Xtrain = df_Xtrain[cols_use]
df_Xtest = df_Xtest[cols_use]
In [ ]:
class Ensemble():
    def __init__(self, n_splits, stacker, base_models, model_names):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models
        self.model_names = model_names

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T) # test

        skf = StratifiedKFold(n_splits=self.n_splits,
                            shuffle=True, random_state=SEED)

        folds = list(skf.split(X, y)) # we need to make list

        # stack outputs (ncolumns = len of models)
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))

        model_names = self.model_names
        time_start = time.time()
        for i, clf in enumerate(self.base_models):

            print('Model: ', model_names[i])

            # init test output for this model
            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print (f"  Fold {j+1}")
                clf.fit(X_train, y_train)
#                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
#                print("    cv AUC: %.5f" % (cross_score.mean()))
                y_prob = clf.predict_proba(X_holdout)[:,1]                

                S_train[test_idx, i] = y_prob
                S_test_i[:, j] = clf.predict_proba(T)[:,1]

                # time taken
                time_taken = time.time() - time_start
                h,m = divmod(time_taken,60*60)
                print('  Time taken : {:.0f} hr '\
                    '{:.0f} min {:.0f} secs\n'.format(h, *divmod(m,60)))
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        print("Stacker AUC: %.5f" % (results.mean()))
        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:,1]
        return res
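
The out-of-fold idea behind the class above can be tried on a small synthetic problem. This is only a sketch under stated assumptions: the data comes from `make_classification`, the base models are a shallow decision tree and a logistic regression rather than LightGBM, and all sizes and seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the insurance features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

base_models = [DecisionTreeClassifier(max_depth=3, random_state=0),
               LogisticRegression(max_iter=1000)]
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold probabilities: each training row is scored by a model that never saw it
S_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=skf, method='predict_proba')[:, 1]
    for m in base_models])

# Test-set meta-features: here each base model is refit on all training rows,
# whereas the class above averages the predictions of the per-fold models
S_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models])

# The stacker learns how to combine the base-model probabilities
stacker = LogisticRegression().fit(S_train, y_train)
y_prob = stacker.predict_proba(S_test)[:, 1]
print(S_train.shape, S_test.shape, y_prob[:5])
```

Training the stacker on out-of-fold predictions keeps it from seeing base-model outputs for rows those models were fitted on, which would otherwise make the base models look better than they are.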


In [ ]:
# LightGBM params
lgb_params = {}
lgb_params['learning_rate'] = 0.02
lgb_params['n_estimators'] = 650
lgb_params['max_bin'] = 10
lgb_params['subsample'] = 0.8
lgb_params['subsample_freq'] = 10
lgb_params['colsample_bytree'] = 0.8   
lgb_params['min_child_samples'] = 500
lgb_params['seed'] = SEED

lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3   
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = SEED

lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = SEED

# RandomForest params
#rf_params = {}
#rf_params['n_estimators'] = 200
#rf_params['max_depth'] = 6
#rf_params['min_samples_split'] = 70
#rf_params['min_samples_leaf'] = 30
#rf_params['random_state'] = SEED

# ExtraTrees params
#et_params = {}
#et_params['n_estimators'] = 155
#et_params['max_features'] = 0.3
#et_params['max_depth'] = 6
#et_params['min_samples_split'] = 40
#et_params['min_samples_leaf'] = 18
#et_params['random_state'] = SEED

# XGBoost params
#xgb_params = {}
#xgb_params['objective'] = 'binary:logistic'
#xgb_params['learning_rate'] = 0.04
#xgb_params['n_estimators'] = 490
#xgb_params['max_depth'] = 4
#xgb_params['subsample'] = 0.9
#xgb_params['colsample_bytree'] = 0.9  
#xgb_params['min_child_weight'] = 10
#xgb_params['random_state'] = SEED

# CatBoost params
#cat_params = {}
#cat_params['iterations'] = 900
#cat_params['depth'] = 8
#cat_params['rsm'] = 0.95
#cat_params['learning_rate'] = 0.03
#cat_params['l2_leaf_reg'] = 3.5  
#cat_params['border_count'] = 8
#cat_params['gradient_iterations'] = 4
#cat_params['random_state'] = SEED

# Regularized Greedy Forest params
#rgf_params = {}
#rgf_params['max_leaf'] = 2000
#rgf_params['learning_rate'] = 0.5
#rgf_params['algorithm'] = "RGF_Sib"
#rgf_params['test_interval'] = 100
#rgf_params['min_samples_leaf'] = 3 
#rgf_params['reg_depth'] = 1.0
#rgf_params['l2'] = 0.5  
#rgf_params['sl2'] = 0.005


In [ ]:
lgb_model = LGBMClassifier(**lgb_params)

lgb_model2 = LGBMClassifier(**lgb_params2)

lgb_model3 = LGBMClassifier(**lgb_params3)

#rf_model = RandomForestClassifier(**rf_params)

#et_model = ExtraTreesClassifier(**et_params)
#xgb_model = XGBClassifier(**xgb_params)

#cat_model = CatBoostClassifier(**cat_params)

#rgf_model = RGFClassifier(**rgf_params) 

#gb_model = GradientBoostingClassifier(max_depth=5)

#ada_model = AdaBoostClassifier()

log_model = LogisticRegression()


In [ ]:
model_names = ['lgb1','lgb2','lgb3']
base_models = [lgb_model, lgb_model2, lgb_model3]

stack = Ensemble(n_splits=3,
        stacker = log_model,
        base_models = base_models,
        model_names = model_names)
In [ ]:
yprobs = stack.fit_predict(df_Xtrain, ser_ytrain, df_Xtest)
score = gini_normalizedc(ser_ytest.to_numpy(), yprobs)
print('normalized gini score ', score)
Model:  lgb1
  Fold 1
  Time taken : 0 hr 0 min 50 secs

  Fold 2
  Time taken : 0 hr 1 min 41 secs

  Fold 3
  Time taken : 0 hr 2 min 33 secs

Model:  lgb2
  Fold 1
  Time taken : 0 hr 3 min 28 secs

  Fold 2
  Time taken : 0 hr 4 min 22 secs

  Fold 3
  Time taken : 0 hr 5 min 17 secs

Model:  lgb3
  Fold 1
  Time taken : 0 hr 6 min 13 secs

  Fold 2
  Time taken : 0 hr 7 min 8 secs

  Fold 3
  Time taken : 0 hr 8 min 4 secs

Stacker AUC: 0.63924
normalized gini score  0.29426379136619024
In [ ]:
df_sub = pd.DataFrame({'id': test_id.to_numpy().ravel(),
                       'target': yprobs})

Out[ ]:
id target
0 1392758 0.033308
1 1273917 0.042591
2 28224 0.027074
3 228253 0.033086
4 641382 0.030880

Time Taken

In [ ]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))