Data Description

In this project, we predict the probability that an auto insurance policy holder files a claim. This is a binary classification problem.

We have more than half a million records and 59 features (including pre-calculated features).

binary features: postfix _bin
categorical features: postfix _cat
continuous or ordinal features: prefixes ind, reg, car, calc
missing values: encoded as -1

Full forms
ind = individual
reg = registration
car = car
calc = calculated

The target column signifies whether or not a claim was filed for that policy holder.
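As a quick check of the missing-value convention, here is a minimal sketch (assuming df is the training dataframe loaded in the cells below) that counts the -1 placeholders per column:

# count the -1 placeholders that encode missing values, per column
n_missing = (df == -1).sum()
print(n_missing[n_missing > 0].sort_values(ascending=False).head())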

Evaluation Metric

From the Lorenz-curve figure on Wikipedia, the Gini coefficient is G = A / (A + B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve. The Gini index varies between 0 and 1. In the classic income example there are only two groups: rich and poor.

x-axis = cumulative share of people
y-axis = cumulative share of total income

0 = complete equality of income
1 = complete inequality of income


This competition
0 = random guessing
1 = maximum score (note that 2*1 - 1 = 1 when the maximum AUC is 1)

If we calculate gini as gini = 2*AUC - 1, it has range (-1, 1). For AUC:

random binary classifier AUC = 0.5
perfect binary classifier AUC = 1

If the AUC is less than 0.5, simply invert the predictions (0 <==> 1) and the ROC AUC score will fall between 0.5 and 1.0.
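A minimal sketch of this inversion trick and the gini-AUC relation, on hypothetical y_true/y_prob arrays (sklearn is assumed to be available):

from sklearn.metrics import roc_auc_score
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.7])  # anti-correlated scores

auc = roc_auc_score(y_true, y_prob)   # 0.0, worse than random
if auc < 0.5:
    y_prob = 1 - y_prob               # invert the predictions
    auc = roc_auc_score(y_true, y_prob)
print(auc)                            # now 1.0, in [0.5, 1.0]
print(2*auc - 1)                      # normalized gini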

Imports

In [1]:
import os
import time
import gc
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint

%matplotlib inline
time_start_notebook = time.time()
SEED=100
print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])

from scipy import sparse as ssp
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('seaborn', '0.10.1'), ('matplotlib', '3.2.2')]
In [2]:
# Google colab
In [26]:
%%capture
# %%capture suppresses this cell's output in the notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:

    # deep learning
    !pip install lrcurve

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.

Useful Functions

In [4]:
df_eval = pd.DataFrame({'Model': [],
                        'Description':[],
                        'Accuracy':[],
                        'Precision':[],
                        'Recall':[],
                        'F1':[],
                        'AUC':[],
                        'NormalizedGini': []
                    })
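A small hypothetical helper (not part of the original notebook) showing how rows could be appended to df_eval; DataFrame.append exists in the pandas 1.0.x printed above:

def add_eval_row(df_eval, model, desc, acc, prec, rec, f1, auc, gini):
    # append one model's metrics as a new row (hypothetical helper)
    row = {'Model': model, 'Description': desc, 'Accuracy': acc,
           'Precision': prec, 'Recall': rec, 'F1': f1,
           'AUC': auc, 'NormalizedGini': gini}
    return df_eval.append(row, ignore_index=True)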

Load the data

In [5]:
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/'
    'Porto_seguro_safe_driver_prediction/train.csv.zip?raw=true',compression='zip')
print(df.shape)


# for neural nets, make the data small
# df = df.sample(frac=0.01,random_state=SEED)
df.head()
(595212, 59)
Out[5]:
[Output: first 5 rows of the dataframe — id, target, and 57 feature columns (ps_ind_*, ps_reg_*, ps_car_*, ps_calc_*); missing values appear as -1.]
In [6]:
"""
Comment about the data size:
The data is fairly large: 595k records and 59 features.

ps = porto seguro
_bin = binary feature
_cat = categorical feature


continuous or ordinal: ind, reg, car, calc

""";
In [7]:
target = 'target'

Data Processing

In [8]:
# all features except target
cols_all= df.columns.drop(target).to_list() 

# categorical features (excluding any count features created later)
cols_cat = [c for c in cols_all if ('cat' in c and 'count' not in c)]

# numeric features: exclude categorical and calc features
# (note: 'id' is still in this list; it is dropped after the split)
cols_num = [c for c in cols_all if ('cat' not in c and 'calc' not in c)]

print(cols_num)
['id', 'ps_ind_01', 'ps_ind_03', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']

Stratified Train-test Split

In [9]:
from sklearn.model_selection import train_test_split

df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1),df[target],
    test_size=0.2,random_state=SEED, stratify=df[target])

# backup and delete id
cols_drop = ['id']
train_id = df_Xtrain[cols_drop]
test_id = df_Xtest[cols_drop]
df_Xtrain = df_Xtrain.drop(cols_drop,axis=1)
df_Xtest = df_Xtest.drop(cols_drop,axis=1)

Xtrain = df_Xtrain.to_numpy()
ytrain = ser_ytrain.to_numpy().ravel()

Xtest = df_Xtest.to_numpy()
ytest = ser_ytest.to_numpy().ravel()

# sanity check: the sum raises for strings and is NaN if NaNs are present
print(Xtrain.sum().sum())
78071313.37562414
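A quick sanity sketch, using the variables above, that the stratified split preserved the ~3.6% positive rate in both halves:

# stratification check: both positive rates should be ~0.0364
print(ser_ytrain.mean(), ser_ytest.mean())
assert abs(ser_ytrain.mean() - ser_ytest.mean()) < 1e-3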

Training Data

In [10]:
pd.set_option('display.max_columns',250)
df_Xtrain.head()
Out[10]:
[Output: first 5 rows of df_Xtrain — the 57 feature columns, with id and target removed.]
In [11]:
# df_Xtrain.columns # make sure there are no id and index
In [12]:
Xtr = Xtrain
Xtx = Xtest
ytr = ytrain
ytx = ytest

print(Xtr.shape, Xtx.shape)
(476169, 57) (119043, 57)
In [13]:
ser_ytest.value_counts(normalize=True)
Out[13]:
0    0.963551
1    0.036449
Name: target, dtype: float64
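With only ~3.6% positives, plain accuracy is nearly meaningless; a sketch of the trivial all-zero baseline using ser_ytest from above:

# the always-predict-0 baseline already scores ~96.4% accuracy,
# which is why AUC / normalized gini is used instead
baseline_acc = (ser_ytest == 0).mean()
print(baseline_acc)  # ~0.9636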
In [14]:
#gini scoring function from kernel at: 
#https://www.kaggle.com/tezdhar/faster-gini-calculation
def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]   # actuals ordered by predicted score
    a_c = a_s.cumsum()               # running total of positives (Lorenz curve)
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    # normalize by the gini of a perfect ranking
    return ginic(a, p) / ginic(a, a)
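A quick sanity sketch (random data, names from the cells above) checking that this fast implementation agrees with gini = 2*AUC - 1:

from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(SEED)
y = rng.randint(0, 2, 1000)      # random binary labels
p = rng.rand(1000)               # random scores, no ties
print(gini_normalizedc(y, p))    # ~equal to the line below
print(2*roc_auc_score(y, p) - 1)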

Data Processing: Drop calc Features

In [15]:
# remove calc features
cols_use = [c for c in df_Xtrain.columns if (not c.startswith('ps_calc_'))]

df_Xtrain = df_Xtrain[cols_use]
df_Xtest = df_Xtest[cols_use]

Build embedding network

In [16]:
# for categorical columns with nunique > 2, build the embedding dict
cols_cat = [i for i in df_Xtrain.columns if i.endswith('_cat')]
# print(cols_cat)

df_emb = df_Xtrain[cols_cat].nunique().loc[lambda x: x>2].to_frame('nunique')
# df_emb
In [17]:
df_emb['nunique'].values
Out[17]:
array([  5,   3,   8,  13,   3,   3,  10,   3,  18,   3,   6,   3, 104])
In [18]:
# (nunique, embedding_dim) pairs: dims hand-picked, roughly half the
# cardinality and capped at 10
df_emb['size'] = [ (5,3),(3,2),(8,5),(13,7), (3,2),
                (3,2), (10,5), (3,2), (18,8), (3,2),
                (6,3), (3,2), (104,10) ]

df_emb
Out[18]:
nunique size
ps_ind_02_cat 5 (5, 3)
ps_ind_04_cat 3 (3, 2)
ps_ind_05_cat 8 (8, 5)
ps_car_01_cat 13 (13, 7)
ps_car_02_cat 3 (3, 2)
ps_car_03_cat 3 (3, 2)
ps_car_04_cat 10 (10, 5)
ps_car_05_cat 3 (3, 2)
ps_car_06_cat 18 (18, 8)
ps_car_07_cat 3 (3, 2)
ps_car_09_cat 6 (6, 3)
ps_car_10_cat 3 (3, 2)
ps_car_11_cat 104 (104, 10)
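The hand-picked sizes above roughly follow a common rule of thumb: embedding dimension about half the cardinality, capped around 10. A sketch of that heuristic (the exact rule is an assumption reverse-engineered from the table, and it does not reproduce every entry):

import math

# heuristic sketch: dim = ceil(nunique / 2), capped at 10
heuristic = {n: (n, min(10, math.ceil(n / 2))) for n in df_emb['nunique']}
print(heuristic)  # matches most hand-picked sizes, e.g. (13, 7), (104, 10)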
In [19]:
dict_emb = df_emb['size'].to_dict()
dict_emb
Out[19]:
{'ps_car_01_cat': (13, 7),
 'ps_car_02_cat': (3, 2),
 'ps_car_03_cat': (3, 2),
 'ps_car_04_cat': (10, 5),
 'ps_car_05_cat': (3, 2),
 'ps_car_06_cat': (18, 8),
 'ps_car_07_cat': (3, 2),
 'ps_car_09_cat': (6, 3),
 'ps_car_10_cat': (3, 2),
 'ps_car_11_cat': (104, 10),
 'ps_ind_02_cat': (5, 3),
 'ps_ind_04_cat': (3, 2),
 'ps_ind_05_cat': (8, 5)}
In [20]:
def build_embedding_network():
    """Build the entity-embedding network.

    Uses the global
    ---------------
    dict_emb: embedding dict, e.g. {'mycol': (10, 8), 'mycol2': (3, 2)}
              means mycol originally has 10 unique categorical values
              and we embed them in an 8-dimensional space.

    Usage
    -----
    NN = build_embedding_network()
    NN.fit(proc_Xtr, ser_ytr.values)

    """
    inputs = []
    embeddings = []

    for key in dict_emb.keys():
        inp = Input(shape=(1,))        # one integer-encoded category
        x, y = dict_emb[key]           # (n_categories, embedding_dim)
        embedding = Embedding(x, y, input_length=1)(inp)
        embedding = Reshape(target_shape=(y,))(embedding)
        inputs.append(inp)
        embeddings.append(embedding)

    # 24 numeric (non-embedded) columns remain after dropping id,
    # the calc features, and the 13 embedded categorical columns
    input_numeric = Input(shape=(24,))
    embedding_numeric = Dense(16)(input_numeric)
    inputs.append(input_numeric)
    embeddings.append(embedding_numeric)

    x = Concatenate()(embeddings)
    x = Dense(80, activation='relu')(x)
    x = Dropout(.35)(x)
    x = Dense(20, activation='relu')(x)
    x = Dropout(.15)(x)
    x = Dense(10, activation='relu')(x)
    x = Dropout(.15)(x)
    output = Dense(1, activation='sigmoid')(x)

    model = Model(inputs, output)

    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model
In [21]:
# converting data to list format to match the network structure
def preproc(df_Xtr, df_Xvd, df_Xtx):
    """Preprocessing data for neural network fitting.

    Parameters
    -----------
    df_Xtr: training dataframe
    df_Xvd: validation dataframe
    df_Xtx: test dataframe
    dict_emb: embedding dict eg. {'mycol': (10,8), 'mycols2': (3,2)} 
              mycol has originally 8 unique categorical values but
              we want to embed 8 dimensional space.

    Usage
    -----
    proc_Xtr, proc_Xvd, proc_Xtx = preproc(df_Xtr,df_Xvd, df_Xtx)
    NN = build_embedding_network()
    NN.fit(proc_Xtr, ser_ytr.values)

    """
    input_list_train = []
    input_list_val = []
    input_list_test = []
 
    # the columns to be embedded: integer-encode to the range [0, n_values)
    for c in dict_emb.keys():
        raw_vals = np.unique(df_Xtr[c])
        val_map = {val: i for i, val in enumerate(raw_vals)}
        input_list_train.append(df_Xtr[c].map(val_map).values)
        # categories unseen in training map to NaN; fill with 0
        input_list_val.append(df_Xvd[c].map(val_map).fillna(0).values)
        input_list_test.append(df_Xtx[c].map(val_map).fillna(0).values)

    # the remaining (non-embedded) columns, passed through as one block
    other_cols = [c for c in df_Xtr.columns if c not in dict_emb]
    input_list_train.append(df_Xtr[other_cols].values)
    input_list_val.append(df_Xvd[other_cols].values)
    input_list_test.append(df_Xtx[other_cols].values)
    
    return input_list_train, input_list_val, input_list_test 
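One detail worth noting in preproc: categories that appear in the validation or test data but not in training map to NaN and are then filled with 0, i.e., collapsed into the first training category. A toy sketch of that behavior (hypothetical values):

# toy sketch: the unseen category 99 maps to NaN, then fillna(0) -> index 0
s_tr = pd.Series([10, 20, 20, 30])
val_map = {v: i for i, v in enumerate(np.unique(s_tr))}  # {10: 0, 20: 1, 30: 2}
s_vd = pd.Series([20, 99])
print(s_vd.map(val_map).fillna(0).values)  # [1. 0.]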
In [28]:
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout
from keras.layers.embeddings import Embedding

from sklearn.model_selection import StratifiedKFold

from lrcurve import KerasLearningCurve
In [30]:
K = 5 # number of folds
runs_per_fold = 3
n_epochs = 15 # increase for a real run, e.g. 100
cv_ginis = []
trprobs = np.zeros(np.shape(df_Xtrain)[0]) # Ntrain rows (full validation set)
txprobs = np.zeros((np.shape(df_Xtest)[0],K)) # Ntest rows, K columns

skf = StratifiedKFold(n_splits=K, random_state=SEED, shuffle=True) 

time_start = time.time()
for i, (idx_tr, idx_vd) in enumerate(skf.split(df_Xtrain.to_numpy(),
                                               ser_ytrain.to_numpy())):
    # print
    print( "\nFold ", i)

    # data for this fold
    df_Xtr = df_Xtrain.iloc[idx_tr,:].copy()
    ser_ytr = ser_ytrain.iloc[idx_tr].copy()
    df_Xvd = df_Xtrain.iloc[idx_vd,:].copy()
    ser_yvd = ser_ytrain.iloc[idx_vd].copy()
    df_Xtx = df_Xtest.copy()

    # upsampling 
    pos = (ser_ytr == 1)
    
    # add positive examples
    df_Xtr  = pd.concat([df_Xtr, df_Xtr.loc[pos]], axis=0)
    ser_ytr = pd.concat([ser_ytr, ser_ytr.loc[pos]], axis=0)
    
    # shuffle data
    idx = np.arange(len(df_Xtr))
    np.random.shuffle(idx)
    df_Xtr = df_Xtr.iloc[idx]
    ser_ytr = ser_ytr.iloc[idx]

    # preprocessing
    proc_Xtr, proc_Xvd, proc_Xtx = preproc(df_Xtr,df_Xvd, df_Xtx)

    # out-of-fold predictions for this fold, averaged over the runs
    vdprobs = 0  # must be initialized before the inner loop

    for j in range(runs_per_fold):
    
        NN = build_embedding_network()
        
        NN.fit(proc_Xtr, ser_ytr.values,
               validation_data = (proc_Xvd, ser_yvd.to_numpy()),
               callbacks=[KerasLearningCurve()],
               epochs=n_epochs,batch_size=4096, verbose=0)

        vdprobs += NN.predict(proc_Xvd)[:,0] / runs_per_fold
        txprobs[:,i] += NN.predict(proc_Xtx)[:,0] / runs_per_fold

    trprobs[idx_vd] += vdprobs # the oof predictions jointly cover the full training set
    cv_gini = gini_normalizedc(ser_yvd.values, vdprobs)
    cv_ginis.append(cv_gini)

    print(f'\n  cv gini: {cv_gini:.5f}')

    # clean memory
    del df_Xtr, df_Xvd, df_Xtx, proc_Xtr, proc_Xvd, proc_Xtx

    # time taken
    time_taken = time.time() - time_start
    h,m = divmod(time_taken,60*60)
    print('  Time taken        : {:.0f} hr '\
        '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))

# outside the loop
txprobs = np.mean(txprobs, axis=1)

print('Mean out of fold gini: %.5f' % np.mean(cv_ginis))
print('Full validation gini: %.5f' % gini_normalizedc(ser_ytrain.values,
                                                      trprobs))
Mean out of fold gini: 0.27109
Full validation gini: 0.27008
In [ ]:
# df_sub = pd.DataFrame({'id': test_id['id'].values, 'target': txprobs})
# df_sub.to_csv('NN_EntityEmbed_10fold-sub.csv', index=False)