Data Description

In this project, we predict the probability that an auto insurance policy holder files a claim. This is a binary classification problem.

We have more than half a million records and 59 features (including pre-calculated features).

binary features: postfix _bin
categorical features: postfix _cat
continuous or ordinal features: prefixes ind, reg, car, calc
missing values: encoded as -1

Full forms
ind = individual
reg = registration
car = car
calc = calculated

The target column signifies whether or not a claim was filed for that policy holder.
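As a quick check of the missing-value convention, here is a minimal sketch (assuming df is the training dataframe loaded in the cells below) that counts the -1 placeholders per column:

# count the -1 placeholders that encode missing values, per column
n_missing = (df == -1).sum()
print(n_missing[n_missing > 0].sort_values(ascending=False).head())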

Evaluation Metric

From the Lorenz-curve figure on Wikipedia, the Gini coefficient is G = A / (A + B), where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve. The Gini index varies between 0 and 1. In the classic income example there are only two groups: rich and poor.

x-axis = cumulative share of people
y-axis = cumulative share of total income

0 = complete equality of income
1 = complete inequality of income


This competition
0 = random guessing
1 = maximum score (note that 2*1 - 1 = 1 when the maximum AUC is 1)

If we calculate gini as gini = 2*AUC - 1, it has range (-1, 1). For AUC:

random binary classifier AUC = 0.5
perfect binary classifier AUC = 1

If the AUC is less than 0.5, simply invert the predictions (0 <==> 1) and the ROC AUC score will fall between 0.5 and 1.0.
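A minimal sketch of this inversion trick and the gini-AUC relation, on hypothetical y_true/y_prob arrays (sklearn is assumed to be available):

from sklearn.metrics import roc_auc_score
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.7])  # anti-correlated scores

auc = roc_auc_score(y_true, y_prob)   # 0.0, worse than random
if auc < 0.5:
    y_prob = 1 - y_prob               # invert the predictions
    auc = roc_auc_score(y_true, y_prob)
print(auc)                            # now 1.0, in [0.5, 1.0]
print(2*auc - 1)                      # normalized gini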

Imports

In [1]:
import os
import time
import gc
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint

%matplotlib inline
time_start_notebook = time.time()
SEED=100
print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])

from scipy import sparse as ssp
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('seaborn', '0.10.1'), ('matplotlib', '3.2.2')]
In [2]:
# Google colab
In [26]:
%%capture
# %%capture suppresses this cell's output in the notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:

    # deep learning
    !pip install lrcurve

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.

Useful Functions

In [4]:
df_eval = pd.DataFrame({'Model': [],
                        'Description':[],
                        'Accuracy':[],
                        'Precision':[],
                        'Recall':[],
                        'F1':[],
                        'AUC':[],
                        'NormalizedGini': []
                    })
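A small hypothetical helper (not part of the original notebook) showing how rows could be appended to df_eval; DataFrame.append exists in the pandas 1.0.x printed above:

def add_eval_row(df_eval, model, desc, acc, prec, rec, f1, auc, gini):
    # append one model's metrics as a new row (hypothetical helper)
    row = {'Model': model, 'Description': desc, 'Accuracy': acc,
           'Precision': prec, 'Recall': rec, 'F1': f1,
           'AUC': auc, 'NormalizedGini': gini}
    return df_eval.append(row, ignore_index=True)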

Load the data

In [5]:
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/'
    'Porto_seguro_safe_driver_prediction/train.csv.zip?raw=true',compression='zip')
print(df.shape)


# for neural nets, make the data small
# df = df.sample(frac=0.01,random_state=SEED)
df.head()
(595212, 59)
Out[5]:
[Output: first 5 rows of the dataframe — id, target, and 57 feature columns (ps_ind_*, ps_reg_*, ps_car_*, ps_calc_*); missing values appear as -1.]
In [6]:
"""
Comment about the data size:
The data is fairly large: 595k records and 59 features.

ps = porto seguro
_bin = binary feature
_cat = categorical feature


continuous or ordinal: ind, reg, car, calc

""";
In [7]:
target = 'target'

Data Processing

In [8]:
# all features except target
cols_all= df.columns.drop(target).to_list() 

# categorical features (excluding any count features created later)
cols_cat = [c for c in cols_all if ('cat' in c and 'count' not in c)]

# numeric features: exclude categorical and calc features
# (note: 'id' is still in this list; it is dropped after the split)
cols_num = [c for c in cols_all if ('cat' not in c and 'calc' not in c)]

print(cols_num)
['id', 'ps_ind_01', 'ps_ind_03', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']

Stratified Train-test Split

In [9]:
from sklearn.model_selection import train_test_split

df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1),df[target],
    test_size=0.2,random_state=SEED, stratify=df[target])

# backup and delete id
cols_drop = ['id']
train_id = df_Xtrain[cols_drop]
test_id = df_Xtest[cols_drop]
df_Xtrain = df_Xtrain.drop(cols_drop,axis=1)
df_Xtest = df_Xtest.drop(cols_drop,axis=1)

Xtrain = df_Xtrain.to_numpy()
ytrain = ser_ytrain.to_numpy().ravel()

Xtest = df_Xtest.to_numpy()
ytest = ser_ytest.to_numpy().ravel()

# sanity check: the sum raises for strings and is NaN if NaNs are present
print(Xtrain.sum().sum())
78071313.37562414
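A quick sanity sketch, using the variables above, that the stratified split preserved the ~3.6% positive rate in both halves:

# stratification check: both positive rates should be ~0.0364
print(ser_ytrain.mean(), ser_ytest.mean())
assert abs(ser_ytrain.mean() - ser_ytest.mean()) < 1e-3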

Training Data

In [10]:
pd.set_option('display.max_columns',250)
df_Xtrain.head()
Out[10]:
[Output: first 5 rows of df_Xtrain — the 57 feature columns, with id and target removed.]
In [11]:
# df_Xtrain.columns # make sure there are no id and index
In [12]:
Xtr = Xtrain
Xtx = Xtest
ytr = ytrain
ytx = ytest

print(Xtr.shape, Xtx.shape)
(476169, 57) (119043, 57)
In [13]:
ser_ytest.value_counts(normalize=True)
Out[13]:
0    0.963551
1    0.036449
Name: target, dtype: float64
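With only ~3.6% positives, plain accuracy is nearly meaningless; a sketch of the trivial all-zero baseline using ser_ytest from above:

# the always-predict-0 baseline already scores ~96.4% accuracy,
# which is why AUC / normalized gini is used instead
baseline_acc = (ser_ytest == 0).mean()
print(baseline_acc)  # ~0.9636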
In [14]:
#gini scoring function from kernel at: 
#https://www.kaggle.com/tezdhar/faster-gini-calculation
def ginic(actual, pred):
    n = len(actual)
    a_s = actual[np.argsort(pred)]   # actuals ordered by predicted score
    a_c = a_s.cumsum()               # running total of positives (Lorenz curve)
    giniSum = a_c.sum() / a_c[-1] - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedc(a, p):
    # normalize by the gini of a perfect ranking
    return ginic(a, p) / ginic(a, a)
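A quick sanity sketch (random data, names from the cells above) checking that this fast implementation agrees with gini = 2*AUC - 1:

from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(SEED)
y = rng.randint(0, 2, 1000)      # random binary labels
p = rng.rand(1000)               # random scores, no ties
print(gini_normalizedc(y, p))    # ~equal to the line below
print(2*roc_auc_score(y, p) - 1)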

Data Processing: Drop calc Features

In [15]:
# remove calc features
cols_use = [c for c in df_Xtrain.columns if (not c.startswith('ps_calc_'))]

df_Xtrain = df_Xtrain[cols_use]
df_Xtest = df_Xtest[cols_use]

Build embedding network

In [16]:
# for categorical columns with nunique > 2, build the embedding dict
cols_cat = [i for i in df_Xtrain.columns if i.endswith('_cat')]
# print(cols_cat)

df_emb = df_Xtrain[cols_cat].nunique().loc[lambda x: x>2].to_frame('nunique')
# df_emb
In [17]:
df_emb['nunique'].values
Out[17]:
array([  5,   3,   8,  13,   3,   3,  10,   3,  18,   3,   6,   3, 104])
In [18]:
# (nunique, embedding_dim) pairs: dims hand-picked, roughly half the
# cardinality and capped at 10
df_emb['size'] = [ (5,3),(3,2),(8,5),(13,7), (3,2),
                (3,2), (10,5), (3,2), (18,8), (3,2),
                (6,3), (3,2), (104,10) ]

df_emb
Out[18]:
nunique size
ps_ind_02_cat 5 (5, 3)
ps_ind_04_cat 3 (3, 2)
ps_ind_05_cat 8 (8, 5)
ps_car_01_cat 13 (13, 7)
ps_car_02_cat 3 (3, 2)
ps_car_03_cat 3 (3, 2)
ps_car_04_cat 10 (10, 5)
ps_car_05_cat 3 (3, 2)
ps_car_06_cat 18 (18, 8)
ps_car_07_cat 3 (3, 2)
ps_car_09_cat 6 (6, 3)
ps_car_10_cat 3 (3, 2)
ps_car_11_cat 104 (104, 10)
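The hand-picked sizes above roughly follow a common rule of thumb: embedding dimension about half the cardinality, capped around 10. A sketch of that heuristic (the exact rule is an assumption reverse-engineered from the table, and it does not reproduce every entry):

import math

# heuristic sketch: dim = ceil(nunique / 2), capped at 10
heuristic = {n: (n, min(10, math.ceil(n / 2))) for n in df_emb['nunique']}
print(heuristic)  # matches most hand-picked sizes, e.g. (13, 7), (104, 10)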
In [19]:
dict_emb = df_emb['size'].to_dict()
dict_emb
Out[19]:
{'ps_car_01_cat': (13, 7),
 'ps_car_02_cat': (3, 2),
 'ps_car_03_cat': (3, 2),
 'ps_car_04_cat': (10, 5),
 'ps_car_05_cat': (3, 2),
 'ps_car_06_cat': (18, 8),
 'ps_car_07_cat': (3, 2),
 'ps_car_09_cat': (6, 3),
 'ps_car_10_cat': (3, 2),
 'ps_car_11_cat': (104, 10),
 'ps_ind_02_cat': (5, 3),
 'ps_ind_04_cat': (3, 2),
 'ps_ind_05_cat': (8, 5)}
In [20]:
def build_embedding_network():
    """Build the entity-embedding network.

    Uses the global
    ---------------
    dict_emb: embedding dict, e.g. {'mycol': (10, 8), 'mycol2': (3, 2)}
              means mycol originally has 10 unique categorical values
              and we embed them in an 8-dimensional space.

    Usage
    -----
    NN = build_embedding_network()
    NN.fit(proc_Xtr, ser_ytr.values)

    """
    inputs = []
    embeddings = []

    for key in dict_emb.keys():
        inp = Input(shape=(1,))        # one integer-encoded category
        x, y = dict_emb[key]           # (n_categories, embedding_dim)
        embedding = Embedding(x, y, input_length=1)(inp)
        embedding = Reshape(target_shape=(y,))(embedding)
        inputs.append(inp)
        embeddings.append(embedding)

    # 24 numeric (non-embedded) columns remain after dropping id,
    # the calc features, and the 13 embedded categorical columns
    input_numeric = Input(shape=(24,))
    embedding_numeric = Dense(16)(input_numeric)
    inputs.append(input_numeric)
    embeddings.append(embedding_numeric)

    x = Concatenate()(embeddings)
    x = Dense(80, activation='relu')(x)
    x = Dropout(.35)(x)
    x = Dense(20, activation='relu')(x)
    x = Dropout(.15)(x)
    x = Dense(10, activation='relu')(x)
    x = Dropout(.15)(x)
    output = Dense(1, activation='sigmoid')(x)

    model = Model(inputs, output)

    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model
In [21]:
# converting data to list format to match the network structure
def preproc(df_Xtr, df_Xvd, df_Xtx):
    """Preprocessing data for neural network fitting.

    Parameters
    -----------
    df_Xtr: training dataframe
    df_Xvd: validation dataframe
    df_Xtx: test dataframe
    dict_emb: embedding dict eg. {'mycol': (10,8), 'mycols2': (3,2)} 
              mycol has originally 8 unique categorical values but
              we want to embed 8 dimensional space.

    Usage
    -----
    proc_Xtr, proc_Xvd, proc_Xtx = preproc(df_Xtr,df_Xvd, df_Xtx)
    NN = build_embedding_network()
    NN.fit(proc_Xtr, ser_ytr.values)

    """
    input_list_train = []
    input_list_val = []
    input_list_test = []
 
    # the columns to be embedded: integer-encode to the range [0, n_values)
    for c in dict_emb.keys():
        raw_vals = np.unique(df_Xtr[c])
        val_map = {val: i for i, val in enumerate(raw_vals)}
        input_list_train.append(df_Xtr[c].map(val_map).values)
        # categories unseen in training map to NaN; fill with 0
        input_list_val.append(df_Xvd[c].map(val_map).fillna(0).values)
        input_list_test.append(df_Xtx[c].map(val_map).fillna(0).values)

    # the remaining (non-embedded) columns, passed through as one block
    other_cols = [c for c in df_Xtr.columns if c not in dict_emb]
    input_list_train.append(df_Xtr[other_cols].values)
    input_list_val.append(df_Xvd[other_cols].values)
    input_list_test.append(df_Xtx[other_cols].values)
    
    return input_list_train, input_list_val, input_list_test 
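One detail worth noting in preproc: categories that appear in the validation or test data but not in training map to NaN and are then filled with 0, i.e., collapsed into the first training category. A toy sketch of that behavior (hypothetical values):

# toy sketch: the unseen category 99 maps to NaN, then fillna(0) -> index 0
s_tr = pd.Series([10, 20, 20, 30])
val_map = {v: i for i, v in enumerate(np.unique(s_tr))}  # {10: 0, 20: 1, 30: 2}
s_vd = pd.Series([20, 99])
print(s_vd.map(val_map).fillna(0).values)  # [1. 0.]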
In [28]:
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout
from keras.layers.embeddings import Embedding

from sklearn.model_selection import StratifiedKFold

from lrcurve import KerasLearningCurve
In [30]:
K = 5 # number of folds
runs_per_fold = 3
n_epochs = 15 # increase for a real run, e.g. 100
cv_ginis = []
trprobs = np.zeros(np.shape(df_Xtrain)[0]) # Ntrain rows (full validation set)
txprobs = np.zeros((np.shape(df_Xtest)[0],K)) # Ntest rows, K columns

skf = StratifiedKFold(n_splits=K, random_state=SEED, shuffle=True) 

time_start = time.time()
for i, (idx_tr, idx_vd) in enumerate(skf.split(df_Xtrain.to_numpy(),
                                               ser_ytrain.to_numpy())):
    # print
    print( "\nFold ", i)

    # data for this fold
    df_Xtr = df_Xtrain.iloc[idx_tr,:].copy()
    ser_ytr = ser_ytrain.iloc[idx_tr].copy()
    df_Xvd = df_Xtrain.iloc[idx_vd,:].copy()
    ser_yvd = ser_ytrain.iloc[idx_vd].copy()
    df_Xtx = df_Xtest.copy()

    # upsampling 
    pos = (ser_ytr == 1)
    
    # add positive examples
    df_Xtr  = pd.concat([df_Xtr, df_Xtr.loc[pos]], axis=0)
    ser_ytr = pd.concat([ser_ytr, ser_ytr.loc[pos]], axis=0)
    
    # shuffle data
    idx = np.arange(len(df_Xtr))
    np.random.shuffle(idx)
    df_Xtr = df_Xtr.iloc[idx]
    ser_ytr = ser_ytr.iloc[idx]

    # preprocessing
    proc_Xtr, proc_Xvd, proc_Xtx = preproc(df_Xtr,df_Xvd, df_Xtx)

    # out-of-fold predictions for this fold, averaged over the runs
    vdprobs = 0  # must be initialized before the inner loop

    for j in range(runs_per_fold):
    
        NN = build_embedding_network()
        
        NN.fit(proc_Xtr, ser_ytr.values,
               validation_data = (proc_Xvd, ser_yvd.to_numpy()),
               callbacks=[KerasLearningCurve()],
               epochs=n_epochs,batch_size=4096, verbose=0)

        vdprobs += NN.predict(proc_Xvd)[:,0] / runs_per_fold
        txprobs[:,i] += NN.predict(proc_Xtx)[:,0] / runs_per_fold

    trprobs[idx_vd] += vdprobs # the oof predictions jointly cover the full training set
    cv_gini = gini_normalizedc(ser_yvd.values, vdprobs)
    cv_ginis.append(cv_gini)

    print(f'\n  cv gini: {cv_gini:.5f}')

    # clean memory
    del df_Xtr, df_Xvd, df_Xtx, proc_Xtr, proc_Xvd, proc_Xtx

    # time taken
    time_taken = time.time() - time_start
    h,m = divmod(time_taken,60*60)
    print('  Time taken        : {:.0f} hr '\
        '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))

# outside the loop
txprobs = np.mean(txprobs, axis=1)

print('Mean out of fold gini: %.5f' % np.mean(cv_ginis))
print('Full validation gini: %.5f' % gini_normalizedc(ser_ytrain.values,
                                                      trprobs))
Mean out of fold gini: 0.27109
Full validation gini: 0.27008
In [ ]:
# df_sub = pd.DataFrame({'id': test_id['id'].values, 'target': txprobs})
# df_sub.to_csv('NN_EntityEmbed_10fold-sub.csv', index=False)