In this dataset, you are provided with over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable for each Id in the test set. "Response" is an ordinal measure of risk with 8 levels.
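Since Response is ordinal, an order-aware metric such as quadratic weighted kappa (the metric used in the original Kaggle competition for this dataset) is a natural fit: errors are penalized by the squared distance between the predicted and true level. A minimal sketch with scikit-learn, using made-up toy labels on the 1-8 scale:

from sklearn.metrics import cohen_kappa_score

# toy labels for illustration only (not from the dataset)
y_true = [8, 4, 8, 8, 5, 1]
y_pred = [8, 5, 7, 8, 5, 2]

# quadratic weights: predicting 7 for a true 8 costs far less than predicting 1
print(cohen_kappa_score(y_true, y_pred, weights='quadratic'))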

File descriptions

  • train.csv - the training set; contains the Response values
  • test.csv - the test set; you must predict the Response variable for every row in this file
  • sample_submission.csv - a sample submission file in the correct format

Data fields

  • Id - a unique identifier associated with an application
  • Product_Info_1-7 - a set of normalized variables relating to the product applied for
  • Ins_Age - normalized age of applicant
  • Ht - normalized height of applicant
  • Wt - normalized weight of applicant
  • BMI - normalized BMI of applicant
  • Employment_Info_1-6 - a set of normalized variables relating to the employment history of the applicant
  • InsuredInfo_1-7 - a set of normalized variables providing information about the applicant
  • Insurance_History_1-9 - a set of normalized variables relating to the insurance history of the applicant
  • Family_Hist_1-5 - a set of normalized variables relating to the family history of the applicant
  • Medical_History_1-41 - a set of normalized variables relating to the medical history of the applicant
  • Medical_Keyword_1-48 - a set of dummy variables indicating the presence or absence of a medical keyword associated with the application
  • Response - the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.
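
These groupings are used implicitly throughout the EDA below; a minimal sketch that transcribes them into reusable Python lists (names taken from the lists above), which makes later encoding and imputation steps easier:

# column groups transcribed from the variable-type lists above
cols_continuous = ['Product_Info_4', 'Ins_Age', 'Ht', 'Wt', 'BMI',
                   'Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6',
                   'Insurance_History_5', 'Family_Hist_2', 'Family_Hist_3',
                   'Family_Hist_4', 'Family_Hist_5']
cols_discrete = ['Medical_History_' + str(i) for i in (1, 10, 15, 24, 32)]
cols_keyword = ['Medical_Keyword_' + str(i) for i in range(1, 49)]
cols_categorical = (['Product_Info_' + str(i) for i in (1, 2, 3, 5, 6, 7)]
                    + ['Employment_Info_' + str(i) for i in (2, 3, 5)]
                    + ['InsuredInfo_' + str(i) for i in range(1, 8)]
                    + ['Insurance_History_' + str(i) for i in (1, 2, 3, 4, 7, 8, 9)]
                    + ['Family_Hist_1']
                    + ['Medical_History_' + str(i) for i in range(2, 42)
                       if i not in (10, 15, 24, 32)])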

Imports

In [1]:
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True, font_scale=1.5)

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline


# random state
SEED = 100
time_start_notebook = time.time()

# Jupyter notebook settings for pandas
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 50)

import scipy
from scipy import stats
import IPython
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
In [2]:
# Google colab
In [3]:
%%capture
# %%capture suppresses this cell's output in the notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    ### mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    ### data dir
    dat_dir = 'drive/My Drive/Colab Notebooks/data/'

    ### image dir
    img_dir = 'drive/My Drive/Colab Notebooks/images/'
    os.makedirs(img_dir, exist_ok=True)

    ### output dir
    out_dir = 'drive/My Drive/Colab Notebooks/outputs/'
    os.makedirs(out_dir, exist_ok=True)

    ### also install my custom module
    # only module_dir belongs on sys.path; data/image/output dirs are not importable
    module_dir = 'drive/My Drive/Colab Notebooks/Bhishan_Modules/'
    sys.path.append(module_dir)
    # !cd runs in a throwaway subshell and does not persist; use %cd instead,
    # and quote the path since it contains spaces
    %cd "drive/My Drive/Colab Notebooks/Bhishan_Modules/"
    !pip install -e bhishan
    %cd -

    # pandas profiling
    !pip install -U pandas-profiling  # upgrading requires a runtime restart
    import pandas_profiling

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.
In [4]:
import sklearn
import pandas_profiling
In [7]:
import bhishan
from bhishan import bp

print(bhishan.__version__)
0.3.1
In [6]:
%load_ext watermark
%watermark -a "Bhishan Poudel" -dvm
%watermark -iv
Bhishan Poudel 2020-06-22 

CPython 3.7.7
IPython 7.13.0

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
seaborn          0.10.1
numpy            1.18.4
pandas           1.0.3
pandas_profiling 2.8.0
json             2.0.9
matplotlib       3.2.1
IPython          7.13.0
sklearn          0.23.1
scipy            1.4.1
autopep8         1.5.2

In [8]:
%load_ext autoreload
In [9]:
%autoreload 2

Load the data

In [10]:
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/Prudential_Insurance/raw/train.csv.zip?raw=true',compression='zip')
print(df.shape)
df.head()
(59381, 128)
Out[10]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht Wt BMI Employment_Info_1 Employment_Info_2 Employment_Info_3 Employment_Info_4 Employment_Info_5 Employment_Info_6 InsuredInfo_1 InsuredInfo_2 InsuredInfo_3 InsuredInfo_4 InsuredInfo_5 InsuredInfo_6 InsuredInfo_7 Insurance_History_1 Insurance_History_2 Insurance_History_3 Insurance_History_4 Insurance_History_5 Insurance_History_7 Insurance_History_8 Insurance_History_9 Family_Hist_1 Family_Hist_2 Family_Hist_3 Family_Hist_4 Family_Hist_5 Medical_History_1 Medical_History_2 Medical_History_3 Medical_History_4 Medical_History_5 Medical_History_6 Medical_History_7 Medical_History_8 Medical_History_9 Medical_History_10 Medical_History_11 Medical_History_12 Medical_History_13 Medical_History_14 Medical_History_15 Medical_History_16 Medical_History_17 Medical_History_18 Medical_History_19 Medical_History_20 Medical_History_21 Medical_History_22 Medical_History_23 Medical_History_24 Medical_History_25 Medical_History_26 Medical_History_27 Medical_History_28 Medical_History_29 Medical_History_30 Medical_History_31 Medical_History_32 Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41 Medical_Keyword_1 Medical_Keyword_2 Medical_Keyword_3 Medical_Keyword_4 Medical_Keyword_5 Medical_Keyword_6 Medical_Keyword_7 Medical_Keyword_8 Medical_Keyword_9 Medical_Keyword_10 Medical_Keyword_11 Medical_Keyword_12 Medical_Keyword_13 Medical_Keyword_14 Medical_Keyword_15 Medical_Keyword_16 Medical_Keyword_17 Medical_Keyword_18 Medical_Keyword_19 Medical_Keyword_20 Medical_Keyword_21 Medical_Keyword_22 Medical_Keyword_23 Medical_Keyword_24 Medical_Keyword_25 Medical_Keyword_26 Medical_Keyword_27 Medical_Keyword_28 Medical_Keyword_29 Medical_Keyword_30 Medical_Keyword_31 Medical_Keyword_32 Medical_Keyword_33 Medical_Keyword_34 Medical_Keyword_35 Medical_Keyword_36 Medical_Keyword_37 Medical_Keyword_38 Medical_Keyword_39 Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
0 2 1 D3 10 0.076923 2 1 1 0.641791 0.581818 0.148536 0.323008 0.028 12 1 0.0 3 NaN 1 2 6 3 1 2 1 1 1 3 1 0.000667 1 1 2 2 NaN 0.598039 NaN 0.526786 4.0 112 2 1 1 3 2 2 1 NaN 3 2 3 3 240.0 3 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 1 3 1 2 2 1 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
1 5 1 A1 26 0.076923 2 3 1 0.059701 0.600000 0.131799 0.272288 0.000 1 3 0.0 2 0.0018 1 2 6 3 1 2 1 2 1 3 1 0.000133 1 3 2 2 0.188406 NaN 0.084507 NaN 5.0 412 2 1 1 3 2 2 1 NaN 3 2 3 3 0.0 1 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 3 1 1 2 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
2 6 1 E1 26 0.076923 2 3 1 0.029851 0.745455 0.288703 0.428780 0.030 9 1 0.0 2 0.0300 1 2 8 3 1 1 1 2 1 1 3 NaN 3 2 3 3 0.304348 NaN 0.225352 NaN 10.0 3 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 1 2 3 NaN 2 2 3 1 3 2 3 NaN 3 3 1 3 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
3 7 1 D4 10 0.487179 2 3 1 0.164179 0.672727 0.205021 0.352438 0.042 9 1 0.0 3 0.2000 2 2 8 3 1 2 1 2 1 1 3 NaN 3 2 3 3 0.420290 NaN 0.352113 NaN 0.0 350 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 2 2 3 NaN 1 3 3 1 3 2 3 NaN 3 3 1 2 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
4 8 1 D2 26 0.230769 2 3 1 0.417910 0.654545 0.234310 0.424046 0.027 9 1 0.0 2 0.0500 1 2 6 3 1 2 1 2 1 1 3 NaN 3 2 3 2 0.463768 NaN 0.408451 NaN NaN 162 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 1 2 3 NaN 2 2 3 1 3 2 3 NaN 3 3 1 3 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
In [11]:
target = 'Response'

Miscellaneous Features

In [12]:
df_misc = df[['Ins_Age','Ht','Wt','BMI']]
print(df_misc.shape)
df_misc.head()
(59381, 4)
Out[12]:
Ins_Age Ht Wt BMI
0 0.641791 0.581818 0.148536 0.323008
1 0.059701 0.600000 0.131799 0.272288
2 0.029851 0.745455 0.288703 0.428780
3 0.164179 0.672727 0.205021 0.352438
4 0.417910 0.654545 0.234310 0.424046
In [13]:
df_misc.bp.get_column_descriptions()
Out[13]:
Feature Type Count Unique Missing MissingPct Zeros ZerosPct count mean std min 25% 50% 75% max
0 Ins_Age float64 59381 65 0 0.000000 287 0.480000 59381.000000 0.405567 0.197190 0.000000 0.238806 0.402985 0.567164 1.000000
1 Ht float64 59381 39 0 0.000000 1 0.000000 59381.000000 0.707283 0.074239 0.000000 0.654545 0.709091 0.763636 1.000000
2 Wt float64 59381 300 0 0.000000 1 0.000000 59381.000000 0.292587 0.089037 0.000000 0.225941 0.288703 0.345188 1.000000
3 BMI float64 59381 3256 0 0.000000 1 0.000000 59381.000000 0.469462 0.122213 0.000000 0.385517 0.451349 0.532858 1.000000
In [14]:
df_misc.bp.missing()
Number of missing values features: 0
cols_missing_high = []
cols_missing_low = []
Out[14]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
In [15]:
df_misc.bp.plot_corr()
In [16]:
sns.scatterplot(x='Ins_Age', y='BMI', hue=target, data=df);
In [17]:
df_misc.merge(df[target],left_index=True,right_index=True).bp.plot_corr()
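
Since df_misc is a column slice of df and shares its index, the index-aligned merge above can be written as a plain join; a minimal equivalent sketch:

df_misc.join(df[target])  # same frame as merge(..., left_index=True, right_index=True)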

Product Features

In [18]:
df_product = df[df.columns[df.columns.str.startswith('Product')]]
print(df_product.shape)
df_product.head()
(59381, 7)
Out[18]:
Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7
0 1 D3 10 0.076923 2 1 1
1 1 A1 26 0.076923 2 3 1
2 1 E1 26 0.076923 2 3 1
3 1 D4 10 0.487179 2 3 1
4 1 D2 26 0.230769 2 3 1
In [19]:
df_product.bp.get_column_descriptions(style=True)
Out[19]:
Feature Type Count Unique Missing MissingPct Zeros ZerosPct count mean std min 25% 50% 75% max
3 Product_Info_4 float64 59381 1491 0 0.000000 1036 1.740000 59381.000000 0.328952 0.282562 0.000000 0.076923 0.230769 0.487179 1.000000
0 Product_Info_1 int64 59381 2 0 0.000000 0 0.000000 59381.000000 1.026355 0.160191 1.000000 1.000000 1.000000 1.000000 2.000000
1 Product_Info_2 object 59381 19 0 0.000000 0 0.000000
2 Product_Info_3 int64 59381 34 0 0.000000 0 0.000000 59381.000000 24.415655 5.072885 1.000000 26.000000 26.000000 26.000000 38.000000
4 Product_Info_5 int64 59381 2 0 0.000000 0 0.000000 59381.000000 2.006955 0.083107 2.000000 2.000000 2.000000 2.000000 3.000000
5 Product_Info_6 int64 59381 2 0 0.000000 0 0.000000 59381.000000 2.673599 0.739103 1.000000 3.000000 3.000000 3.000000 3.000000
6 Product_Info_7 int64 59381 3 0 0.000000 0 0.000000 59381.000000 1.043583 0.291949 1.000000 1.000000 1.000000 1.000000 3.000000
In [20]:
df_product.bp.missing()
Number of missing values features: 0
cols_missing_high = []
cols_missing_low = []
Out[20]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
In [21]:
df_product.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)

Employment Features

In [22]:
df_employment = df[df.columns[df.columns.str.startswith('Employment')]]
print(df_employment.shape)
df_employment.head()
(59381, 6)
Out[22]:
Employment_Info_1 Employment_Info_2 Employment_Info_3 Employment_Info_4 Employment_Info_5 Employment_Info_6
0 0.028 12 1 0.0 3 NaN
1 0.000 1 3 0.0 2 0.0018
2 0.030 9 1 0.0 2 0.0300
3 0.042 9 1 0.0 3 0.2000
4 0.027 9 1 0.0 2 0.0500
In [23]:
df_employment.bp.get_column_descriptions()
Out[23]:
Feature Type Count Unique Missing MissingPct Zeros ZerosPct count mean std min 25% 50% 75% max
5 Employment_Info_6 float64 59381 992 10854 18.280000 4042 6.810000 48527.000000 0.361469 0.349551 0.000000 0.060000 0.250000 0.550000 1.000000
3 Employment_Info_4 float64 59381 871 6779 11.420000 44659 75.210000 52602.000000 0.006283 0.032816 0.000000 0.000000 0.000000 0.000000 1.000000
0 Employment_Info_1 float64 59381 1936 19 0.030000 3688 6.210000 59362.000000 0.077582 0.082347 0.000000 0.035000 0.060000 0.100000 1.000000
1 Employment_Info_2 int64 59381 36 0 0.000000 0 0.000000 59381.000000 8.641821 4.227082 1.000000 9.000000 9.000000 9.000000 38.000000
2 Employment_Info_3 int64 59381 2 0 0.000000 0 0.000000 59381.000000 1.300904 0.715034 1.000000 1.000000 1.000000 1.000000 3.000000
4 Employment_Info_5 int64 59381 2 0 0.000000 0 0.000000 59381.000000 2.142958 0.350033 2.000000 2.000000 2.000000 2.000000 3.000000
In [24]:
df_employment.bp.missing()
Number of missing values features: 3
cols_missing_high = []
cols_missing_low = ['Employment_Info_6', 'Employment_Info_4', 'Employment_Info_1']
Out[24]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
5 Employment_Info_6 float64 59381 10854 4042 992 18.280000 6.810000 48527.000000 0.361469 0.349551 0.000000 0.060000 0.250000 0.550000 1.000000
3 Employment_Info_4 float64 59381 6779 44659 871 11.420000 75.210000 52602.000000 0.006283 0.032816 0.000000 0.000000 0.000000 0.000000 1.000000
0 Employment_Info_1 float64 59381 19 3688 1936 0.030000 6.210000 59362.000000 0.077582 0.082347 0.000000 0.035000 0.060000 0.100000 1.000000
In [25]:
df_employment.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)

Insurance Features

In [26]:
df_insurance = df[df.columns[df.columns.str.startswith('Insurance')]]
print(df_insurance.shape)
df_insurance.head()
(59381, 8)
Out[26]:
Insurance_History_1 Insurance_History_2 Insurance_History_3 Insurance_History_4 Insurance_History_5 Insurance_History_7 Insurance_History_8 Insurance_History_9
0 1 1 3 1 0.000667 1 1 2
1 2 1 3 1 0.000133 1 3 2
2 2 1 1 3 NaN 3 2 3
3 2 1 1 3 NaN 3 2 3
4 2 1 1 3 NaN 3 2 3
In [27]:
df_insurance.bp.get_column_descriptions()
Out[27]:
Feature Type Count Unique Missing MissingPct Zeros ZerosPct count mean std min 25% 50% 75% max
4 Insurance_History_5 float64 59381 2265 25396 42.770000 19 0.030000 33985.000000 0.001733 0.007338 0.000000 0.000400 0.000973 0.002000 1.000000
0 Insurance_History_1 int64 59381 2 0 0.000000 0 0.000000 59381.000000 1.727606 0.445195 1.000000 1.000000 2.000000 2.000000 2.000000
1 Insurance_History_2 int64 59381 3 0 0.000000 0 0.000000 59381.000000 1.055792 0.329328 1.000000 1.000000 1.000000 1.000000 3.000000
2 Insurance_History_3 int64 59381 3 0 0.000000 0 0.000000 59381.000000 2.146983 0.989139 1.000000 1.000000 3.000000 3.000000 3.000000
3 Insurance_History_4 int64 59381 3 0 0.000000 0 0.000000 59381.000000 1.958707 0.945739 1.000000 1.000000 2.000000 3.000000 3.000000
5 Insurance_History_7 int64 59381 3 0 0.000000 0 0.000000 59381.000000 1.901989 0.971223 1.000000 1.000000 1.000000 3.000000 3.000000
6 Insurance_History_8 int64 59381 3 0 0.000000 0 0.000000 59381.000000 2.048484 0.755149 1.000000 1.000000 2.000000 3.000000 3.000000
7 Insurance_History_9 int64 59381 3 0 0.000000 0 0.000000 59381.000000 2.419360 0.509577 1.000000 2.000000 2.000000 3.000000 3.000000
In [28]:
df_insurance.bp.missing()
Number of missing values features: 1
cols_missing_high = []
cols_missing_low = ['Insurance_History_5']
Out[28]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
4 Insurance_History_5 float64 59381 25396 19 2265 42.770000 0.030000 33985.000000 0.001733 0.007338 0.000000 0.000400 0.000973 0.002000 1.000000
In [29]:
df_insurance.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)

Family Features

In [30]:
df_family = df[df.columns[df.columns.str.startswith('Family')]]
print(df_family.shape)
df_family.head()
(59381, 5)
Out[30]:
Family_Hist_1 Family_Hist_2 Family_Hist_3 Family_Hist_4 Family_Hist_5
0 2 NaN 0.598039 NaN 0.526786
1 2 0.188406 NaN 0.084507 NaN
2 3 0.304348 NaN 0.225352 NaN
3 3 0.420290 NaN 0.352113 NaN
4 2 0.463768 NaN 0.408451 NaN
In [31]:
df_family.bp.get_column_descriptions()
Out[31]:
Feature Type Count Unique Missing MissingPct Zeros ZerosPct count mean std min 25% 50% 75% max
4 Family_Hist_5 float64 59381 90 41811 70.410000 3 0.010000 17570.000000 0.484635 0.129200 0.000000 0.401786 0.508929 0.580357 1.000000
2 Family_Hist_3 float64 59381 90 34241 57.660000 4 0.010000 25140.000000 0.497737 0.140187 0.000000 0.401961 0.519608 0.598039 1.000000
1 Family_Hist_2 float64 59381 68 28656 48.260000 1 0.000000 30725.000000 0.474550 0.154959 0.000000 0.362319 0.463768 0.579710 1.000000
3 Family_Hist_4 float64 59381 68 19184 32.310000 9 0.020000 40197.000000 0.444890 0.163012 0.000000 0.323944 0.422535 0.563380 0.943662
0 Family_Hist_1 int64 59381 3 0 0.000000 0 0.000000 59381.000000 2.686230 0.483159 1.000000 2.000000 3.000000 3.000000 3.000000
In [32]:
df_family.bp.missing()
Number of missing values features: 4
cols_missing_high = []
cols_missing_low = ['Family_Hist_5', 'Family_Hist_3', 'Family_Hist_2', 'Family_Hist_4']
Out[32]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
4 Family_Hist_5 float64 59381 41811 3 90 70.410000 0.010000 17570.000000 0.484635 0.129200 0.000000 0.401786 0.508929 0.580357 1.000000
2 Family_Hist_3 float64 59381 34241 4 90 57.660000 0.010000 25140.000000 0.497737 0.140187 0.000000 0.401961 0.519608 0.598039 1.000000
1 Family_Hist_2 float64 59381 28656 1 68 48.260000 0.000000 30725.000000 0.474550 0.154959 0.000000 0.362319 0.463768 0.579710 1.000000
3 Family_Hist_4 float64 59381 19184 9 68 32.310000 0.020000 40197.000000 0.444890 0.163012 0.000000 0.323944 0.422535 0.563380 0.943662
In [33]:
df_family.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)

Medical Features

In [34]:
df_medical = df[df.columns[df.columns.str.startswith('Medical')]]
print(df_medical.shape)
df_medical.head()
(59381, 89)
Out[34]:
Medical_History_1 Medical_History_2 Medical_History_3 Medical_History_4 Medical_History_5 Medical_History_6 Medical_History_7 Medical_History_8 Medical_History_9 Medical_History_10 Medical_History_11 Medical_History_12 Medical_History_13 Medical_History_14 Medical_History_15 Medical_History_16 Medical_History_17 Medical_History_18 Medical_History_19 Medical_History_20 Medical_History_21 Medical_History_22 Medical_History_23 Medical_History_24 Medical_History_25 Medical_History_26 Medical_History_27 Medical_History_28 Medical_History_29 Medical_History_30 Medical_History_31 Medical_History_32 Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41 Medical_Keyword_1 Medical_Keyword_2 Medical_Keyword_3 Medical_Keyword_4 Medical_Keyword_5 Medical_Keyword_6 Medical_Keyword_7 Medical_Keyword_8 Medical_Keyword_9 Medical_Keyword_10 Medical_Keyword_11 Medical_Keyword_12 Medical_Keyword_13 Medical_Keyword_14 Medical_Keyword_15 Medical_Keyword_16 Medical_Keyword_17 Medical_Keyword_18 Medical_Keyword_19 Medical_Keyword_20 Medical_Keyword_21 Medical_Keyword_22 Medical_Keyword_23 Medical_Keyword_24 Medical_Keyword_25 Medical_Keyword_26 Medical_Keyword_27 Medical_Keyword_28 Medical_Keyword_29 Medical_Keyword_30 Medical_Keyword_31 Medical_Keyword_32 Medical_Keyword_33 Medical_Keyword_34 Medical_Keyword_35 Medical_Keyword_36 Medical_Keyword_37 Medical_Keyword_38 Medical_Keyword_39 Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48
0 4.0 112 2 1 1 3 2 2 1 NaN 3 2 3 3 240.0 3 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 1 3 1 2 2 1 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 5.0 412 2 1 1 3 2 2 1 NaN 3 2 3 3 0.0 1 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 3 1 1 2 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 10.0 3 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 1 2 3 NaN 2 2 3 1 3 2 3 NaN 3 3 1 3 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0.0 350 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 2 2 3 NaN 1 3 3 1 3 2 3 NaN 3 3 1 2 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 NaN 162 2 2 1 3 2 2 2 NaN 3 2 3 3 NaN 1 3 1 1 2 1 2 3 NaN 2 2 3 1 3 2 3 NaN 3 3 1 3 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [35]:
# df_medical.bp.get_column_descriptions()
In [36]:
df_medical.bp.missing()
Number of missing values features: 5
cols_missing_high = ['Medical_History_10', 'Medical_History_32', 'Medical_History_24']
cols_missing_low = ['Medical_History_15', 'Medical_History_1']
Out[36]:
Feature Type Count Missing Zeros Unique MissingPct ZerosPct count mean std min 25% 50% 75% max
9 Medical_History_10 float64 59381 58824 75 103 99.060000 0.130000 557.000000 141.118492 107.759559 0.000000 8.000000 229.000000 240.000000 240.000000
31 Medical_History_32 float64 59381 58274 744 95 98.140000 1.250000 1107.000000 11.965673 38.718774 0.000000 0.000000 0.000000 2.000000 240.000000
23 Medical_History_24 float64 59381 55580 769 227 93.600000 1.300000 3801.000000 50.635622 78.149069 0.000000 1.000000 8.000000 64.000000 240.000000
14 Medical_History_15 float64 59381 44596 2135 241 75.100000 3.600000 14785.000000 123.760974 98.516206 0.000000 17.000000 117.000000 240.000000 240.000000
0 Medical_History_1 float64 59381 8889 4789 171 14.970000 8.060000 50492.000000 7.962172 13.027697 0.000000 2.000000 4.000000 9.000000 240.000000
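
Medical_History_10, _32 and _24 are missing for over 90% of applicants, so dropping rows is not viable. One common treatment is a median fill plus an explicit missingness indicator; a minimal sketch, shown only as an illustrative option (it is not a step this notebook takes):

df_medical_imp = df_medical.copy()  # work on a copy to avoid mutating the df slice
for col in ['Medical_History_1', 'Medical_History_10', 'Medical_History_15',
            'Medical_History_24', 'Medical_History_32']:
    # keep the missingness signal as its own dummy column
    df_medical_imp[col + '_was_missing'] = df_medical_imp[col].isna().astype(int)
    df_medical_imp[col] = df_medical_imp[col].fillna(df_medical_imp[col].median())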
In [37]:
df_medical.merge(df[target],left_index=True,right_index=True).bp.plotly_corr(target)
In [38]:
df_medical.merge(df[target],left_index=True,right_index=True
                 ).bp.plot_corr_sns(target=target)

Class Balance

In [39]:
sns.catplot(x='Response', kind='count', data=df)
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x7f816e078a10>
In [40]:
df.bp.plot_cat('Response')
In [41]:
df.bp.plot_pareto('Response')
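
The plots above give the visual picture; for the exact class proportions, a one-line plain-pandas check (independent of the bp helpers):

df['Response'].value_counts(normalize=True).sort_index()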

Dataframe Statistics Profiles

We can get an overview of the data (histograms, missing values, correlations, skewness) using the pandas_profiling module. Bear in mind that it may take a long time to produce results, so generating each report only once and saving it to an output file is a good way to go.

In [42]:
import pandas_profiling
In [43]:
if ENV_COLAB:
    out_dir = 'drive/My Drive/Colab Notebooks/outputs'
else:
    out_dir = '../outputs'
    
out_dir = out_dir + '/Prudential/'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

out_dir
In [45]:
df.head(2)
Out[45]:
Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht Wt BMI Employment_Info_1 Employment_Info_2 Employment_Info_3 Employment_Info_4 Employment_Info_5 Employment_Info_6 InsuredInfo_1 InsuredInfo_2 InsuredInfo_3 InsuredInfo_4 InsuredInfo_5 InsuredInfo_6 InsuredInfo_7 Insurance_History_1 Insurance_History_2 Insurance_History_3 Insurance_History_4 Insurance_History_5 Insurance_History_7 Insurance_History_8 Insurance_History_9 Family_Hist_1 Family_Hist_2 Family_Hist_3 Family_Hist_4 Family_Hist_5 Medical_History_1 Medical_History_2 Medical_History_3 Medical_History_4 Medical_History_5 Medical_History_6 Medical_History_7 Medical_History_8 Medical_History_9 Medical_History_10 Medical_History_11 Medical_History_12 Medical_History_13 Medical_History_14 Medical_History_15 Medical_History_16 Medical_History_17 Medical_History_18 Medical_History_19 Medical_History_20 Medical_History_21 Medical_History_22 Medical_History_23 Medical_History_24 Medical_History_25 Medical_History_26 Medical_History_27 Medical_History_28 Medical_History_29 Medical_History_30 Medical_History_31 Medical_History_32 Medical_History_33 Medical_History_34 Medical_History_35 Medical_History_36 Medical_History_37 Medical_History_38 Medical_History_39 Medical_History_40 Medical_History_41 Medical_Keyword_1 Medical_Keyword_2 Medical_Keyword_3 Medical_Keyword_4 Medical_Keyword_5 Medical_Keyword_6 Medical_Keyword_7 Medical_Keyword_8 Medical_Keyword_9 Medical_Keyword_10 Medical_Keyword_11 Medical_Keyword_12 Medical_Keyword_13 Medical_Keyword_14 Medical_Keyword_15 Medical_Keyword_16 Medical_Keyword_17 Medical_Keyword_18 Medical_Keyword_19 Medical_Keyword_20 Medical_Keyword_21 Medical_Keyword_22 Medical_Keyword_23 Medical_Keyword_24 Medical_Keyword_25 Medical_Keyword_26 Medical_Keyword_27 Medical_Keyword_28 Medical_Keyword_29 Medical_Keyword_30 Medical_Keyword_31 Medical_Keyword_32 Medical_Keyword_33 Medical_Keyword_34 Medical_Keyword_35 Medical_Keyword_36 Medical_Keyword_37 Medical_Keyword_38 Medical_Keyword_39 Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response
0 2 1 D3 10 0.076923 2 1 1 0.641791 0.581818 0.148536 0.323008 0.028 12 1 0.0 3 NaN 1 2 6 3 1 2 1 1 1 3 1 0.000667 1 1 2 2 NaN 0.598039 NaN 0.526786 4.0 112 2 1 1 3 2 2 1 NaN 3 2 3 3 240.0 3 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 1 3 1 2 2 1 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
1 5 1 A1 26 0.076923 2 3 1 0.059701 0.600000 0.131799 0.272288 0.000 1 3 0.0 2 0.0018 1 2 6 3 1 2 1 2 1 3 1 0.000133 1 3 2 2 0.188406 NaN 0.084507 NaN 5.0 412 2 1 1 3 2 2 1 NaN 3 2 3 3 0.0 1 3 1 1 2 1 2 3 NaN 1 3 3 1 3 2 3 NaN 3 1 1 2 2 1 3 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
In [46]:
# profile = df_misc.profile_report(html={'style': {'full_width':True}})
# profile.to_file(out_dir + 'df_profile.html')
In [47]:
dfs = [df_misc, df_product, df_employment, df_insurance, df_medical]
ofiles = ['df_misc', 'df_product', 'df_employment','df_insurance', 'df_medical']
ofiles = [out_dir + i +'_profile.html' for i in ofiles]

ofiles[0]
Out[47]:
'../outputs/Prudential/df_misc_profile.html'
In [48]:
profile = df_misc.profile_report(html={'style': {'full_width':True}})
# profile
In [49]:
profile = df_misc.profile_report(explorative=True)
profile

Out[49]:
[interactive pandas-profiling report rendered inline; not captured in this export]
In [50]:
%%time
for dfx, ofile in zip(dfs, ofiles):
    dfx = dfx.merge(df[target], left_index=True, right_index=True)
    if not os.path.isfile(ofile):
        profile = dfx.profile_report(html={'style': {'full_width': True}})
        profile.to_file(ofile)
CPU times: user 21.7 ms, sys: 10.6 ms, total: 32.4 ms
Wall time: 36.1 ms

Time Taken

In [51]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 1 min 24 secs