In this dataset, you are provided over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable for each Id in the test set. "Response" is an ordinal measure of risk that has 8 levels.
Variable | Description
---------|------------
Id | A unique identifier associated with an application.
Product_Info_1-7 | A set of normalized variables relating to the product applied for.
Ins_Age | Normalized age of applicant.
Ht | Normalized height of applicant.
Wt | Normalized weight of applicant.
BMI | Normalized BMI of applicant.
Employment_Info_1-6 | A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-7 | A set of normalized variables providing information about the applicant.
Insurance_History_1-9 | A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5 | A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41 | A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48 | A set of dummy variables relating to the presence or absence of a medical keyword being associated with the application.
Response | The target variable, an ordinal measure relating to the final decision associated with an application.
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
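For later preprocessing it can help to hold these groups in Python lists. A minimal sketch (the names cols_cat, cols_cont, cols_disc, and cols_dummy are introduced here for illustration; their contents simply mirror the lists above):

# feature groups, transcribed from the data dictionary above
cols_cat = ([f'Product_Info_{i}' for i in (1, 2, 3, 5, 6, 7)]
            + [f'Employment_Info_{i}' for i in (2, 3, 5)]
            + [f'InsuredInfo_{i}' for i in range(1, 8)]
            + [f'Insurance_History_{i}' for i in (1, 2, 3, 4, 7, 8, 9)]
            + ['Family_Hist_1']
            + [f'Medical_History_{i}' for i in range(2, 42)
               if i not in (10, 15, 24, 32)])
cols_cont = (['Product_Info_4', 'Ins_Age', 'Ht', 'Wt', 'BMI',
              'Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6',
              'Insurance_History_5']
             + [f'Family_Hist_{i}' for i in range(2, 6)])
cols_disc = [f'Medical_History_{i}' for i in (1, 10, 15, 24, 32)]
cols_dummy = [f'Medical_Keyword_{i}' for i in range(1, 49)]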
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True, font_scale=1.5)
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# random state
SEED = 100
time_start_notebook = time.time()
# Jupyter notebook settings for pandas
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 50)
import scipy
from scipy import stats
import IPython
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Google colab
%%capture
# capture will not print in notebook
import os
import sys
ENV_COLAB = 'google.colab' in sys.modules
if ENV_COLAB:
    ### mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    ### data dir
    dat_dir = 'drive/My Drive/Colab Notebooks/data/'
    sys.path.append(dat_dir)

    ### image dir
    img_dir = 'drive/My Drive/Colab Notebooks/images/'
    if not os.path.isdir(img_dir): os.makedirs(img_dir)
    sys.path.append(img_dir)

    ### output dir
    out_dir = 'drive/My Drive/Colab Notebooks/outputs/'
    if not os.path.isdir(out_dir): os.makedirs(out_dir)
    sys.path.append(out_dir)

    ### also install my custom module
    module_dir = 'drive/My Drive/Colab Notebooks/Bhishan_Modules/'
    sys.path.append(module_dir)
    # a bare !cd runs in its own subshell and does not persist, so point
    # pip at the (space-containing, hence quoted) path directly
    !pip install -e "{module_dir}bhishan"

    # pandas profiling (upgrading requires a runtime restart)
    !pip install -U pandas-profiling
    import pandas_profiling

    print('Environment: Google Colaboratory.')
# NOTE: If we update modules in gcolab, we need to restart the runtime.
import sklearn
import pandas_profiling
import bhishan
from bhishan import bp
print(bhishan.__version__)
%load_ext watermark
%watermark -a "Bhishan Poudel" -dvm
%watermark -iv
%load_ext autoreload
%autoreload 2
df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/Prudential_Insurance/raw/train.csv.zip?raw=true',compression='zip')
print(df.shape)
df.head()
target = 'Response'
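As a quick sanity check (output not shown in the original), the eight ordinal levels of the target can be confirmed directly:

# counts per risk level, sorted by level; should show the 8 ordinal classes
print(df[target].value_counts().sort_index())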
df_misc = df[['Ins_Age','Ht','Wt','BMI']]
print(df_misc.shape)
df_misc.head()
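# The .bp accessor below comes from the custom bhishan module imported above.
# Judging from the method names (behavior assumed, not documented here):
# get_column_descriptions() summarizes each column, missing() tabulates
# null counts, and plot_corr() draws a correlation heatmap.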
df_misc.bp.get_column_descriptions()
df_misc.bp.missing()
df_misc.bp.plot_corr()
sns.scatterplot(x='Ins_Age', y='BMI', hue=target, data=df);
df_misc.merge(df[target],left_index=True,right_index=True).bp.plot_corr()
df_product = df[df.columns[df.columns.str.startswith('Product')]]
print(df_product.shape)
df_product.head()
df_product.bp.get_column_descriptions(style=True)
df_product.bp.missing()
df_product.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)
df_employment = df[df.columns[df.columns.str.startswith('Employment')]]
print(df_employment.shape)
df_employment.head()
df_employment.bp.get_column_descriptions()
df_employment.bp.missing()
df_employment.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)
df_insurance = df[df.columns[df.columns.str.startswith('Insurance')]]
print(df_insurance.shape)
df_insurance.head()
df_insurance.bp.get_column_descriptions()
df_insurance.bp.missing()
df_insurance.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)
df_family = df[df.columns[df.columns.str.startswith('Family')]]
print(df_family.shape)
df_family.head()
df_family.bp.get_column_descriptions()
df_family.bp.missing()
df_family.merge(df[target],left_index=True,right_index=True).bp.plot_corr(xrot=75)
df_medical = df[df.columns[df.columns.str.startswith('Medical')]]
print(df_medical.shape)
df_medical.head()
# df_medical.bp.get_column_descriptions()
df_medical.bp.missing()
df_medical.merge(df[target],left_index=True,right_index=True).bp.plotly_corr(target)
df_medical.merge(df[target], left_index=True, right_index=True).bp.plot_corr_sns(target=target)
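# Three views of the target distribution follow: a seaborn count plot,
# then the custom bp count plot and Pareto chart.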
sns.catplot(x='Response', kind='count', data=df)
df.bp.plot_cat('Response')
df.bp.plot_pareto('Response')
We can get an overview of the data, such as histograms, missing values, correlations, and skewness, using the pandas_profiling module. Bear in mind that it may take a long time to produce results, so a good approach, implemented below, is to save each report to an output file and generate it only once.
import pandas_profiling
if ENV_COLAB:
    out_dir = 'drive/My Drive/Colab Notebooks/outputs'
else:
    out_dir = '../outputs'
out_dir = out_dir + '/Prudential/'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
out_dir
df.head(2)
# profile = df_misc.profile_report(html={'style': {'full_width':True}})
# profile.to_file(out_dir + 'df_profile.html')
# profile every feature group analyzed above (including df_family)
dfs = [df_misc, df_product, df_employment, df_insurance, df_family, df_medical]
ofiles = ['df_misc', 'df_product', 'df_employment', 'df_insurance', 'df_family', 'df_medical']
ofiles = [out_dir + i +'_profile.html' for i in ofiles]
ofiles[0]
profile = df_misc.profile_report(html={'style': {'full_width':True}})
# profile
profile = df_misc.profile_report(explorative=True)
profile
%%time
# profile each feature group together with the target, but only if the
# report file does not already exist (profiling is slow)
for dfx, ofile in zip(dfs, ofiles):
    dfx = dfx.merge(df[target], left_index=True, right_index=True)
    if not os.path.isfile(ofile):
        profile = dfx.profile_report(html={'style': {'full_width': True}})
        profile.to_file(ofile)
time_taken = time.time() - time_start_notebook
h, rem = divmod(time_taken, 60*60)
m, s = divmod(rem, 60)
print('Time taken to run whole notebook: '
      '{:.0f} hr {:.0f} min {:.0f} secs'.format(h, m, s))