Project Introduction

Project : French Motor Claims
Author : Bhishan Poudel, Ph.D Physics
Goal : Implement frequency modelling, severity modelling and pure premium modelling
Tools : pandas, scikit-learn, xgboost, pygam

References:
- https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html

Project Notebooks

| Notebook | Rendered | Description | Author |
|---|---|---|---|
| a01_data_cleaning.ipynb | ipynb, rendered | ohe, kbin, logscaling | Bhishan Poudel |
| b01_freq_modelling.ipynb | ipynb, rendered | Poisson | Bhishan Poudel |
| b02_severity_modelling.ipynb | ipynb, rendered | Gamma | Bhishan Poudel |
| b03_pure_premium_modelling.ipynb | ipynb, rendered | Poisson*Gamma and Tweedie | Bhishan Poudel |
| b04_tweedie_vs_freqSev.ipynb | ipynb, rendered | comparison | Bhishan Poudel |
| b05_lorentz_curves_comparison.ipynb | ipynb, rendered | Lorenz curve | Bhishan Poudel |
| c01_xgboost_tweedie.ipynb | ipynb, rendered | 'objective':'reg:tweedie' | Bhishan Poudel |
| d01_gam_linear.ipynb | ipynb, rendered | n_splines=10, grid_search | Bhishan Poudel |
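How notebooks b01 to b03 relate can be sketched roughly as follows: a Poisson model for claim frequency and a Gamma model for claim severity are multiplied to give a pure premium, and a single Tweedie model is fitted to the pure premium directly. This is a minimal sketch, not the notebooks' actual code. The synthetic data, the `ClaimAmount` column name, `power=1.9` and `alpha=1e-4` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor

# Hypothetical synthetic portfolio standing in for the preprocessed data.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
exposure = rng.uniform(0.1, 1.0, size=n)
claim_nb = rng.poisson(exposure * np.exp(-1.0 + 0.3 * X[:, 0]))
# Total claim amount per policy; zero when there is no claim.
claim_amount = (claim_nb > 0) * rng.gamma(
    shape=2.0 * np.maximum(claim_nb, 1), scale=500.0 * np.exp(0.2 * X[:, 1])
)

df = pd.DataFrame({"Exposure": exposure, "ClaimNb": claim_nb, "ClaimAmount": claim_amount})
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.maximum(df["ClaimNb"], 1)
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

# Frequency: Poisson GLM on claims per unit exposure, weighted by exposure.
freq = PoissonRegressor(alpha=1e-4)
freq.fit(X, df["Frequency"].to_numpy(), sample_weight=df["Exposure"].to_numpy())

# Severity: Gamma GLM on policies with at least one claim, weighted by claim count.
mask = (df["ClaimNb"] > 0).to_numpy()
sev = GammaRegressor(alpha=1e-4)
sev.fit(X[mask], df["AvgClaimAmount"].to_numpy()[mask],
        sample_weight=df["ClaimNb"].to_numpy()[mask])

# Pure premium, two ways: product of the two models, or one Tweedie GLM.
pp_product = freq.predict(X) * sev.predict(X)
tweedie = TweedieRegressor(power=1.9, alpha=1e-4)
tweedie.fit(X, df["PurePremium"].to_numpy(), sample_weight=df["Exposure"].to_numpy())
pp_tweedie = tweedie.predict(X)

# Compare predicted and observed total claim amounts for the portfolio.
print("observed total      :", df["ClaimAmount"].sum())
print("Poisson*Gamma total :", (df["Exposure"] * pp_product).sum())
print("Tweedie total       :", (df["Exposure"] * pp_tweedie).sum())
```

Both approaches predict a pure premium per unit of exposure, so multiplying by `Exposure` puts the predictions back on the scale of observed claim amounts.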

Data

Data Cleaning

Only a subset of the available features is used for modelling. Each selected feature is preprocessed as follows:

one hot encoding = ["VehBrand", "VehPower", "VehGas", "Region", "Area"]
kbins discretizer = ["VehAge", "DrivAge"]
log and scaling = ["Density"]
pass through = ["BonusMalus"]
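The preprocessing listed above could be expressed as a scikit-learn `ColumnTransformer`, roughly as in the sketch below. The column groups come from the list. The toy DataFrame, `n_bins=3`, `handle_unknown="ignore"` and `sparse_threshold=0` are assumptions made so the example runs on its own, and the notebook may use different settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

# Hypothetical toy rows using the column names from the list above.
df_toy = pd.DataFrame({
    "VehBrand": ["B1", "B2", "B1", "B3", "B2", "B1", "B3", "B2"],
    "VehPower": [4, 5, 6, 4, 7, 5, 6, 4],
    "VehGas": ["Diesel", "Regular", "Diesel", "Regular"] * 2,
    "Region": ["R11", "R24", "R11", "R82", "R24", "R82", "R11", "R24"],
    "Area": ["A", "B", "C", "A", "D", "B", "C", "A"],
    "VehAge": [0, 1, 2, 4, 6, 8, 10, 15],
    "DrivAge": [18, 25, 32, 40, 47, 55, 63, 70],
    "Density": [50, 100, 400, 1200, 3000, 8000, 15000, 27000],
    "BonusMalus": [50, 50, 55, 60, 68, 80, 95, 100],
})

# Density: take the logarithm first, then standardize.
log_scale = make_pipeline(FunctionTransformer(np.log), StandardScaler())

preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["VehBrand", "VehPower", "VehGas", "Region", "Area"]),
        # n_bins=3 is an assumption chosen to suit the tiny toy data.
        ("kbins", KBinsDiscretizer(n_bins=3), ["VehAge", "DrivAge"]),
        ("log_scaled", log_scale, ["Density"]),
        ("passthrough", "passthrough", ["BonusMalus"]),
    ],
    remainder="drop",      # features not listed are left out of the model
    sparse_threshold=0,    # always return a dense array, for easy inspection
)

X = preprocessor.fit_transform(df_toy)
print(X.shape)
```

Output columns follow the order of the `transformers` list, so the last column is the untouched `BonusMalus` and the one before it is the scaled log-density.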

Results

| Module | Distribution | y_train | sample_weight | train D2 | test D2 | train MAE | test MAE | train MSE | test MSE |
|---|---|---|---|---|---|---|---|---|---|
| sklearn | Frequency Modelling (Poisson Distribution) | df_train['Frequency'] | df_train['Exposure'] | 0.051384 | 0.048138 | 0.232085 | 0.224547 | 4.738399 | 2.407906 |
| sklearn | Severity Modelling (Gamma Distribution) | df_train.loc[mask_train, 'AvgClaimAmount'] | df_train.loc[mask_train, 'ClaimNb'] | - 3.638157e-03 | -4.747382e-04 | 1.859814e+03 | 1.856312e+03 | 4.959565e+06 | |
| sklearn | Pure Premium Modelling (TweedieRegressor) | df_train['PurePremium'] | df_train['Exposure'] | 2.018645e-02 | 1.353285e-02 | 6.580440e+02 | 4.927505e+02 | 1.478259e+09 | 1.622053e+08 |
| xgboost | Xgboost Tweedie Regression | dtrain.set_base_margin(np.log(df_train['Exposure'].to_numpy())) | dtest.set_base_margin(np.log(df_test['Exposure'].to_numpy())) | - | - | 1.760538e+03 | 1.588351e+03 | 1.481952e+09 | 1.659363e+08 |
| pygam | GAM Linear Model | df_train["AvgClaimAmount"].values | N/A | - | - | 1.686438e+02 | 1.655408e+02 | 1.785332e+06 | 1.647533e+06 |
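The three metrics in the table can be computed with sample weights roughly as sketched below. D2 is the fraction of Tweedie deviance explained relative to a model that always predicts the weighted mean. The helper `d2_score`, the toy arrays and the choice `power=1` (Poisson) are assumptions for illustration, and the notebooks may compute the scores differently.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_tweedie_deviance,
)


def d2_score(y_true, y_pred, sample_weight, power):
    """Fraction of Tweedie deviance explained relative to a mean-only model."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_mean = np.average(y_true, weights=sample_weight)
    dev = mean_tweedie_deviance(
        y_true, y_pred, sample_weight=sample_weight, power=power
    )
    dev_null = mean_tweedie_deviance(
        y_true, np.full_like(y_pred, y_mean), sample_weight=sample_weight, power=power
    )
    return 1.0 - dev / dev_null


# Hypothetical observed frequencies, exposures (weights) and predictions.
y_true = np.array([0.0, 0.0, 1.0, 2.0, 0.0, 4.0])
weights = np.array([1.0, 0.5, 1.0, 0.8, 0.3, 1.0])
y_pred = np.array([0.2, 0.3, 0.9, 1.8, 0.4, 3.5])

# power=1 is the Poisson deviance, power=2 is Gamma, and 1 < power < 2 is
# compound Poisson-Gamma (Tweedie).
d2 = d2_score(y_true, y_pred, weights, power=1)
mae = mean_absolute_error(y_true, y_pred, sample_weight=weights)
mse = mean_squared_error(y_true, y_pred, sample_weight=weights)
print(f"D2={d2:.4f}  MAE={mae:.4f}  MSE={mse:.4f}")
```

A D2 near zero, as in the sklearn rows of the table, means the model explains only slightly more deviance than the mean-only baseline. A negative value means it does worse than that baseline on the evaluated data.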