import numpy as np
import pandas as pd
import seaborn as sns
import os,sys,time
import sklearn
import scipy
import scipy.sparse  # load_npz lives in the sparse submodule
import matplotlib.pyplot as plt
sns.set()
import json
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor, GammaRegressor, TweedieRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_tweedie_deviance
SEED = 100
pd.set_option('max_columns',100)
pd.set_option('plotting.backend','matplotlib') # matplotlib, bokeh, altair, plotly
%load_ext watermark
%watermark -iv
autopep8 1.5.2 seaborn 0.11.0 numpy 1.18.4 joblib 0.16.0 pandas 1.1.0 scipy 1.4.1 sklearn 0.23.1 json 2.0.9
df = pd.read_csv('../data/processed/clean_data.csv.zip', compression='zip')
print(df.shape)
df.head(2).append(df.tail(2))
(100000, 15)
| | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimAmount | PurePremium | Frequency | AvgClaimAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.10 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99998 | 0 | 0.90 | C | 7 | 9 | 44 | 50 | B1 | Regular | 191 | R24 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99999 | 0 | 0.90 | E | 4 | 12 | 53 | 50 | B1 | Regular | 4116 | R24 | 0.0 | 0.0 | 0.0 | 0.0 |
X = scipy.sparse.load_npz("../data/processed/X.npz")
df.head(2)
| | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimAmount | PurePremium | Frequency | AvgClaimAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.10 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0.0 | 0.0 | 0.0 | 0.0 |
np.array(X[0].todense())[0][-5:] # last elements of first row
array([ 0. , 1. , 0. , 0.69864446, 50. ])
with open("../data/processed/features.json") as fi:
json_features = json.load(fi)
json_features.keys()
dict_keys(['cols_ohe_before', 'cols_kbin', 'cols_log_scale', 'cols_pass', 'feature_names_before', 'feature_names_after', 'desc'])
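Assuming feature_names_after lists the columns of the transformed design matrix in order (an assumption about how features.json was written), its length should match the 71 columns of X:
feature_names = json_features['feature_names_after']
len(feature_names), X.shape[1]  # expected to both be 71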
from sklearn.model_selection import train_test_split
df_train, df_test, X_train, X_test = train_test_split(df, X, random_state=SEED)
df_train.shape, df_test.shape, X_train.shape, X_test.shape
((75000, 15), (25000, 15), (75000, 71), (25000, 71))
Ref: https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression
We can model the total claim amount per unit of exposure (the PurePremium column) in two ways: directly with a single Tweedie regressor (a Compound Poisson Gamma model), or as the product of a Poisson model for claim frequency and a Gamma model for claim severity (sketched further below). Here we fit the direct Tweedie model.
Generalized Linear Models (GLM) extend linear models in two ways. First, the predicted values $\hat{y}$ are linked to a linear combination of the input variables $X$ via an inverse link function $h$ as

$$ \hat{y}(w, X)=h(X w) $$

Secondly, the squared loss function is replaced by the unit deviance $d$ of a distribution in the exponential family (or, more precisely, a reproductive exponential dispersion model (EDM)).
The minimization problem becomes:

$$ \min _{w} \frac{1}{2 n_{\text {samples }}} \sum_{i} d\left(y_{i}, \hat{y}_{i}\right)+\frac{\alpha}{2}\|w\|_{2}^{2} $$

where $\alpha$ is the L2 regularization penalty. When sample weights are provided, the average becomes a weighted average.
The following table lists some specific EDMs and their unit deviance (all of these are instances of the Tweedie family):
Distribution | Target Domain | Unit Deviance $d(y, \hat{y})$ | Power | Regressors |
---|---|---|---|---|
Normal | $y \in(-\infty, \infty)$ | $(y-\hat{y})^{2}$ | 0 | Ridge, ElasticNet |
Poisson | $y \in[0, \infty)$ | $2\left(y \log \frac{y}{\hat{y}}-y+\hat{y}\right)$ | 1 | PoissonRegressor as alias of TweedieRegressor(power=1, link='log') |
Gamma | $y \in(0, \infty)$ | $2\left(\log \frac{\hat{y}}{y}+\frac{y}{\hat{y}}-1\right) $ | 2 | GammaRegressor as alias of TweedieRegressor(power=2, link='log') |
Inverse Gaussian | $y \in(0, \infty)$ | $\frac{(y-\hat{y})^{2}}{y \hat{y}^{2}}$ | 3 | TweedieRegressor(power=3, link='log') |
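As a quick numerical illustration of the table (on an arbitrary toy array, not the insurance data), mean_tweedie_deviance with power=0 reduces to the mean squared error, and power=1 reproduces the Poisson unit deviance:

y_a = np.array([1.0, 2.0, 3.0, 4.0])
y_b = np.array([1.5, 1.5, 3.5, 3.0])
mean_tweedie_deviance(y_a, y_b, power=0), mean_squared_error(y_a, y_b)   # identical
poisson_dev = np.mean(2 * (y_a * np.log(y_a / y_b) - y_a + y_b))         # unit deviance for power=1
mean_tweedie_deviance(y_a, y_b, power=1), poisson_dev                    # identical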
The choice of the distribution depends on the problem at hand:
If the target values are counts (non-negative integer valued) or relative frequencies (non-negative), you might use a Poisson deviance with log-link.
If the target values are positive valued and skewed, you might try a Gamma deviance with log-link.
If the target values seem to be heavier tailed than a Gamma distribution, you might try an Inverse Gaussian deviance (or even higher variance powers of the Tweedie family).
Examples of use cases include:
Agriculture / weather modeling: number of rain events per year (Poisson), amount of rainfall per event (Gamma), total rainfall per year (Tweedie / Compound Poisson Gamma).
Risk modeling / insurance policy pricing: number of claim events / policyholder per year (Poisson), cost per event (Gamma), total cost per policyholder per year (Tweedie / Compound Poisson Gamma).
Predictive maintenance: number of production interruption events per year (Poisson), duration of interruption (Gamma), total interruption time per year (Tweedie / Compound Poisson Gamma).
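For reference, the frequency times severity alternative mentioned earlier can be sketched as follows (not run in this notebook; the alpha values are placeholders, and the sketch assumes the same df_train / X_train split defined above): a Poisson GLM models claim frequency weighted by Exposure, a Gamma GLM models average claim cost weighted by ClaimNb on the policies that actually had claims, and their product estimates the pure premium.

glm_freq = PoissonRegressor(alpha=1e-3, max_iter=10_000)
glm_freq.fit(X_train, df_train['Frequency'], sample_weight=df_train['Exposure'])

mask = (df_train['ClaimAmount'] > 0).to_numpy()  # severity is only defined where claims occurred
glm_sev = GammaRegressor(alpha=10.0, max_iter=10_000)
glm_sev.fit(X_train[mask], df_train['AvgClaimAmount'].to_numpy()[mask],
            sample_weight=df_train['ClaimNb'].to_numpy()[mask])

# expected frequency times expected severity approximates the expected pure premium
pure_premium_freq_sev = glm_freq.predict(X_test) * glm_sev.predict(X_test)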
TweedieRegressor(*, power=0.0, alpha=1.0, fit_intercept=True, link='auto',
                 max_iter=100, tol=0.0001, warm_start=False, verbose=0)
power : float, default=0
The power determines the underlying target distribution according
to the following table:
+-------+------------------------+
| Power | Distribution |
+=======+========================+
| 0 | Normal |
+-------+------------------------+
| 1 | Poisson |
+-------+------------------------+
| (1,2) | Compound Poisson Gamma |
+-------+------------------------+
| 2 | Gamma |
+-------+------------------------+
| 3 | Inverse Gaussian |
+-------+------------------------+
For 0 < power < 1, no distribution exists.
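Note on the link: with the default link='auto', the link function resolves to 'identity' for the normal case (power <= 0) and to 'log' for power >= 1, so the two specifications below (with placeholder hyperparameters) describe the same model.

glm_a = TweedieRegressor(power=1.9, alpha=0.1)              # link='auto' resolves to the log link
glm_b = TweedieRegressor(power=1.9, alpha=0.1, link='log')  # equivalent, spelled out explicitly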
$D^2$ is a generalization of the coefficient of determination $R^2$: $R^2$ uses the squared error while $D^2$ uses the deviance. Note that the two are equal for the normal distribution (family='normal', i.e. power=0).

$D^2$ is defined as

$$ D^{2}=1-\frac{D\left(y_{\text {true }}, y_{\text {pred }}\right)}{D_{\text {null }}} $$

where $D_{\text{null}}$ is the null deviance, i.e. the deviance of a model with intercept alone, which corresponds to $y_{\text{pred}}=\bar{y}$. The mean $\bar{y}$ is averaged by sample_weight. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
from sklearn.linear_model import PoissonRegressor, GammaRegressor, TweedieRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_tweedie_deviance
# TweedieRegressor?
glm_twd = TweedieRegressor(power=1.9, alpha=.1, max_iter=10_000)
glm_twd.fit(X_train, df_train["PurePremium"],
sample_weight=df_train["Exposure"])
TweedieRegressor(alpha=0.1, max_iter=10000, power=1.9)
joblib.dump(glm_twd,"../outputs/glm_twd.joblib")
['../outputs/glm_twd.joblib']
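The persisted estimator can later be restored without refitting:
glm_twd = joblib.load("../outputs/glm_twd.joblib")  # reload the fitted model from disk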
target = 'PurePremium'
y_train = df_train[target].to_numpy().ravel()
y_test = df_test[target].to_numpy().ravel()
tr_D2 = glm_twd.score(X_train,
df_train['PurePremium'],
sample_weight=df_train['Exposure'])
tx_D2 = glm_twd.score(X_test,
df_test['PurePremium'],
sample_weight=df_test['Exposure'])
tr_preds = glm_twd.predict(X_train)
tx_preds = glm_twd.predict(X_test)
tr_mae = mean_absolute_error(y_train,tr_preds)
tx_mae = mean_absolute_error(y_test,tx_preds)
tr_mse = mean_squared_error(y_train, tr_preds)
tx_mse = mean_squared_error(y_test,tx_preds)
df_eval_twd = pd.DataFrame(
{'train': [tr_D2, tr_mae, tr_mse],
'test': [tx_D2, tx_mae, tx_mse]})
df_eval_twd.index = ['D2','mean_absolute_error','mean_squared_error']
df_eval_twd
| | train | test |
|---|---|---|
| D2 | 2.018645e-02 | 1.353285e-02 |
| mean_absolute_error | 6.580440e+02 | 4.927505e+02 |
| mean_squared_error | 1.478259e+09 | 1.622053e+08 |
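As a sanity check of the $D^2$ definition above, it can be recomputed from mean_tweedie_deviance at the model's power (1.9), weighting by Exposure and using the Exposure-weighted mean of the training target as the intercept-only (null) prediction; the value should agree with the train $D^2$ in the table.

w = df_train['Exposure'].to_numpy()
dev = mean_tweedie_deviance(y_train, tr_preds, sample_weight=w, power=1.9)
dev_null = mean_tweedie_deviance(y_train,
                                 np.full_like(y_train, np.average(y_train, weights=w)),
                                 sample_weight=w, power=1.9)
1 - dev / dev_null  # expected to match tr_D2 above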