The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data.
Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.
Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
Feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
Remember that Recall = TP / (TP + FN). In fraud detection, classifying a fraud as non-fraud (FN) is the riskier error, so we use recall to compare the performance of the models. The higher the recall, the better the model.
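As a small worked illustration of the formula, the sketch below computes recall from raw counts. The helper name `recall_from_counts` is hypothetical (it is not part of sklearn or of this notebook); the two sets of counts are taken from confusion matrices printed further down in this notebook.

```python
def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no positive cases."""
    positives = tp + fn
    return tp / positives if positives else 0.0

# Imbalanced test fold: 0 frauds caught, 7 missed.
print(recall_from_counts(tp=0, fn=7))             # 0.0
# Undersampled test set: 91 frauds caught, 7 missed.
print(round(recall_from_counts(tp=91, fn=7), 4))  # 0.9286
```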
The dataset is highly imbalanced. It has about 284k non-frauds and only 492 frauds. This means that out of 1,000 transactions, about 998 are normal and about 2 are fraud cases.
Also, we should note that the data covers just two days; we implicitly assume that these two days are representative of the overall trend and reflect the properties of the population.
There could have been more or fewer fraudulent transactions on those particular days, but we do not take that into consideration and we generalize the result. Put differently, the conclusions reached from these two days of data are appropriate for populations whose data distribution is similar to that of these two days.
We are more interested in finding the fraud cases. A false negative (FN), predicting fraud as non-fraud, is riskier than a false positive, predicting non-fraud as fraud. So the suitable metric for model evaluation is RECALL.
In banking, there are always many normal transactions and only a few fraudulent ones. We may train our model on any transformation of the training data, but when testing the model the test set should look like real life, i.e., it should have many normal cases and very few fraudulent cases.
This means we can train our model on imbalanced or balanced (undersampled or oversampled) data, but we should test our model on an IMBALANCED dataset.
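To see why accuracy is a poor guide on an imbalanced test set, consider a hypothetical baseline that labels every transaction as non-fraud. The sketch below uses the class counts of the full dataset (284,315 non-frauds and 492 frauds); the function name is illustrative and not part of any library.

```python
def majority_baseline_scores(n_negative, n_positive):
    """Accuracy and recall of a model that always predicts the negative class."""
    tp, fn = 0, n_positive      # every fraud is missed
    tn, fp = n_negative, 0      # every normal transaction is labelled correctly
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

accuracy, recall = majority_baseline_scores(n_negative=284_315, n_positive=492)
print(f"accuracy = {accuracy:.4f}, recall = {recall:.2f}")
```

The baseline is right more than 99.8% of the time yet catches no fraud, which is why recall on a realistic, imbalanced test set is the number to watch.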
from bhishan.util_model_eval import get_binary_classification_scalar_metrics
from bhishan.util_model_eval import print_confusion_matrix_frauds
from bhishan.util_model_eval import plot_confusion_matrix_plotly
# smote
from imblearn.over_sampling import SMOTE
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
# sklearn scalar metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
# multiple metrics
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_fscore_support
# roc auc and curves
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
# confusion matrix and classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
df = pd.read_csv('../data/raw/',compression='zip')
(284807, 31)
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | -0.551600 | -0.617801 | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | 1.612727 | 1.065235 | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | 0.624501 | 0.066084 | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | -0.226487 | 0.178228 | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | -0.822843 | 0.538196 | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
0 284315 1 492 Name: Class, dtype: int64
0 998.272514 1 1.727486 Name: Class, dtype: float64
target = 'Class'
df_corr = df.drop(columns=target).corrwith(df[target]).sort_values()
df_corr.plot.bar(figsize = (12, 8), title = "Correlation with class",
                 fontsize = 12, rot = 90, grid = True)
# V17, V14, V12 and V10 have an absolute correlation above 0.2
df_corr.loc[ abs(df_corr.values)>0.2]
V17 -0.326481 V14 -0.302544 V12 -0.260593 V10 -0.216883 dtype: float64
high_corr_idx = df_corr.loc[ abs(df_corr.values)>0.2].index.values.tolist()
['V17', 'V14', 'V12', 'V10']
def dist_plot():
for c in high_corr_idx:
# to reduce skewness, we can use a Box-Cox transform.
# RobustScaler is less prone to outliers.
scaler = RobustScaler()
df['scaled_amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = scaler.fit_transform(df['Time'].values.reshape(-1,1))
# Find outliers using IQR method
q1 = df[high_corr_idx].quantile(0.25)
q3 = df[high_corr_idx].quantile(0.75)
iqr = q3 - q1
threshold = 1.5
cond1 = df[high_corr_idx] < (q1 - threshold * iqr)
cond2 = df[high_corr_idx] > (q3 + threshold * iqr)
cond = cond1 | cond2
idx_no_outliers = df[high_corr_idx][~(cond).any(axis=1)].index
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df_no_outliers = df.loc[idx_no_outliers]
df.shape, df_no_outliers.shape
((284807, 33), (250883, 33))
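The IQR rule used above can be illustrated on a tiny list with only the standard library. This is a sketch, not the notebook's pandas code: `iqr_bounds` is a hypothetical helper, and `statistics.quantiles(..., method="inclusive")` is chosen because it interpolates linearly between data points, as the pandas default does.

```python
import statistics

def iqr_bounds(values, threshold=1.5):
    """Return the (lower, upper) fences: q1 - t * IQR and q3 + t * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

values = [1, 2, 3, 4, 5, 100]
lower, upper = iqr_bounds(values)
outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)   # -1.5 8.5 [100]
```

Because frauds may sit in the tails of the features most correlated with the class, it can be worth comparing the `Class` counts before and after such a filter.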
array(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class', 'scaled_amount', 'scaled_time'], dtype=object)
target = 'Class'
features_no_log = df.columns.difference(['Amount','Time','Class']).values.tolist()
features_with_log = df.columns.difference(['scaled_amount','scaled_time','Class']).values.tolist()
idx = idx_no_outliers
cols = features_with_log
df_X = df.loc[idx,cols]
df_y = df.loc[idx,target]
skf = StratifiedKFold(n_splits=5, random_state=SEED, shuffle=True)
for idx_tr, idx_tx in skf.split(df_X, df_y):
    df_Xtrain, df_Xtest = df_X.iloc[idx_tr], df_X.iloc[idx_tx]
    df_ytrain, df_ytest = df_y.iloc[idx_tr], df_y.iloc[idx_tx]
# for imbalanced data, we use stratified k fold splitting
# class proportion are maintained same in train and test
0 0.999841 1 0.000159 0 0.999860 1 0.000140 Name: Class, dtype: float64
0 284315 1 492 Name: Class, dtype: int64
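The idea behind stratified splitting, keeping the class proportions roughly equal in every fold, can be sketched in plain Python. This is a simplified illustration and not sklearn's `StratifiedKFold` implementation; `stratified_folds` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=0):
    """Split row indices into n_splits test folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [0] * 95 + [1] * 5          # toy data: 5% positives
for fold in stratified_folds(labels):
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)      # each fold: 20 rows, 1 positive
```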
# we need numpy arrays for stratified splitting.
Xtrain = df_Xtrain.values
Xtest = df_Xtest.values
ytrain = df_ytrain.values
ytest = df_ytest.values
from sklearn.metrics import classification_report, recall_score
# define classifier
clf_lr = LogisticRegression(solver='liblinear',
                            n_jobs=1) # for liblinear n_jobs is +1.
# fit the classifier
clf_lr.fit(Xtrain, ytrain)
# get the prediction
ypreds_lr = clf_lr.predict(Xtest)
# model eval
recall = recall_score(ytest,ypreds_lr)
report = classification_report(ytest,ypreds_lr)
print(f'Recall Logistic Regression {recall: .2f}')
# Recall is ZERO: all the frauds are classified as non-frauds.
# With only 7 frauds among 50,175 test rows, the model simply predicts the majority class.
Recall Logistic Regression 0.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     50168
           1       0.00      0.00      0.00         7

    accuracy                           1.00     50175
   macro avg       0.50      0.50      0.50     50175
weighted avg       1.00      1.00      1.00     50175
from bhishan.util_model_eval import get_binary_classification_scalar_metrics
df_eval = get_binary_classification_scalar_metrics(
"Logistic Regression",
desc="Train Test Imbalanced", df_eval=None)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
from bhishan.util_model_eval import get_binary_classification_report
df_clf_report = get_binary_classification_report("Logistic Regression",
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
print(classification_report(ytest, ypreds_lr))
precision recall f1-score support 0 1.00 1.00 1.00 50168 1 0.00 0.00 0.00 7 accuracy 1.00 50175 macro avg 0.50 0.50 0.50 50175 weighted avg 1.00 1.00 1.00 50175
from bhishan.util_model_eval import print_confusion_matrix_frauds
print_confusion_matrix_frauds("Logistic Regression", ytest,ypreds_lr)
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 50,168 | 0 | 7 | 0 | 7 | 0.00% |
Fraud | 7 | 0 | 7 | 0 | 7 | 0.00% |
confusion_matrix(ytest, ypreds_lr)
array([[50168, 0], [ 7, 0]])
def do_grid_search(clf, params, Xtrain, ytrain, scoring='recall'):
    """Grid Search Cross Validation for given classifier.

    1. Use scoring = 'recall' for fraud detection, patient detection and
       similar situations where FN (False Negative) is more important.
    2. Use scoring = 'precision' for spam email detection and similar cases
       where FP (False Positive) is more important.
    """
    from sklearn.model_selection import GridSearchCV
    t0 = time.time()
    grid = GridSearchCV(clf, params, cv=5, n_jobs=-1, verbose=2, scoring=scoring)
    grid.fit(Xtrain, ytrain)
    clf_best = grid.best_estimator_
    t1 = time.time() - t0
    print('Time taken: {} minutes {:.2f} seconds'.format(*divmod(t1,60)))
    return clf_best
# Logistic Regression with Grid search
# Time taken: 4 min 31 secs
t0 = time.time()
clf_lr_grid = LogisticRegression(solver='liblinear',
n_jobs=1) # for liblinear n_jobs is +1.
params_lr_grid = {"penalty": ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf_lr_grid = do_grid_search(clf_lr_grid, params_lr_grid,
                             Xtrain, ytrain)
t1 = time.time() - t0
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(t1,60)))
Fitting 5 folds for each of 14 candidates, totalling 70 fits
Time taken: 2.0 minutes 46.13 seconds
Time taken: 2 min 46 secs
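The first timing line reads `2.0 minutes` because `divmod` applied to a float returns floats, and `{}` prints them unchanged. A minimal sketch of a formatter that avoids this is shown below; `format_elapsed` is a hypothetical helper, not part of the notebook.

```python
def format_elapsed(seconds):
    """Format a duration as whole minutes plus fractional seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)} min {secs:.2f} secs"

print(format_elapsed(166.13))  # 2 min 46.13 secs
```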
ypreds_lr_grid = clf_lr_grid.predict(Xtest)
recall_grid = recall_score(ytest, ypreds_lr_grid)
report_grid = classification_report(ytest,ypreds_lr_grid)
print(f'Recall LR Grid search {recall_grid: .2f}')
# Even after grid search, recall is 0 for the fraud cases.
# The model still predicts only the majority class.
# Some possible ways to handle this are:
# 1. random undersampling (this shrinks the dataset from about 285k rows to about 1k)
# 2. oversampling with SMOTE (this grows the dataset from about 285k rows to about 2 * 284k)
Recall LR Grid search 0.00 precision recall f1-score support 0 1.00 1.00 1.00 50168 1 0.00 0.00 0.00 7 accuracy 1.00 50175 macro avg 0.50 0.50 0.50 50175 weighted avg 1.00 1.00 1.00 50175
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Test Imbalanced, Grid Search", df_eval=df_eval)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Test Imbalanced, Grid Search',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 | |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
print(classification_report(ytest,ypreds_lr_grid ))
precision recall f1-score support 0 1.00 1.00 1.00 50168 1 0.00 0.00 0.00 7 accuracy 1.00 50175 macro avg 0.50 0.50 0.50 50175 weighted avg 1.00 1.00 1.00 50175
print_confusion_matrix_frauds("Train Test Imbalanced, Grid Search",
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 50,168 | 0 | 7 | 0 | 7 | 0.00% |
Fraud | 7 | 0 | 7 | 0 | 7 | 0.00% |
confusion_matrix(ytest, ypreds_lr_grid)
array([[50168, 0], [ 7, 0]])
target = 'Class'
n = df[target].value_counts().values[-1]
df_under = (df.groupby(target)
.apply(lambda x: x.sample(n,random_state=SEED))
1 492 0 492 Name: Class, dtype: int64
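Random undersampling keeps every row of the rare class and draws an equally sized random sample from each other class, which is how the 492 / 492 split above arises. The sketch below shows the same idea on toy data with the standard library only; `random_undersample` and `label_of` are hypothetical names, and this is not the pandas `groupby(...).apply(...)` code used in the notebook.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, label_of, seed=0):
    """Keep as many rows of every class as the rarest class has."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, n_min))
    return sampled

# Toy data: (amount, label) pairs with 3 frauds among 103 rows.
rows = [(float(i), 0) for i in range(100)] + [(500.0, 1), (750.0, 1), (900.0, 1)]
balanced = random_undersample(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))  # 3 rows per class
```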
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class', 'scaled_amount', 'scaled_time'], dtype='object')
features_with_log = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
'scaled_amount', 'scaled_time']
Xtrain_under,Xtest_under,ytrain_under,ytest_under = \
from sklearn.metrics import classification_report, recall_score
# define classifier
clf_lr_under = LogisticRegression(solver='liblinear',
                                  n_jobs=1) # for liblinear n_jobs is +1.
# fit the classifier
clf_lr_under.fit(Xtrain_under, ytrain_under)
# get the prediction
ypreds_lr_under = clf_lr_under.predict(Xtest_under) ## ** Test on Undersample**
# model eval
recall_under = recall_score(ytest_under,ypreds_lr_under)
report_under = classification_report(ytest_under,ypreds_lr_under)
print(f'Recall: Train Test Undersample {recall_under: .2f}')
# Now we have a much smaller dataset, but much better recall scores (measured on the balanced test split).
Recall: Train Test Undersample 0.93
              precision    recall  f1-score   support

           0       0.93      0.96      0.95        99
           1       0.96      0.93      0.94        98

    accuracy                           0.94       197
   macro avg       0.94      0.94      0.94       197
weighted avg       0.94      0.94      0.94       197
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Test Undersample", df_eval=df_eval)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
1 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
2 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Test Undersample',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
2 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 | |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
precision recall f1-score support 0 0.93 0.96 0.95 99 1 0.96 0.93 0.94 98 accuracy 0.94 197 macro avg 0.94 0.94 0.94 197 weighted avg 0.94 0.94 0.94 197
print_confusion_matrix_frauds('Train Test Undersample',
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 95 | 4 | 98 | 91 | 7 | 92.86% |
Fraud | 7 | 91 | 98 | 91 | 7 | 92.86% |
array([[95, 4], [ 7, 91]])
Xtest.shape, Xtest_under.shape
((50175, 30), (197, 30))
# define classifier
clf_lr_under_imb = LogisticRegression(solver='liblinear',
                                      n_jobs=1) # for liblinear n_jobs is +1.
# fit the classifier
clf_lr_under_imb.fit(Xtrain_under, ytrain_under)
# get the prediction
ypreds_lr_under_imb = clf_lr_under_imb.predict(Xtest) ## ** Test on Imbalanced**
# model eval
recall_under_imb = recall_score(ytest,ypreds_lr_under_imb)
report_under_imb = classification_report(ytest,ypreds_lr_under_imb)
print(f'Recall: Train Undersample, Test Imbalanced {recall_under_imb: .2f}')
# Tested on the imbalanced set, the model trained on undersampled data catches none of the 7 frauds.
Recall: Train Undersample, Test Imbalanced 0.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     50168
           1       0.00      0.00      0.00         7

    accuracy                           1.00     50175
   macro avg       0.50      0.50      0.50     50175
weighted avg       1.00      1.00      1.00     50175
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Undersample, Test Imbalanced", df_eval=df_eval)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
1 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
2 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
3 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0 | 0 | 0 | -0.000229906 | -0.000203943 | 0.000139512 | 0.492705 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Undersample, Test Imbalanced',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
2 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 | |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
3 | Logistic Regression | Train Undersample, Test Imbalanced | 0.99986 | 0 | 0.999621 | 0 | 0.999741 | 0 | 50168 | 7 |
precision recall f1-score support 0 1.00 1.00 1.00 50168 1 0.00 0.00 0.00 7 accuracy 1.00 50175 macro avg 0.50 0.50 0.50 50175 weighted avg 1.00 1.00 1.00 50175
print_confusion_matrix_frauds('Train Undersample, Test Imbalanced',
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 50,149 | 19 | 7 | 0 | 7 | 0.00% |
Fraud | 7 | 0 | 7 | 0 | 7 | 0.00% |
array([[50149, 19], [ 7, 0]])
# Grid Search for Logistic Regression with Undersampling
clf_lr_under_grid = LogisticRegression(solver='liblinear',
n_jobs=1) # for liblinear n_jobs is +1.
params_lr_under_grid = {"penalty": ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf_lr_under_grid = do_grid_search(clf_lr_under_grid, params_lr_under_grid,
Fitting 5 folds for each of 14 candidates, totalling 70 fits
Time taken: 0.0 minutes 1.97 seconds
ypreds_lr_under_grid = clf_lr_under_grid.predict(Xtest_under)
recall_under_grid = recall_score(ytest_under, ypreds_lr_under_grid)
report_under_grid = classification_report(ytest_under,ypreds_lr_under_grid)
print(f'Recall: Train Test Undersample, Grid Search {recall_under_grid: .2f}')
# recall for fraud is 0.93 and for non-fraud is 0.96 with undersampling only
# recall for fraud is 0.94 and for non-fraud is 0.91 with undersampling plus grid search
Recall: Train Test Undersample, Grid Search 0.94 precision recall f1-score support 0 0.94 0.91 0.92 99 1 0.91 0.94 0.92 98 accuracy 0.92 197 macro avg 0.92 0.92 0.92 197 weighted avg 0.92 0.92 0.92 197
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Test Undersample, Grid Search", df_eval=df_eval)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.923858 | 0.910891 | 0.938776 | 0.924623 | 0.848129 | 0.847735 | 0.985611 | 0.97918 |
1 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
2 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
3 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
4 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0 | 0 | 0 | -0.000229906 | -0.000203943 | 0.000139512 | 0.492705 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Test Undersample, Grid Search',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
4 | Logistic Regression | Train Test Undersample, Grid Search | 0.9375 | 0.910891 | 0.909091 | 0.938776 | 0.923077 | 0.924623 | 99 | 98 |
2 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 | |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
3 | Logistic Regression | Train Undersample, Test Imbalanced | 0.99986 | 0 | 0.999621 | 0 | 0.999741 | 0 | 50168 | 7 |
precision recall f1-score support 0 0.94 0.91 0.92 99 1 0.91 0.94 0.92 98 accuracy 0.92 197 macro avg 0.92 0.92 0.92 197 weighted avg 0.92 0.92 0.92 197
print_confusion_matrix_frauds("Train Test Undersample, Grid Search",
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 90 | 9 | 98 | 92 | 6 | 93.88% |
Fraud | 6 | 92 | 98 | 92 | 6 | 93.88% |
array([[90, 9], [ 6, 92]])
# Oversampling and cross validation
def modelling_smote_lr_cross_validation(fname_pkl):
import io
import joblib
# Time taken 45.0 mins 55.83 seconds
t0 = time.time()
# metrics lists
accuracy_lst, precision_lst,recall_lst,f1_lst,auc_lst = [], [], [], [], []
# randomized classifier
clf_lr_params = {"penalty": ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# liblinear supports l1 and l2 penalty, lbfgs does not.
# liblinear does not have n_jobs but lbfgs has it.
clf_lr = LogisticRegression(solver='liblinear',
clf_lr_sm_rand = RandomizedSearchCV(clf_lr,
n_iter=4, # change this to 10
# for fraud detection recall is important
# stratified kfold gives train and test index for a set of (X,y)
for idx_tr, idx_tx in skf.split(Xtrain, ytrain):
# make pipeline from smote and randomized classifier
# first do smote oversampling
# then do randomized search cv
# NOTE: we can add standard scaling as the first step, but our values are
# already scaled.
pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'),
# fit the pipeline to get model using train index
model = pipeline.fit(Xtrain[idx_tr], ytrain[idx_tr])
# after fitting, get best estimator
best_est = clf_lr_sm_rand.best_estimator_
# After fitting using train index, we get accuracies using test index.
# accuracy from pipeline
# prediction from randomized best estimator
prediction = best_est.predict(Xtrain[idx_tx])
# scores from prediction
m1 = precision_score(ytrain[idx_tx], prediction)
m2 = recall_score(ytrain[idx_tx], prediction)
m3 = f1_score(ytrain[idx_tx], prediction)
m4 = roc_auc_score(ytrain[idx_tx], prediction)
# append scores to list
# Save the outputs to a dataframe
df_scores_smote = pd.DataFrame({'accuracy': accuracy_lst,
'precision': precision_lst,
'recall': recall_lst,
'f1-score': f1_lst
df_scores_smote.loc[:,'mean'] = df_scores_smote.mean(axis=1)
y_score = best_est.decision_function(Xtest)
average_precision = average_precision_score(ytest, y_score)
df_scores_smote.loc[:,'average_precision_score'] = average_precision
# classification report
ypreds_smote = best_est.predict(Xtest)
report = classification_report(ytest, ypreds_smote,
target_names=['No Fraud','Fraud'])
df_report_smote = pd.read_csv(io.StringIO(report),sep=r'\s\s+',engine='python')
# save the model to a file
joblib.dump(best_est, fname_pkl)
t1 = time.time() - t0
print('Time taken {} mins {:.2f} seconds'.format(*divmod(t1,60)))
# Run this code only once, it takes 45 minutes to run.
fname_pkl = '../models/serialization/logistic_regression_smote.pkl'
clf_lr_smote = joblib.load(fname_pkl)
ypreds_smote = clf_lr_smote.predict(Xtest)
report = classification_report(ytest, ypreds_smote,
target_names=['No Fraud','Fraud'])
precision recall f1-score support No Fraud 1.00 0.89 0.94 50168 Fraud 0.00 0.43 0.00 7 accuracy 0.89 50175 macro avg 0.50 0.66 0.47 50175 weighted avg 1.00 0.89 0.94 50175
df_scores_smote = pd.read_csv("../reports/csv/smote_cv_metrics.csv")
Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | mean | average_precision_score | |
0 | accuracy | 0.142857 | 0.285714 | 0.666667 | 0.666667 | 0.500000 | 0.452381 | 0.005842 |
1 | precision | 0.000292 | 0.000540 | 0.001043 | 0.000840 | 0.000671 | 0.000677 | 0.005842 |
2 | recall | 0.142857 | 0.285714 | 0.666667 | 0.666667 | 0.500000 | 0.452381 | 0.005842 |
3 | f1-score | 0.000582 | 0.001078 | 0.002083 | 0.001678 | 0.001340 | 0.001352 | 0.005842 |
df_report_smote = pd.read_csv('../reports/csv/smote_cv_classification_report.csv')
Unnamed: 0 | precision | recall | f1-score | support | |
0 | No Fraud | 1.00 | 0.89 | 0.94 | 50168.0 |
1 | Fraud | 0.00 | 0.43 | 0.00 | 7.0 |
2 | accuracy | 0.89 | 50175.00 | NaN | NaN |
3 | macro avg | 0.50 | 0.66 | 0.47 | 50175.0 |
4 | weighted avg | 1.00 | 0.89 | 0.94 | 50175.0 |
smote = SMOTE(sampling_strategy='minority', random_state=SEED)
Xtrain_smote, ytrain_smote = smote.fit_resample(Xtrain, ytrain)
Xtrain.shape, Xtrain_smote.shape
((200708, 30), (401352, 30))
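SMOTE balances the classes by synthesizing new minority rows rather than duplicating existing ones, which is why the training set grows from 200,708 to 401,352 rows above. The sketch below illustrates the core interpolation idea in plain Python. It is a simplified stand-in, not imblearn's implementation, and `smote_like_oversample` and its parameters are hypothetical names.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))  # 6
```

Every synthetic point lies on a line segment between two real minority points, so the new rows stay inside the region the minority class already occupies.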
df_smote = pd.DataFrame(data=np.c_[Xtrain_smote,ytrain_smote],
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | scaled_amount | scaled_time | Class | |
0 | 149.62 | 0.0 | -1.359807 | 0.090794 | -0.551600 | -0.617801 | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | -0.072781 | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.0 |
1 | 2.69 | 0.0 | 1.191857 | -0.166974 | 1.612727 | 1.065235 | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | 0.266151 | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | 0.0 |
2 | 123.50 | 1.0 | -0.966272 | -0.054952 | -0.226487 | 0.178228 | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.185226 | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | 0.0 |
3 | 3.67 | 2.0 | -0.425966 | -0.371407 | 1.341262 | 0.359894 | -0.358091 | -0.137134 | 0.517617 | 0.401726 | -0.058133 | 0.068653 | -0.033194 | 0.960523 | 0.084968 | -0.208254 | -0.559825 | -0.026398 | -0.371427 | -0.232794 | 0.105915 | 0.253844 | 0.081080 | 1.141109 | -0.168252 | 0.420987 | -0.029728 | 0.476201 | 0.260314 | -0.568671 | 0.0 |
4 | 4.99 | 4.0 | 1.229658 | -0.099254 | -1.416907 | -0.153826 | -0.751063 | 0.167372 | 0.050144 | -0.443587 | 0.002821 | -0.611987 | -0.045575 | 0.141004 | -0.219633 | -0.167716 | -0.270710 | -0.154104 | -0.780055 | 0.750137 | -0.257237 | 0.034507 | 0.005168 | 0.045371 | 1.202613 | 0.191881 | 0.272708 | -0.005159 | 0.081213 | 0.464960 | 0.0 |
1.0 200676 0.0 200676 Name: Class, dtype: int64
clf_lr_smote = LogisticRegression(solver='liblinear',
                                  n_jobs=1) # for liblinear n_jobs is +1.
# fit the model
clf_lr_smote.fit(Xtrain_smote, ytrain_smote)
# get the prediction on original Xtest
ypreds_lr_smote = clf_lr_smote.predict(Xtest)
# model eval
recall_smote = recall_score(ytest,ypreds_lr_smote)
report_smote = classification_report(ytest,ypreds_lr_smote)
print(f'Recall SMOTE {recall_smote: .2f}')
Recall SMOTE 0.43
              precision    recall  f1-score   support

           0       1.00      0.80      0.89     50168
           1       0.00      0.43      0.00         7

    accuracy                           0.80     50175
   macro avg       0.50      0.61      0.44     50175
weighted avg       1.00      0.80      0.89     50175
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Oversample SMOTE, Test Imbalanced", df_eval=df_eval)
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.923858 | 0.910891 | 0.938776 | 0.924623 | 0.848129 | 0.847735 | 0.985611 | 0.97918 |
1 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.796831 | 0.00029432 | 0.428571 | 0.000588235 | 0.00661815 | 0.00030949 | 0.00142026 | 0.715063 |
3 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
4 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
5 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0 | 0 | 0 | -0.000229906 | -0.000203943 | 0.000139512 | 0.492705 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Oversample SMOTE, Test Imbalanced',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
4 | Logistic Regression | Train Test Undersample, Grid Search | 0.9375 | 0.910891 | 0.909091 | 0.938776 | 0.923077 | 0.924623 | 99 | 98 |
2 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
5 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.9999 | 0.00029432 | 0.796882 | 0.428571 | 0.886922 | 0.000588235 | 50168 | 7 |
0 | Logistic Regression | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 | |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
3 | Logistic Regression | Train Undersample, Test Imbalanced | 0.99986 | 0 | 0.999621 | 0 | 0.999741 | 0 | 50168 | 7 |
precision recall f1-score support 0 1.00 0.80 0.89 50168 1 0.00 0.43 0.00 7 accuracy 0.80 50175 macro avg 0.50 0.61 0.44 50175 weighted avg 1.00 0.80 0.89 50175
print_confusion_matrix_frauds('Logistic Regression Oversampling SMOTE',
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 39,978 | 10,190 | 7 | 3 | 4 | 42.86% |
Fraud | 4 | 3 | 7 | 3 | 4 | 42.86% |
array([[39978, 10190], [ 4, 3]])
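The scalar metrics reported for the SMOTE model can be recomputed by hand from this confusion matrix. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that sklearn's `confusion_matrix` returns for labels 0 and 1; `scores_from_confusion` is a hypothetical helper.

```python
def scores_from_confusion(matrix):
    """Accuracy, precision and recall from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

scores = scores_from_confusion([[39978, 10190], [4, 3]])
print({name: round(value, 6) for name, value in scores.items()})
```

The values agree with the metrics table above (accuracy 0.796831, precision 0.00029432, recall 0.428571): catching 3 of the 7 frauds comes at the cost of 10,190 false alarms, which is what the near-zero precision reflects.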
# Here, we take parameters from grid search of undersampled data.
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=4000, multi_class='warn', n_jobs=1, penalty='l1', random_state=100, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
# Only the non-default arguments are passed; the rest match the defaults
# shown in the repr above ('warn' is a repr placeholder, not a valid input
# in newer scikit-learn releases).
clf_lr_smote_grid_from_under = LogisticRegression(
    C=0.01, penalty='l1', solver='liblinear',
    max_iter=4000, random_state=100)
# fit the classifier on oversampled data
# Time taken: 0 min 16 secs
t0 = time.time()
clf_lr_smote_grid_from_under.fit(Xtrain_smote, ytrain_smote)
t1 = time.time() - t0
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(t1,60)))
Time taken: 0 min 11 secs
ypreds_lr_smote_grid_from_under = clf_lr_smote_grid_from_under.predict(Xtest)
recall_smote_grid_from_under = recall_score(ytest, ypreds_lr_smote_grid_from_under)
report_smote_grid_from_under = classification_report(ytest,ypreds_lr_smote_grid_from_under)
0.42857142857142855
              precision    recall  f1-score   support

           0       1.00      0.88      0.93     50168
           1       0.00      0.43      0.00         7

    accuracy                           0.88     50175
   macro avg       0.50      0.65      0.47     50175
weighted avg       1.00      0.88      0.93     50175
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression',
desc="Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample",
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.923858 | 0.910891 | 0.938776 | 0.924623 | 0.848129 | 0.847735 | 0.985611 | 0.97918 |
1 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.796831 | 0.00029432 | 0.428571 | 0.000588235 | 0.00661815 | 0.00030949 | 0.00142026 | 0.715063 |
3 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.875775 | 0.000481386 | 0.428571 | 0.000961693 | 0.0109009 | 0.000683173 | 0.00369781 | 0.730713 |
4 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.875775 | 0.000481386 | 0.428571 | 0.000961693 | 0.0109009 | 0.000683173 | 0.00369781 | 0.730713 |
5 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
6 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
7 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0 | 0 | 0 | -0.000229906 | -0.000203943 | 0.000139512 | 0.492705 |
df_clf_report = get_binary_classification_report("Logistic Regression",
desc='Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.9375 | 0.910891 | 0.909091 | 0.938776 | 0.923077 | 0.924623 | 99 | 98 |
1 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.9999 | 0.00029432 | 0.796882 | 0.428571 | 0.886922 | 0.000588235 | 50168 | 7 |
3 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.999909 | 0.000481386 | 0.875837 | 0.428571 | 0.93377 | 0.000961693 | 50168 | 7 |
4 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.999909 | 0.000481386 | 0.875837 | 0.428571 | 0.93377 | 0.000961693 | 50168 | 7 |
5 | Logistic Regression | | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
6 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
7 | Logistic Regression | Train Undersample, Test Imbalanced | 0.99986 | 0 | 0.999621 | 0 | 0.999741 | 0 | 50168 | 7 |
print(classification_report(ytest, ypreds_lr_smote_grid_from_under))
              precision    recall  f1-score   support

           0       1.00      0.88      0.93     50168
           1       0.00      0.43      0.00         7

    accuracy                           0.88     50175
   macro avg       0.50      0.65      0.47     50175
weighted avg       1.00      0.88      0.93     50175
print_confusion_matrix_frauds("Logistic Regression Oversampling \
SMOTE Grid Search from Undersampling",
ytest, ypreds_lr_smote_grid_from_under)
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 43,939 | 6,229 | 7 | 3 | 4 | 42.86% |
Fraud | 4 | 3 | 7 | 3 | 4 | 42.86% |
confusion_matrix(ytest, ypreds_lr_smote_grid_from_under)
array([[43939, 6229], [ 4, 3]])
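The Matthews correlation coefficient reported in the metrics table can also be recovered from these four counts. The sketch below is illustrative only; `matthews_corrcoef_from_counts` is a hypothetical helper name, and the zero-denominator convention it uses (return 0) is an assumption chosen to mirror the 0 shown in the table for the model that never predicts fraud.

```python
import math


def matthews_corrcoef_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any row or column of the matrix sums to 0.
    return numerator / denominator if denominator else 0.0


# SMOTE-trained model with the grid-search parameters: array([[43939, 6229], [4, 3]]).
mcc_smote_grid = matthews_corrcoef_from_counts(tn=43939, fp=6229, fn=4, tp=3)
# A model that labels every one of the 50168 + 7 test rows as non-fraud.
mcc_all_negative = matthews_corrcoef_from_counts(tn=50168, fp=0, fn=7, tp=0)

print(f"SMOTE + grid params: {mcc_smote_grid:.6f}")
print(f"All-negative model:  {mcc_all_negative:.6f}")
```

Both values sit at or near zero even though the accuracies are 0.88 and 0.9999, which is why accuracy alone is a poor guide on this test split.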
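# sampling_strategy and fit_resample are the current imbalanced-learn names
# for the older ratio argument and fit_sample method.
smote = SMOTE(sampling_strategy='minority', random_state=SEED)
Xtrain_smote, ytrain_smote = smote.fit_resample(Xtrain, ytrain)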
poly = PolynomialFeatures(2)
Xtrain_smote_poly = poly.fit_transform(Xtrain_smote)
Xtrain.shape, Xtrain_smote.shape, Xtrain_smote_poly.shape
((200708, 30), (401352, 30), (401352, 496))
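The jump from 30 to 496 columns follows from counting monomials: a full polynomial expansion of degree d over n inputs, bias term included, has C(n + d, d) columns. A small sketch of that count, using a hypothetical helper name `n_polynomial_features` and assuming the default settings (bias included, no interaction-only restriction):

```python
from math import comb


def n_polynomial_features(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    count = comb(n_features + degree, degree)
    # Dropping the constant (bias) column removes exactly one monomial.
    return count if include_bias else count - 1


n_cols = n_polynomial_features(30, 2)  # 30 inputs, degree 2, bias column included
print(n_cols)
```

For 30 inputs this gives 1 bias column, 30 linear terms, 30 squares and 435 pairwise products, 496 in total, so the degree-2 design matrix is roughly 16 times wider than the original one.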
# Same settings as the grid-search-from-undersample model; only the
# non-default arguments are passed.
clf_lr_smote_poly2 = LogisticRegression(
    C=0.01, penalty='l1', solver='liblinear',
    max_iter=4000, random_state=100)
# fit the classifier on oversampled data
# Time taken: 28 min 49 secs
fname_lr_smote_poly2_pkl = '../models/serialization/logistic_regression_smote_poly2.pkl'
clf_lr_smote_poly2 = joblib.load(fname_lr_smote_poly2_pkl)
# fit the classifier on oversampled data
# Time taken: 0 min 16 secs
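t0 = time.time()
# Fitting on Xtrain_smote (30 columns) keeps the model compatible with predict(Xtest) below.
clf_lr_smote_poly2.fit(Xtrain_smote, ytrain_smote)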
t1 = time.time() - t0
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(t1,60)))
Time taken: 0 min 11 secs
ypreds_lr_smote_poly2 = clf_lr_smote_poly2.predict(Xtest)
recall_lr_smote_poly2 = recall_score(ytest, ypreds_lr_smote_poly2)
report_lr_smote_poly2 = classification_report(ytest,ypreds_lr_smote_poly2)
0.42857142857142855
              precision    recall  f1-score   support

           0       1.00      0.88      0.93     50168
           1       0.00      0.43      0.00         7

    accuracy                           0.88     50175
   macro avg       0.50      0.65      0.47     50175
weighted avg       1.00      0.88      0.93     50175
df_eval = get_binary_classification_scalar_metrics(
'Logistic Regression Polynomial deg 2',
desc="Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample",
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.923858 | 0.910891 | 0.938776 | 0.924623 | 0.848129 | 0.847735 | 0.985611 | 0.97918 |
1 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.796831 | 0.00029432 | 0.428571 | 0.000588235 | 0.00661815 | 0.00030949 | 0.00142026 | 0.715063 |
3 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.875775 | 0.000481386 | 0.428571 | 0.000961693 | 0.0109009 | 0.000683173 | 0.00369781 | 0.730713 |
4 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.875775 | 0.000481386 | 0.428571 | 0.000961693 | 0.0109009 | 0.000683173 | 0.00369781 | 0.730713 |
5 | Logistic Regression | Train Test Imbalanced | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.00305379 | 0.624277 |
6 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 0 | 0 | 0 | 0 | 0.000155879 | 0.466225 |
7 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0 | 0 | 0 | -0.000229906 | -0.000203943 | 0.000139512 | 0.492705 |
df_clf_report = get_binary_classification_report("Logistic Regression Polynomial deg 2",
desc='Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample',
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.9375 | 0.910891 | 0.909091 | 0.938776 | 0.923077 | 0.924623 | 99 | 98 |
1 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99 | 98 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.9999 | 0.00029432 | 0.796882 | 0.428571 | 0.886922 | 0.000588235 | 50168 | 7 |
3 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.999909 | 0.000481386 | 0.875837 | 0.428571 | 0.93377 | 0.000961693 | 50168 | 7 |
4 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid Search from Undersample | 0.999909 | 0.000481386 | 0.875837 | 0.428571 | 0.93377 | 0.000961693 | 50168 | 7 |
5 | Logistic Regression | | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
6 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.99986 | 0 | 1 | 0 | 0.99993 | 0 | 50168 | 7 |
7 | Logistic Regression | Train Undersample, Test Imbalanced | 0.99986 | 0 | 0.999621 | 0 | 0.999741 | 0 | 50168 | 7 |
print(classification_report(ytest, ypreds_lr_smote_poly2))
              precision    recall  f1-score   support

           0       1.00      0.88      0.93     50168
           1       0.00      0.43      0.00         7

    accuracy                           0.88     50175
   macro avg       0.50      0.65      0.47     50175
weighted avg       1.00      0.88      0.93     50175
print_confusion_matrix_frauds("Logistic Regression Oversampling \
SMOTE Grid Search from Undersampling",
ytest, ypreds_lr_smote_poly2)
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 43,939 | 6,229 | 7 | 3 | 4 | 42.86% |
Fraud | 4 | 3 | 7 | 3 | 4 | 42.86% |
confusion_matrix(ytest, ypreds_lr_smote_poly2)
array([[43939, 6229], [ 4, 3]])
Model | Description | Accuracy | Precision | Recall | F1 | Mathews Correlation Coefficient | Cohens Kappa | Area Under Precision Curve | Area Under ROC Curve | |
0 | Logistic Regression | Train Test Undersample, Grid Search | 0.923858 | 0.910891 | 0.938776 | 0.924623 | 0.848129 | 0.847735 | 0.985611 | 0.979180 |
1 | Logistic Regression | Train Test Undersample | 0.944162 | 0.957895 | 0.928571 | 0.943005 | 0.888717 | 0.888305 | 0.991175 | 0.989281 |
2 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.796831 | 0.000294 | 0.428571 | 0.000588 | 0.006618 | 0.000309 | 0.001420 | 0.715063 |
3 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid ... | 0.875775 | 0.000481 | 0.428571 | 0.000962 | 0.010901 | 0.000683 | 0.003698 | 0.730713 |
4 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid ... | 0.875775 | 0.000481 | 0.428571 | 0.000962 | 0.010901 | 0.000683 | 0.003698 | 0.730713 |
5 | Logistic Regression | Train Test Imbalanced | 0.999860 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003054 | 0.624277 |
6 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.999860 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000156 | 0.466225 |
7 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999482 | 0.000000 | 0.000000 | 0.000000 | -0.000230 | -0.000204 | 0.000140 | 0.492705 |
Model | Description | Precision_0 | Precision_1 | Recall_0 | Recall_1 | F1_Score_0 | F1_Score_1 | Support_0 | Support_1 | |
4 | Logistic Regression | Train Test Undersample, Grid Search | 0.937500 | 0.910891 | 0.909091 | 0.938776 | 0.923077 | 0.924623 | 99.0 | 98.0 |
2 | Logistic Regression | Train Test Undersample | 0.931373 | 0.957895 | 0.959596 | 0.928571 | 0.945274 | 0.943005 | 99.0 | 98.0 |
5 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced | 0.999900 | 0.000294 | 0.796882 | 0.428571 | 0.886922 | 0.000588 | 50168.0 | 7.0 |
6 | Logistic Regression | Train Oversample SMOTE, Test Imbalanced, Grid ... | 0.999909 | 0.000481 | 0.875837 | 0.428571 | 0.933770 | 0.000962 | 50168.0 | 7.0 |
7 | Logistic Regression Polynomial deg 2 | Train Oversample SMOTE, Test Imbalanced, Grid ... | 0.999909 | 0.000481 | 0.875837 | 0.428571 | 0.933770 | 0.000962 | 50168.0 | 7.0 |
0 | Logistic Regression | | 0.999860 | 0.000000 | 1.000000 | 0.000000 | 0.999930 | 0.000000 | 50168.0 | 7.0 |
1 | Logistic Regression | Train Test Imbalanced, Grid Search | 0.999860 | 0.000000 | 1.000000 | 0.000000 | 0.999930 | 0.000000 | 50168.0 | 7.0 |
3 | Logistic Regression | Train Undersample, Test Imbalanced | 0.999860 | 0.000000 | 0.999621 | 0.000000 | 0.999741 | 0.000000 | 50168.0 | 7.0 |
print_confusion_matrix_frauds("Logistic Regression Oversampling \
SMOTE Grid Search from Undersampling",
ytest, ypreds_lr_smote_poly2)
Predicted_No_Fraud | Predicted_Fraud | Total_Frauds | Correct_Frauds | Incorrect_Frauds | Fraud_Detection | |
No_Fraud | 43,939 | 6,229 | 7 | 3 | 4 | 42.86% |
Fraud | 4 | 3 | 7 | 3 | 4 | 42.86% |
plot_confusion_matrix_plotly(ytest, ypreds_lr_smote_poly2)
confusion_matrix(ytest, ypreds_lr)
array([[50168, 0], [ 7, 0]])
idx = idx_no_outliers
cols = features_with_log
X = df.loc[idx,cols].values
y = df.loc[idx,target].values
clf_lr = LogisticRegression(solver='liblinear',
n_jobs=1) # the liblinear solver ignores n_jobs, so it is left at 1.
plot_roc_skf(clf_lr, X,y,random_state=random_state)
yscore_lr = clf_lr.decision_function(Xtest)
ofile = '../reports/html/logistic_regression_model_evaluation.html'
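Continuous scores such as those returned by `decision_function` are what the "Area Under ROC Curve" column is built from: the AUC equals the fraction of (fraud, non-fraud) pairs in which the fraud receives the higher score. The following is a rough, pure-Python sketch of that pairwise definition; `roc_auc_from_scores` is a hypothetical name, the toy labels and scores are made up, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc_from_scores(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC AUC.")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores, invented purely for illustration.
y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc_from_scores(y_toy, score_toy)
print(auc_toy)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A value near 0.5, as in the "Train Undersample, Test Imbalanced" row, means the scores rank frauds no better than chance.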
