Table of Contents

Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
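
As a quick sanity check on the imbalance, the arithmetic can be sketched in a few lines. The two counts are taken from the description above; the variable names are only illustrative.

```python
# Counts quoted in the data description above.
n_total = 284_807
n_fraud = 492

n_normal = n_total - n_fraud
fraud_pct = 100 * n_fraud / n_total          # share of frauds, in percent
frauds_per_1000 = 1000 * n_fraud / n_total   # same share, per 1,000 transactions

print(f"normal transactions: {n_normal}")
print(f"fraud share        : {fraud_pct:.4f}%")
print(f"frauds per 1,000   : {frauds_per_1000:.2f}")
```

The share comes out at roughly 0.17%, i.e. fewer than 2 frauds in every 1,000 transactions.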

It contains only numerical input variables which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ... V28 are the principal components obtained with PCA. The only features which have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used, for example, for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Task    : Detect the fraudulent activities.
Metric : Recall
Sampling: Synthetic Minority Over-Sampling Technique (SMOTE)
Question: How many frauds are correctly classified?

Remember that Recall = TP / (TP + FN). In fraud detection, classifying a fraud as non-fraud (FN) is the riskier error, so we use recall to compare the performance of the models. The higher the recall, the better the model.
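
The formula can be illustrated with a minimal sketch. The labels below are made up for the example (1 = fraud, 0 = non-fraud), and the helper name `recall` is an assumption, not something defined in this notebook.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    positives = tp + fn
    return tp / positives if positives else 0.0

# Hypothetical labels: 1 = fraud, 0 = non-fraud.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

# TP: frauds predicted as fraud. FN: frauds predicted as non-fraud.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(f"TP={tp}, FN={fn}, recall={recall(tp, fn):.2f}")
```

Note that the one false positive in `y_pred` does not affect recall at all, which is why recall alone says nothing about how many normal transactions get flagged.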

The dataset is highly imbalanced. It has about 284k non-frauds and only 492 frauds. This means that out of 1,000 transactions, roughly 998 are normal and 2 are fraud cases.

Also, we should note that the data covers just two days. We implicitly assume that these two days are representative of the overall trend and reflect the properties of the population properly.

There could have been more or fewer fraudulent transactions on those particular days, but we do not take that into consideration and we generalize the result. Put differently, the conclusions reached from these two days of data are appropriate for populations whose data distribution is similar to that of these two days.

We are more interested in finding the fraud cases. A false negative (FN), i.e., predicting a fraud as non-fraud, is riskier than a false positive, i.e., predicting a non-fraud as fraud. So, the suitable metric for model evaluation is RECALL.

In banking, there are typically a lot of normal transactions and only a few fraudulent ones. We may train our model with any transformation of the training data, but when testing the model the test set should look like real life, i.e., it should have lots of normal cases and very few fraudulent cases.

This means we can train our model on imbalanced or balanced (undersampled or oversampled) data, but we should test our model on the IMBALANCED dataset.
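
A minimal sketch of this idea, using only the standard library: split first, then rebalance the training part only, so the test part keeps the real-life class ratio. The toy labels and the helper names `stratified_split` and `undersample` are assumptions made for illustration; the notebook itself may do this differently (for example with scikit-learn and SMOTE).

```python
import random

def stratified_split(labels, test_frac, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def undersample(indices, labels, rng):
    """Randomly drop majority rows (class 0) to match the minority size.

    Assumes class 1 (fraud) is the minority class.
    """
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    return pos + rng.sample(neg, len(pos))

rng = random.Random(0)
labels = [1] * 20 + [0] * 980          # toy data: 2% fraud

# 1) Split first, so the test set is never touched by resampling.
train_idx, test_idx = stratified_split(labels, test_frac=0.25, rng=rng)

# 2) Balance the training indices only.
balanced_train_idx = undersample(train_idx, labels, rng)

train_frauds = sum(labels[i] for i in balanced_train_idx)
test_frauds = sum(labels[i] for i in test_idx)
print(f"balanced train: {len(balanced_train_idx)} rows, {train_frauds} frauds")
print(f"imbalanced test: {len(test_idx)} rows, {test_frauds} frauds")
```

Resampling before the split would leak resampled rows into the test set and make the reported recall look better than it would be on real traffic.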

WARNINGS for Serialization:
When using the pickled object (serialized object), the machine should have the same versions of all the libraries used, such as numpy, pandas, scikit-learn, and all other dependencies.
So, to load the serialized object, make sure you have the same conda environment as when the serialized object was created.
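
One way to guard against version drift is to store the library versions next to the pickled object and compare them before loading it. The sketch below is an illustration under assumptions: the helper names and the `PACKAGES` list are hypothetical, and a plain dict stands in for a fitted model.

```python
import io
import pickle
import platform
from importlib import metadata

PACKAGES = ["numpy", "pandas", "scikit-learn"]   # assumed dependency list

def current_versions(packages):
    """Map 'python' and each package name to its installed version (or None)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def dump_with_versions(obj, packages):
    """Write two pickles back to back: the version record, then the object."""
    buf = io.BytesIO()
    pickle.dump(current_versions(packages), buf)
    pickle.dump(obj, buf)
    return buf.getvalue()

def load_with_check(blob, packages):
    """Read the version record first; unpickle the object only if it matches."""
    buf = io.BytesIO(blob)
    saved = pickle.load(buf)
    current = current_versions(packages)
    mismatches = {k: (saved.get(k), v) for k, v in current.items()
                  if saved.get(k) != v}
    if mismatches:
        raise RuntimeError(f"version mismatch (saved, current): {mismatches}")
    return pickle.load(buf)

model = {"coef": [0.1, 0.2]}                     # stand-in for a fitted model
blob = dump_with_versions(model, PACKAGES)
restored = load_with_check(blob, PACKAGES)
print(restored)
```

The version record is read before the object itself, so an incompatible environment is reported up front instead of failing somewhere inside unpickling.
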
NOTE: When using Logistic Regression for classification problems, scikit-learn offers different solvers such as `liblinear`, `lbfgs`, `sag`, `saga` and `newton-cg`.
`liblinear` supports only the `l1` and `l2` penalties, and `saga` is the only solver that supports `elasticnet`.

For small datasets, `liblinear` is a good choice, whereas `sag` and `saga` are faster for large ones.
For multiclass problems, only `newton-cg`, `sag`, `saga` and `lbfgs` handle multinomial loss; `liblinear` is limited to one-versus-rest schemes.
`newton-cg`, `lbfgs`, `sag` and `saga` handle L2 or no penalty.
`liblinear` and `saga` also handle L1 penalty.
`saga` also supports `elasticnet` penalty.
`liblinear` does not handle no penalty.
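
The compatibility rules above can be written down as a small lookup table, which is handy for building a grid search that skips invalid solver/penalty pairs. This is a sketch in plain Python, not scikit-learn code: the table only restates the rules listed above, the helper name `valid_grid` is hypothetical, and the string `"none"` is an assumed label for "no penalty" (the exact spelling scikit-learn accepts depends on the installed version).

```python
# Penalties each solver handles, as listed in the note above.
# "none" is an assumed label meaning "no penalty".
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs":     {"l2", "none"},
    "sag":       {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "saga":      {"l1", "l2", "elasticnet", "none"},
}

def valid_grid(solvers, penalties):
    """Return only the (solver, penalty) pairs the table marks as supported."""
    return [(s, p) for s in solvers for p in penalties
            if p in SUPPORTED_PENALTIES[s]]

penalties = ["l1", "l2", "elasticnet", "none"]
grid = valid_grid(list(SUPPORTED_PENALTIES), penalties)
for solver, penalty in grid:
    print(f"{solver:10s} {penalty}")
```

Filtering the grid up front avoids fits that would fail on an unsupported combination partway through a long search.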

Imports

Useful Scripts

Load the data

Preprocessing

Class Balance

Correlation with target

Distribution plots

Scaling

Outliers Removal

StratifiedKFold splitting for imbalanced data

Check for nans before modelling

Modelling LR with Imbalanced Data

Plain Logistic Regression

Grid Search for Logistic Regression

Modelling LR with Undersampling Balanced Data

Plain Logistic Regression for Undersampling

Logistic Regression Train on Undersample, Test in Imbalanced

Grid Search for Logistic Regression with Undersampling

Modelling LR with Oversampling SMOTE

Oversampling using SMOTE during cross validation using Logistic Regression

Train Oversample SMOTE, Test Imbalanced

Polynomial Regression Train SMOTE, Test Imbalanced

Serialize the model object and dump to a file

Model Evaluation Metrics

Scalar Classification Metrics

Classification Report

Confusion Matrix

Area Under ROC

Interactive model evaluation plots