Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ... V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Business Problem:
Task    : Detect the fraudulent activities.
Metric : Recall
Sampling: Synthetic Minority Over-Sampling Technique (SMOTE)
Question: How many frauds are correctly classified?

Remember that Recall = TP / (TP + FN). In fraud detection, classifying a fraud as non-fraud (FN) is the riskier error, so we use recall to compare the performance of the models. The higher the recall, the better the model.
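As a minimal sketch of the formula above, recall can be computed directly from true and predicted labels. The function name `recall` and the toy labels are made up for illustration; they are not taken from the notebook's data.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true; recall is undefined")
    return tp / (tp + fn)


# Toy labels (hypothetical): 4 frauds, of which 3 are caught and 1 is missed.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]
print(recall(y_true, y_pred))  # 0.75
```

Note that the false positive at the second position does not affect recall; it would only lower precision.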

The dataset is highly imbalanced: it has about 284k non-frauds and only 492 frauds. This means that out of 1000 transactions, roughly 998 are normal and 2 are fraud cases.

Also, we should note that the data covers just two days. We implicitly assume that these two days are representative of the overall trend and properly reflect the properties of the population.

There could have been more or fewer fraudulent transactions than usual on those particular days, but we do not take that into consideration and generalize the result. Put differently, the conclusions drawn from these two days are appropriate for populations whose data distribution is similar to that of these two days.

We are most interested in finding the fraud cases. Predicting a fraud as non-fraud (a false negative, FN) is riskier than predicting a non-fraud as fraud, so the suitable metric for model evaluation is RECALL.

In banking, it is always the case that there are a lot of normal transactions and only a few fraudulent ones. We may train our model with any transformation of the training data, but when testing the model the test set should look like real life, i.e., it should have lots of normal cases and very few fraudulent cases.

This means we can train our model on imbalanced or balanced (undersampled or oversampled) data, but we should test it on the IMBALANCED dataset.
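The idea above can be sketched as follows: split first, then rebalance only the training part, so the test set keeps its real-life class ratio. This sketch uses plain random oversampling as a simple stand-in for SMOTE (SMOTE synthesizes new minority points instead of duplicating existing ones). The helper names `stratified_split` and `oversample_minority` and the toy labels are assumptions for illustration only.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test while keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        idxs = idxs[:]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def oversample_minority(indices, labels, seed=0):
    """Duplicate randomly chosen minority indices until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


# Toy labels (hypothetical): 990 normal transactions and 10 frauds.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
train_balanced = oversample_minority(train_idx, labels)

print(len(test_idx), sum(labels[i] for i in test_idx))              # test set stays imbalanced
print(len(train_balanced), sum(labels[i] for i in train_balanced))  # training set is balanced
```

Resampling after the split also keeps duplicated minority rows out of the test set, which would otherwise inflate the measured recall.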

WARNINGS for Serialization:
When using the pickled (serialized) object, the machine should have the same versions of all libraries that were used to create it, such as numpy, pandas, scikit-learn, and their dependencies.
So, to load the serialized object, make sure the conda environment is the same as the one in which the object was created.
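One possible way to make this warning enforceable is to store the library versions next to the object and compare them before unpickling. The sketch below is an assumption-laden illustration, not the notebook's own method: the helper names `current_versions`, `dump_with_versions`, and `load_with_check` are hypothetical, and the module list is only an example. It writes two pickles into one file, the version record first, so the check runs before the payload is loaded.

```python
import importlib
import os
import pickle
import platform
import tempfile


def current_versions(modules):
    """Return the Python version plus each module's version (None if not installed)."""
    versions = {"python": platform.python_version()}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def dump_with_versions(obj, path, modules=("numpy", "pandas", "sklearn")):
    """Write the version record first, then the object, into the same file."""
    with open(path, "wb") as fh:
        pickle.dump(current_versions(modules), fh)
        pickle.dump(obj, fh)


def load_with_check(path):
    """Read the version record, fail on any mismatch, and only then load the object."""
    with open(path, "rb") as fh:
        saved = pickle.load(fh)
        modules = [name for name in saved if name != "python"]
        now = current_versions(modules)
        mismatches = {k: (saved[k], now[k]) for k in saved if saved[k] != now[k]}
        if mismatches:
            raise RuntimeError(f"version mismatch (saved, current): {mismatches}")
        return pickle.load(fh)


# Round trip in a temporary directory with a placeholder payload.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    dump_with_versions({"coef": [0.1, 0.2]}, path, modules=("numpy",))
    print(load_with_check(path))
```

Exporting the conda environment alongside the file remains the more complete safeguard; this check only catches mismatches in the modules that were explicitly listed.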
NOTE: When using Logistic Regression for classification problems, we have different solvers in scikit-learn such as `liblinear`, `lbfgs`, `sag`, `saga` and `newton-cg`.
`liblinear` supports only the `l1` and `l2` penalties, and `saga` is the only solver that supports `elasticnet`.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty.
‘liblinear’ and ‘saga’ also handle L1 penalty.
‘saga’ also supports ‘elasticnet’ penalty.
‘liblinear’ does not support training without a penalty.
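The solver and penalty rules listed above can be written down as a small lookup table, which is handy for validating a configuration before fitting. This is only a sketch built from the list above; `SUPPORTED_PENALTIES`, `is_supported`, and `solvers_for` are hypothetical names, not part of scikit-learn, and `None` stands for "no penalty".

```python
# Penalties each solver accepts, transcribed from the rules above.
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "sag": {"l2", None},
    "saga": {"l1", "l2", "elasticnet", None},
}


def is_supported(solver, penalty):
    """Return True if the solver accepts the penalty according to the table."""
    if solver not in SUPPORTED_PENALTIES:
        raise ValueError(f"unknown solver: {solver!r}")
    return penalty in SUPPORTED_PENALTIES[solver]


def solvers_for(penalty):
    """List the solvers that accept the given penalty, in alphabetical order."""
    return sorted(s for s, allowed in SUPPORTED_PENALTIES.items() if penalty in allowed)


print(solvers_for("l1"))          # ['liblinear', 'saga']
print(solvers_for("elasticnet"))  # ['saga']
print(is_supported("liblinear", None))  # False
```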

Imports

Useful Scripts

Load the data

EDA

Feature Engineering

Temporal Variables

Bucketizing Numerical Features

Dummy Variables for Binned and Categorical Features

Log Transform Large Numerical Features

We generally apply a log or Box-Cox transformation to features with large values to make their distribution look more Gaussian.
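A small illustration of the effect, using made-up amounts rather than the real 'Amount' column: `log1p` (that is, log(1 + x)) is used here so that a zero value stays finite, and sample skewness is used as a rough indicator of how far the distribution is from symmetric. The function names are hypothetical.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) element-wise; values must be greater than -1."""
    return [math.log1p(v) for v in values]


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical transaction amounts with one very large value.
amounts = [1.0, 2.5, 9.99, 25.0, 120.0, 2500.0]
logged = log1p_transform(amounts)

print(skewness(amounts))  # strongly right-skewed
print(skewness(logged))   # noticeably closer to 0
```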

Combine All Features to One Column

Feature Scaling

Train Test Splitting

Modelling

Logistic Regression

Model Evaluation

Predictions

Model Evaluation Using ml.evaluation

https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/evaluation.html

Binary Evaluator metrics: areaUnderROC|areaUnderPR
Multiclass Evaluator metrics: f1|weightedPrecision|weightedRecall|accuracy

KNN Classifier

The evaluator for clustering results expects two input columns: prediction and features.

The metric computes the Silhouette measure using the squared Euclidean distance.

The Silhouette is a measure for the validation of the consistency
within clusters. It ranges between 1 and -1, where a value close to
1 means that the points in a cluster are close to the other points
in the same cluster and far from the points of the other clusters.
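To make the definition above concrete, here is a naive pure-Python sketch of the Silhouette measure with squared Euclidean distance. It is quadratic in the number of points and is not Spark's implementation; the function names and the toy points are assumptions for illustration. For each point, `a` is the mean distance to the other points of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b).

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points; points in singleton clusters score 0."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The point's distance to itself is 0, so dividing by len(own) - 1
        # gives the mean distance to the *other* members of its cluster.
        a = sum(squared_euclidean(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(squared_euclidean(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups (hypothetical points).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good)  # close to 1: clusters are compact and far apart
print(bad)   # negative: each point is closer to the other cluster
```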

Decision Tree Classifier

Random Forest Classifier

Parameter Tuning

Feature Importances

Confusion Matrix