Table of Contents

Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.

The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning.

The feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Task     : Detect fraudulent activities.
Metric   : Recall
Sampling : No sampling; use all the data.
Tools    : Python module PyCaret for classification.
Question : How many frauds are correctly classified?

Introduction to PyCaret

PyCaret is a high-level Python module that requires very few lines of code to solve the machine learning problem at hand. It is especially useful for projects under tight time constraints.

It has modules such as anomaly, classification, clustering, datasets, nlp, preprocess and regression.

Some resources for PyCaret: the official website, https://pycaret.org.

Imports
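
A minimal set of imports for this workflow could look like the following sketch (it assumes PyCaret's classification module, pandas and scikit-learn are installed):

import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score

# PyCaret's classification module exposes setup(), compare_models(),
# create_model(), tune_model(), plot_model(), predict_model(), etc.
from pycaret.classification import *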

Useful Functions

Load the data
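
A sketch of loading the Kaggle credit card fraud dataset, assuming it is available locally as creditcard.csv:

data = pd.read_csv('creditcard.csv')  # hypothetical local path
print(data.shape)                     # expected: (284807, 31)
print(data['Class'].value_counts())   # expected: 284,315 legitimate vs. 492 frauds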

Train test split with stratify
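
Because frauds make up only 0.172% of the data, a stratified split is needed to keep the class ratio identical in both partitions. A minimal sketch (the 90/10 split and the seed are assumptions):

train, test = train_test_split(
    data,
    test_size=0.1,           # hold out 10% as unseen data for the final evaluation
    stratify=data['Class'],  # preserve the fraud ratio in both partitions
    random_state=42,
)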

PyCaret Setup

setup(data, target, train_size=0.7, sampling=True, sample_estimator=None,
      categorical_features=None, categorical_imputation='constant',
      ordinal_features=None, high_cardinality_features=None,
      high_cardinality_method='frequency', numeric_features=None,
      numeric_imputation='mean', date_features=None, ignore_features=None,
      normalize=False, normalize_method='zscore', transformation=False,
      transformation_method='yeo-johnson', handle_unknown_categorical=True,
      unknown_categorical_method='least_frequent', pca=False,
      pca_method='linear', pca_components=None, ignore_low_variance=False,
      combine_rare_levels=False, rare_level_threshold=0.1,
      bin_numeric_features=None, remove_outliers=False,
      outliers_threshold=0.05, remove_multicollinearity=False,
      multicollinearity_threshold=0.9, create_clusters=False,
      cluster_iter=20, polynomial_features=False, polynomial_degree=2,
      trigonometry_features=False, polynomial_threshold=0.1,
      group_features=None, group_names=None, feature_selection=False,
      feature_selection_threshold=0.8, feature_interaction=False,
      feature_ratio=False, interaction_threshold=0.01, session_id=None,
      silent=False, profile=False)

sampling: bool, default = True
When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of AUC, Accuracy, Recall, Precision, Kappa and F1 values at various sample levels, which will assist in deciding the preferred sample size for modeling.
The desired sample size must then be entered for training and validation in the pycaret environment. When the sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.
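
A sketch of initialising the PyCaret environment on the training partition; sampling=False follows the "no sampling, use all the data" requirement stated above, and session_id is an arbitrary seed:

clf_setup = setup(
    data=train,
    target='Class',
    train_size=0.7,  # 70% train / 30% internal hold-out
    sampling=False,  # use every row instead of a subsample
    session_id=42,   # fix the seed so runs are reproducible
    silent=True,     # skip the interactive dtype confirmation
)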

Comparing All Models

compare_models(blacklist=None, fold=10, round=4, sort='Accuracy', turbo=True)
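
Since the business metric is recall, it makes sense to sort the comparison by Recall rather than the default Accuracy. A sketch:

# Cross-validate all available estimators and rank them by Recall
compare_models(sort='Recall', fold=10, turbo=True)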

Create Models


Estimator                   Abbreviated String     Original Implementation 
---------                   ------------------     -----------------------
Logistic Regression         'lr'                   linear_model.LogisticRegression
K Nearest Neighbour         'knn'                  neighbors.KNeighborsClassifier
Naive Bayes                 'nb'                   naive_bayes.GaussianNB
Decision Tree               'dt'                   tree.DecisionTreeClassifier
SVM (Linear)                'svm'                  linear_model.SGDClassifier
SVM (RBF)                   'rbfsvm'               svm.SVC
Gaussian Process            'gpc'                  gaussian_process.GPC
Multi-Layer Perceptron      'mlp'                  neural_network.MLPClassifier
Ridge Classifier            'ridge'                linear_model.RidgeClassifier
Random Forest               'rf'                   ensemble.RandomForestClassifier
Quadratic Disc. Analysis    'qda'                  discriminant_analysis.QDA
AdaBoost                    'ada'                  ensemble.AdaBoostClassifier
Gradient Boosting           'gbc'                  ensemble.GradientBoostingClassifier
Linear Disc. Analysis       'lda'                  discriminant_analysis.LDA
Extra Trees Classifier      'et'                   ensemble.ExtraTreesClassifier
Extreme Gradient Boosting   'xgboost'              xgboost.readthedocs.io
Light Gradient Boosting     'lightgbm'             github.com/microsoft/LightGBM
CatBoost Classifier         'catboost'             https://catboost.ai
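
With the abbreviated strings above, building a single cross-validated model takes one line. A sketch using linear discriminant analysis, which is tuned further below:

# Train an LDA classifier with 10-fold cross-validation
lda = create_model('lda', fold=10)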

Hyperparameter Tuning

tune_model(estimator=None, fold=10, round=4, n_iter=10, optimize='Accuracy', ensemble=False, method=None, verbose=True)

n_iter: integer, default = 10
Number of iterations within the Random Grid Search. For every iteration,
the model randomly selects one value from the pre-defined grid of hyperparameters.

ensemble: Boolean, default = False
True enables ensembling of the model through the method defined in the 'method' param.

method: String, 'Bagging' or 'Boosting', default = None
method comes into effect only when ensemble = True. Default is set to None.
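
A sketch of a tuning call that optimises for Recall instead of the default Accuracy (in this PyCaret version the estimator is passed as its abbreviated string):

tuned_lda = tune_model('lda', optimize='Recall', n_iter=10, fold=10)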

Further tuning

Estimator: Linear Disc. Analysis
Abbreviation: 'lda'
Scikit-learn: discriminant_analysis.LDA
LinearDiscriminantAnalysis(n_components=None,
                           priors=None,
                           shrinkage=None,
                           solver='svd',
                           store_covariance=False,
                           tol=0.0001)
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy dataset: two classes of three points each
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Fit LDA with its default 'svd' solver
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)

# Classify a new, unseen point
print(clf.predict([[-0.8, -1]]))  # -> [1]
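
Back in PyCaret, further tuning here simply means widening the random search: raising n_iter samples more combinations from the LDA grid (solver, shrinkage). A sketch:

tuned_lda = tune_model('lda', optimize='Recall', n_iter=50)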

Model Evaluation

plot_model(estimator, plot='auc')

Name                        Abbreviated String     
---------                   ------------------ 
Area Under the Curve         'auc'              
Discrimination Threshold     'threshold'
Precision Recall Curve       'pr'
Confusion Matrix             'confusion_matrix'
Class Prediction Error       'error'
Classification Report        'class_report'
Decision Boundary            'boundary'
Recursive Feat. Selection    'rfe' 
Learning Curve               'learning'
Manifold Learning            'manifold'
Calibration Curve            'calibration'
Validation Curve             'vc' 
Dimension Learning           'dimension'
Feature Importance           'feature'
Model Hyperparameter         'parameter'
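
A sketch of rendering two of the plots above for the tuned model; on a dataset this imbalanced, the precision-recall curve and the confusion matrix are usually more informative than the ROC curve:

plot_model(tuned_lda, plot='pr')                # precision-recall curve
plot_model(tuned_lda, plot='confusion_matrix')  # raw error counts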

Ensemble Modelling

ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, verbose=True)

method: 'Bagging' or 'Boosting', default = 'Bagging'

Bagging
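
A sketch of bagging a decision tree, i.e. fitting n_estimators trees on bootstrap samples of the training data and aggregating their votes:

dt = create_model('dt')
bagged_dt = ensemble_model(dt, method='Bagging', n_estimators=10)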

Boosting
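
The same call with method='Boosting' wraps the estimator in AdaBoost-style sequential reweighting, where each new tree focuses on the examples the previous ones misclassified. A sketch:

boosted_dt = ensemble_model(dt, method='Boosting', n_estimators=10)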

Blending

blend_models(estimator_list='All', fold=10, round=4, method='hard', turbo=True, verbose=True)
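
A sketch of blending: with estimator_list='All', PyCaret trains a voting classifier over the whole model library (turbo=True skips the slowest estimators), and method='hard' uses a majority vote on the predicted labels:

blender = blend_models(estimator_list='All', method='hard', turbo=True)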

Stacking

Stacking is another popular technique for ensembling but is less commonly implemented due to practical difficulties. Stacking is an ensemble learning technique that combines multiple models via a meta-model. Another way to think about stacking is that multiple models are trained to predict the outcome and a meta-model is created that uses the predictions from those models as an input along with the original features.

Selecting which method and models to use in stacking depends on the statistical properties of the dataset. Experimenting with different models and methods is the best way to find out which configuration will work best. However, as a general rule of thumb, models with strong yet diverse performance tend to improve results when used in stacking. One way to measure diversity is the correlation of predictions between models, which you can analyze using the plot parameter.

stack_models(estimator_list, meta_model=None, fold=10, round=4, method='soft', restack=True, plot=False, finalize=False, verbose=True)
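
A sketch of a small stack that follows the rule of thumb above, combining two diverse base learners under a logistic-regression meta-model (this particular trio is illustrative, not prescribed):

stacker = stack_models(
    estimator_list=[create_model('lda'), create_model('lightgbm')],
    meta_model=create_model('lr'),
    method='soft',  # stack predicted probabilities rather than labels
    restack=True,   # also pass the original features to the meta-model
)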

Model Calibration

When performing classification you often want not only to predict the class label (an outcome such as 0 or 1) but also to obtain the probability of that outcome, which provides a level of confidence in the prediction. Some models give poor estimates of the class probabilities and some do not support probability prediction at all. Well-calibrated classifiers are probabilistic and provide outputs in the form of probabilities that can be directly interpreted as a confidence level. PyCaret allows you to calibrate the probabilities of a given model through the calibrate_model() function.

calibrate_model(estimator, method='sigmoid', fold=10, round=4, verbose=True)

method : string, default = 'sigmoid'
The method to use for calibration. Can be 'sigmoid', which corresponds to Platt's
method, or 'isotonic', which is a non-parametric approach. It is not advised to use
isotonic calibration with too few calibration samples.
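
A sketch of calibrating the tuned model with Platt scaling and then inspecting the reliability curve; with roughly a quarter of a million training rows, isotonic calibration would also be a reasonable choice:

calibrated_lda = calibrate_model(tuned_lda, method='sigmoid', fold=10)
plot_model(calibrated_lda, plot='calibration')  # reliability curve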

Model Interpretation

interpret_model(estimator, plot='summary', feature=None, observation=None)
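
interpret_model is SHAP-based and therefore works with tree-based estimators; a sketch using a LightGBM model:

lgbm = create_model('lightgbm')
interpret_model(lgbm, plot='summary')  # SHAP summary of feature impact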

Model Predictions

predict_model(estimator, data=None, probability_threshold=None,
              platform=None, authentication=None)
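
A sketch of generating predictions: called without data, predict_model scores the internal hold-out split created by setup(); probability_threshold can be lowered to trade precision for recall, which suits this fraud-detection objective:

# Score the internal hold-out set
predict_model(calibrated_lda)

# Hypothetical: lower the decision threshold to catch more frauds
predict_model(calibrated_lda, probability_threshold=0.3)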

Model Persistence

Finalize model (fit on the whole training data)
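
finalize_model refits the chosen pipeline on the entire training data, including the internal hold-out split, and save_model persists the full preprocessing-plus-model pipeline to disk. A sketch (the file name is arbitrary):

final_model = finalize_model(calibrated_lda)
save_model(final_model, 'fraud_lda_pipeline')  # writes fraud_lda_pipeline.pkl

# Later, in another session:
loaded_model = load_model('fraud_lda_pipeline')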

Model Evaluation for Test Data
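
A sketch of answering the business question on the unseen test partition; it assumes PyCaret's prediction output columns, where 'Label' holds the predicted class:

preds = predict_model(final_model, data=test)
y_true = preds['Class'].astype(int)
y_pred = preds['Label'].astype(int)
print(confusion_matrix(y_true, y_pred))         # bottom-right cell = frauds caught
print('Recall:', recall_score(y_true, y_pred))  # fraction of frauds detected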

Time taken
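
A simple sketch of the timing pattern, using the time module imported earlier:

start = time.time()
# ... run the full workflow above ...
print(f'Time taken: {time.time() - start:.1f} seconds')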