Table of Contents

Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
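A quick sanity check of the imbalance figure from the counts above (pure arithmetic, no data file needed):

```python
# Counts reported in the data description
frauds = 492
total = 284_807

ratio = frauds / total
print(f"Fraud ratio: {ratio:.2%}")  # roughly 0.17% of all transactions
```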

It contains only numerical input variables, which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Task     : Detect fraudulent transactions.
Metric   : Recall
Sampling : No sampling; use all the data.
Tools    : Python module PyCaret for classification.
Question : How many frauds are correctly classified?
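Since recall is the chosen metric, a minimal sketch of how it is computed with scikit-learn; the labels below are made up for illustration:

```python
from sklearn.metrics import recall_score

# Toy labels: 1 = fraud, 0 = legitimate (made-up values for illustration)
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1]

# Recall = correctly flagged frauds / all actual frauds = 3 / 4
r = recall_score(y_true, y_pred)
print(r)  # 0.75
```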

Introduction to Boosting

The term Boosting refers to a family of algorithms that convert weak learners into strong learners.

There are many boosting algorithms:

sklearn.ensemble.GradientBoostingRegressor  # reference implementation
xgboost.XGBRegressor        # fast and accurate
lightgbm.LGBMRegressor      # extremely fast, slightly less accurate than xgboost
catboost.CatBoostRegressor  # handles categorical features well
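A minimal sketch of one of the listed boosters, using the scikit-learn implementation on synthetic data (all data and parameter values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem: a noisy linear target
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] * 2 + X[:, 1] - X[:, 2] + rng.normal(0, 0.05, 200)

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0)
model.fit(X, y)
s = model.score(X, y)
print(s)  # training R^2, close to 1 on this easy toy problem
```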

Colab

Imports

Useful Scripts

Load the data

Train test split with stratify

Train Validation with stratify
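The two stratified splits above can be sketched with scikit-learn's train_test_split; the labels below are a synthetic stand-in for the 'Class' column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels standing in for the 'Class' column
y = np.array([0] * 990 + [1] * 10)
X = np.arange(1000).reshape(-1, 1)

# Hold out 20% for test, stratifying on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Carve a validation set out of the training portion, again stratified
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

print(y_test.mean(), y_val.mean())  # fraud ratio preserved in every split
```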

Modelling CatBoost

https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

class CatBoostRegressor(

iterations=None,                 learning_rate=None,
depth=None,                      l2_leaf_reg=None,
model_size_reg=None,             rsm=None,
loss_function='RMSE',            border_count=None,
feature_border_type=None,        per_float_feature_quantization=None,
input_borders=None,              output_borders=None,
fold_permutation_block=None,     od_pval=None,
od_wait=None,                    od_type=None,
nan_mode=None,                   counter_calc_method=None,
leaf_estimation_iterations=None, leaf_estimation_method=None,
thread_count=None,               random_seed=None,
use_best_model=None,             best_model_min_trees=None,
verbose=None,                    silent=None,
logging_level=None,              metric_period=None,
ctr_leaf_count_limit=None,       store_all_simple_ctr=None,
max_ctr_complexity=None,         has_time=None,
allow_const_label=None,          one_hot_max_size=None,
random_strength=None,            name=None,
train_dir=None,                  custom_metric=None,
eval_metric=None,                bagging_temperature=None,
save_snapshot=None,              snapshot_file=None,
snapshot_interval=None,          fold_len_multiplier=None,
used_ram_limit=None,             gpu_ram_part=None,
pinned_memory_size=None,         allow_writing_files=None,
final_ctr_computation_mode=None, approx_on_full_history=None,
boosting_type=None,              simple_ctr=None,
combinations_ctr=None,           per_feature_ctr=None,
ctr_target_border_count=None,    task_type=None,
device_config=None,              devices=None,
bootstrap_type=None,             subsample=None,
sampling_unit=None,              dev_score_calc_obj_block_size=None,
max_depth=None,                  n_estimators=None,
num_boost_round=None,            num_trees=None,
colsample_bylevel=None,          random_state=None,
reg_lambda=None,                 objective=None,
eta=None,                        max_bin=None,
gpu_cat_features_storage=None,   data_partition=None,
metadata=None,                   early_stopping_rounds=None,
cat_features=None,               grow_policy=None,
min_data_in_leaf=None,           min_child_samples=None,
max_leaves=None,                 num_leaves=None,
score_function=None,             leaf_estimation_backtracking=None,
ctr_history_unit=None,           monotone_constraints=None
)

CatBoost with validation set

Feature Statistics

CatBoost tutorials: model analysis, feature statistics tutorial

Feature Importance

CatBoost using Pool

Cross Validation

cv(pool=None, params=None, dtrain=None, iterations=None, 
num_boost_round=None, fold_count=None, nfold=None, inverted=False,
partition_random_seed=0, seed=None, shuffle=True, logging_level=None,
stratified=None, as_pandas=True, metric_period=None, verbose=None,
verbose_eval=None, plot=False, early_stopping_rounds=None,
save_snapshot=None, snapshot_file=None,
snapshot_interval=None, folds=None, type='Classical')

HPO (Hyper Parameter Optimization)

We should generally optimize model complexity first, then tune convergence.

model complexity: max_depth, num_leaves, etc.
convergence: learning_rate, iterations

Parameters:

Baseline model

Using Early Stopping from Validation Set

Try your luck with different random states

HPO Hyper Parameter Optimization with Optuna

Best Model

Model Interpretation

Model interpretation using eli5

Model interpretation using lime

Model Evaluation Using shap

Time Taken
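A sketch of measuring wall-clock time for the run, using only the standard library (the workload below is a stand-in for the notebook's cells):

```python
import time

t0 = time.perf_counter()
# ... the notebook's training and evaluation would run here ...
total = sum(i * i for i in range(100_000))  # stand-in workload
elapsed = time.perf_counter() - t0
print(f"Time taken: {elapsed:.2f} s")
```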