Table of Contents

Introduction to Project

In this project, we detect whether a given sample of medical data corresponds to a cancer cell. The data has 33 features, and the target feature is diagnosis.

Imports

Useful Scripts

Load the data

Data Manipulation

Exploratory Data Analysis

Data Preparation for Modelling

Correlation

Modelling: Boosting with XGBoost

Default XGBoost

Remove correlated features

Recursive Feature Elimination

HPO: GridSearch

Important Parameters:

learning_rate: step-size shrinkage applied to each new tree's contribution; smaller values help prevent overfitting. Range is [0, 1].
max_depth: maximum depth each tree is allowed to reach during any boosting round. Deeper trees capture more complex patterns but overfit more easily.
subsample: fraction of training samples used per tree. A low value can lead to underfitting.
colsample_bytree: fraction of features used per tree. A high value can lead to overfitting.
n_estimators: number of trees (boosting rounds) to build.

Regularization parameters:

gamma: minimum reduction in loss required to split a node. A higher value leads to fewer splits. Supported only for tree-based learners.
alpha: L1 regularization on leaf weights (reg_alpha in the scikit-learn API). A larger value means stronger regularization.
lambda: L2 regularization on leaf weights (reg_lambda in the scikit-learn API). Its penalty is smoother than L1 regularization.
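
As a rough illustration of how the parameters above could feed a grid search, the sketch below builds a small search grid and counts the combinations it expands to. The grid values, the names PARAM_GRID, expand_grid and run_grid_search, the roc_auc scoring choice and the 5-fold setting are all assumptions made for illustration, not the settings used in this project. The helper run_grid_search additionally assumes that xgboost and scikit-learn are installed and that the diagnosis target has already been encoded as 0/1.

```python
from itertools import product

# Hypothetical search grid; the values are illustrative, not tuned.
PARAM_GRID = {
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "n_estimators": [100, 300],
    "gamma": [0, 1],
    "reg_alpha": [0, 0.1],  # called "alpha" in the native XGBoost API
    "reg_lambda": [1, 10],  # called "lambda" in the native XGBoost API
}


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def run_grid_search(X, y, grid=PARAM_GRID, cv=5):
    """Cross-validated grid search over an XGBClassifier.

    Assumes xgboost and scikit-learn are installed and y is encoded as 0/1.
    """
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    search = GridSearchCV(
        estimator=XGBClassifier(),
        param_grid=grid,
        scoring="roc_auc",  # assumed metric; pick one that suits the task
        cv=cv,
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_


combos = expand_grid(PARAM_GRID)
print(len(combos))  # 3 * 2**7 = 384 candidate settings
```

Each candidate is refit once per fold, so this grid would mean 384 x 5 model fits. That cost is one reason to keep grids small or to move to a sampler such as Hyperopt.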

HPO: Hyperopt

Model Evaluation

Model Interpretation

Time taken