The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided.
Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.
The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
The feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
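As a concrete (hypothetical) illustration of example-dependent cost-sensitive learning, the transaction Amount can be turned into a per-sample weight via scikit-learn's sample_weight. The 1 + Amount weighting scheme below is an assumption for illustration only, and df, features and target are defined later in this notebook.

# illustrative sketch only: weight each example by its transaction amount
# so that misclassifying a large transaction is penalized more in training
from sklearn.linear_model import LogisticRegression

X = df[features]            # df, features, target are defined later in this notebook
y = df[target]
w = 1.0 + df['Amount']      # assumed cost proxy: larger amount, larger weight
clf_cost = LogisticRegression(max_iter=1000)
clf_cost.fit(X, y, sample_weight=w)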
PyCaret is a high-level Python module that requires very few lines of code to solve the machine learning problem at hand. The module is useful for projects with extremely tight time constraints.
It has modules such as anomaly, classification, clustering, datasets, nlp, preprocess and regression.
Some resources for pycaret:
- Official website: https://pycaret.org
- Source repository: https://github.com/pycaret/pycaret
import time
time_start_notebook = time.time()
import numpy as np
import pandas as pd
import seaborn as sns
import os
from pathlib import Path
from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot')
# random state
SEED = 100  # session_id=100 is reported in the setup summary below
RNG = np.random.RandomState(SEED)
home = os.path.expanduser('~')
[(x.__name__,x.__version__) for x in [np,pd,sns]]
from zipfile import ZipFile
%%capture
import sys
ENV_COLAB = 'google.colab' in sys.modules
if ENV_COLAB:
!pip install ipywidgets
!pip install pycaret
!jupyter nbextension enable --py widgetsnbextension
from pycaret.utils import enable_colab
enable_colab()
# set OMP_NUM_THREADS=1 for hpsklearn package
#!export OMP_NUM_THREADS=1
print('Environment: Google Colab')
[pip output truncated: all requirements for ipywidgets and pycaret already satisfied]
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
Colab mode activated.
Environment: Google Colab
import pycaret
from pycaret.utils import version
import pycaret.classification as pyc
version()
1.0.0
from IPython.display import display

def compare_new_models(name, desc, mean_row, df_eval=None, sort='Accuracy', show=True):
    """Create a dataframe row from the Mean line of a pycaret score grid.

    Parameters
    ----------
    name: str
        Name of the model, e.g. xgboost.
    desc: str
        Description of the model, e.g. tuned, calibrated.
    mean_row: str
        The Mean line copied from the Jupyter notebook output of a
        pycaret model. Note that the fields are separated by tabs, e.g.
        'Mean\t0.9992\t0.9663\t0.7214\t0.8299\t0.7679\t0.7675'
    df_eval: pandas DataFrame
        Template pandas dataframe to append to (a new one is created
        if None is given).
    sort: str
        One of the following strings: Accuracy, AUC, Recall, Precision,
        F1, Kappa.
    show: bool
        If True, display the styled dataframe.

    Returns
    -------
    Pandas DataFrame.
    """
    mean_row_lst = mean_row.split('\t')
    assert len(mean_row_lst) == 7
    if not isinstance(df_eval, pd.DataFrame):
        df_eval = pd.DataFrame({'Model': [],
                                'Description': [],
                                'Accuracy': [],
                                'AUC': [],
                                'Recall': [],
                                'Precision': [],
                                'F1': [],
                                'Kappa': []})
    # convert the metric strings to floats so sorting is numeric, not lexical
    acc, auc, rec, pre, f1, kap = [float(x) for x in mean_row_lst[1:]]
    row = [name, desc, acc, auc, rec, pre, f1, kap]
    df_eval.loc[len(df_eval)] = row
    df_eval = (df_eval.drop_duplicates()
                      .sort_values(sort, ascending=False))
    df_eval.index = range(len(df_eval))
    # highlight the column used for sorting
    df_style = df_eval.style.apply(
        lambda ser: ['background: lightblue'
                     if ser.name == sort else ''
                     for _ in ser])
    if show:
        display(df_style)
    return df_eval
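A quick usage sketch: the fields in mean_row must be tab-separated (the literal strings later in this notebook contain real tabs even though they render as spaces), otherwise the assert fails.

# hypothetical usage with an explicitly tab-separated Mean row
mean_row_demo = 'Mean\t0.9992\t0.9663\t0.7214\t0.8299\t0.7679\t0.7675'
df_demo = compare_new_models('demo', 'default', mean_row_demo,
                             sort='Recall', show=False)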
ifile = 'https://github.com/bhishanpdl/Datasets/blob/master/fraud_detection/creditcard.csv.zip?raw=true'
df = pd.read_csv(ifile,compression='zip')
print(df.shape)
df.head()
(284807, 31)
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | -0.551600 | -0.617801 | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | 1.612727 | 1.065235 | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | 0.624501 | 0.066084 | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | -0.226487 | 0.178228 | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | -0.822843 | 0.538196 | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
#ifile = Path.home() / 'Datasets/kaggle/creditcard/creditcard.csv.zip'
#zip_file = ZipFile(ifile)
#df = pd.read_csv(zip_file.open('creditcard.csv'))
#print(df.shape)
#df.head()
target = 'Class'
features = df.columns.drop(target)
df[target].value_counts(normalize=True)*100
0    99.827251
1     0.172749
Name: Class, dtype: float64
sns.countplot(df[target])
<matplotlib.axes._subplots.AxesSubplot at 0x7f34aa890d68>
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(
df,test_size=0.2, random_state=SEED,
stratify=df[target])
print(df_train.shape)
df_train.head()
(227845, 31)
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
211885 | 138616.0 | -1.137612 | 2.345154 | -1.767247 | 0.833982 | 0.973168 | -0.073571 | 0.802433 | 0.733137 | -1.154087 | -0.520340 | 0.494117 | 0.799935 | 0.494576 | -0.479666 | -0.917177 | -0.184117 | 1.189459 | 0.937244 | 0.960749 | 0.062820 | 0.114953 | 0.430613 | -0.240819 | 0.124011 | 0.187187 | -0.402251 | 0.196277 | 0.190732 | 39.46 | 0 |
12542 | 21953.0 | -1.028649 | 1.141569 | 2.492561 | -0.242233 | 0.452842 | -0.384273 | 1.256026 | -0.816401 | 1.964560 | -0.014216 | 0.432153 | -2.140921 | 2.274477 | 0.114128 | -1.652894 | -0.617302 | 0.243791 | -0.426168 | -0.493177 | 0.350032 | -0.380356 | -0.037432 | -0.503934 | 0.407129 | 0.604252 | 0.233015 | -0.433132 | -0.491892 | 7.19 | 0 |
270932 | 164333.0 | -1.121864 | -0.195099 | 1.282634 | -3.172847 | -0.761969 | -0.287013 | -0.586367 | 0.496182 | -2.352349 | 0.350551 | -1.319688 | -0.942001 | 1.082210 | -0.425735 | 0.036748 | 0.380392 | -0.033353 | 0.204609 | -0.801465 | -0.113632 | -0.328953 | -0.856937 | -0.056198 | 0.401905 | 0.406813 | -0.440140 | 0.152356 | 0.030128 | 40.00 | 0 |
30330 | 35874.0 | 1.094238 | -0.760568 | -0.392822 | -0.611720 | -0.722850 | -0.851978 | -0.185505 | -0.095131 | -1.122304 | 0.367009 | 1.378493 | -0.724216 | -1.105406 | -0.480170 | 0.220826 | 1.745743 | 0.740817 | -0.728827 | 1.016740 | 0.354148 | -0.227392 | -1.254285 | 0.022116 | -0.141531 | 0.114515 | -0.652427 | -0.037897 | 0.051254 | 165.85 | 0 |
272477 | 165107.0 | 2.278095 | -1.298924 | -1.884035 | -1.530435 | -0.649500 | -0.996024 | -0.466776 | -0.438025 | -1.612665 | 1.631133 | -1.126000 | -0.938760 | 0.300621 | -0.119667 | -0.585453 | -1.106244 | 0.690235 | -0.124401 | -0.075649 | -0.341708 | 0.123892 | 0.815909 | -0.072537 | 0.784217 | 0.403428 | 0.193747 | -0.043185 | -0.058719 | 60.00 | 0 |
The setup function will automatically infer datatypes, but if we see a wrong data-type inference, we can use the parameters numeric_features and categorical_features.

setup(data, target, train_size=0.7, sampling=True, sample_estimator=None, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, high_cardinality_method='frequency', numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_outliers=False, outliers_threshold=0.05, remove_multicollinearity=False, multicollinearity_threshold=0.9, create_clusters=False, cluster_iter=20, polynomial_features=False, polynomial_degree=2, trigonometry_features=False, polynomial_threshold=0.1, group_features=None, group_names=None, feature_selection=False, feature_selection_threshold=0.8, feature_interaction=False, feature_ratio=False, interaction_threshold=0.01, session_id=None, silent=False, profile=False)
sampling: bool, default = True
When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of AUC, Accuracy, Recall, Precision, Kappa and F1 values at various sample levels, that will assist in deciding the preferred sample size for modeling.
The desired sample size must then be entered for training and validation in the pycaret environment. When sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.
experiment_may31_2020 = pyc.setup(df_train,'Class',
train_size=0.8,
session_id=SEED,
sampling= True,
silent=True,
profile=False
)
# use silent = False to review the inferred datatypes interactively,
# then assign numeric_features / categorical_features yourself if needed;
# silent = True skips that confirmation.
#
# if sampling = False, 100% of the data is used, which is too slow here.
# if sampling = True, we are prompted to enter a fraction, e.g. 0.3.
"""
Here the data has more than 25k rows, so I have chosen 0.3 (30% of the data)
for modeling.
""";
Setup Succesfully Completed!
Description | Value | |
---|---|---|
0 | session_id | 100 |
1 | Target Type | Binary |
2 | Label Encoded | None |
3 | Original Data | (227845, 31) |
4 | Missing Values | False |
5 | Numeric Features | 30 |
6 | Categorical Features | 0 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (68353, 31) |
11 | Transformed Train Set | (54682, 30) |
12 | Transformed Test Set | (13671, 30) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
compare_models(blacklist = None,fold = 10, round = 4,
sort = 'Accuracy',turbo = True)
pyc.compare_models(sort = 'Recall',fold=5)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|---|
0 | Quadratic Discriminant Analysis | 0.988500 | 0.939300 | 0.860200 | 0.118900 | 0.208500 | 0.206100 |
1 | Linear Discriminant Analysis | 0.999500 | 0.983300 | 0.776000 | 0.921700 | 0.842300 | 0.842000 |
2 | Extreme Gradient Boosting | 0.999500 | 0.973500 | 0.765500 | 0.912000 | 0.830700 | 0.830400 |
3 | Extra Trees Classifier | 0.999500 | 0.933600 | 0.764900 | 0.964700 | 0.849800 | 0.849600 |
4 | Decision Tree Classifier | 0.999300 | 0.872100 | 0.744400 | 0.860500 | 0.792500 | 0.792200 |
5 | CatBoost Classifier | 0.999500 | 0.973100 | 0.744400 | 0.975000 | 0.841900 | 0.841700 |
6 | Random Forest Classifier | 0.999400 | 0.924700 | 0.712300 | 0.944400 | 0.807900 | 0.807600 |
7 | Logistic Regression | 0.999100 | 0.924500 | 0.702300 | 0.769500 | 0.731800 | 0.731400 |
8 | Ada Boost Classifier | 0.999200 | 0.941700 | 0.680700 | 0.843200 | 0.749300 | 0.748900 |
9 | Gradient Boosting Classifier | 0.999100 | 0.862200 | 0.670800 | 0.807000 | 0.730100 | 0.729700 |
10 | Naive Bayes | 0.992100 | 0.960800 | 0.595900 | 0.128100 | 0.210000 | 0.207700 |
11 | Ridge Classifier | 0.998900 | 0.000000 | 0.414600 | 0.890600 | 0.558700 | 0.558200 |
12 | Light Gradient Boosting Machine | 0.995900 | 0.526600 | 0.171300 | 0.164500 | 0.153500 | 0.151800 |
13 | K Neighbors Classifier | 0.998300 | 0.572200 | 0.021100 | 0.400000 | 0.040000 | 0.039900 |
14 | SVM - Linear Kernel | 0.998200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.000200 |
Estimator Abbreviated String Original Implementation
--------- ------------------
Logistic Regression 'lr' linear_model.LogisticRegression
K Nearest Neighbour 'knn' neighbors.KNeighborsClassifier
Naive Bayes 'nb' naive_bayes.GaussianNB
Decision Tree 'dt' tree.DecisionTreeClassifier
SVM (Linear) 'svm' linear_model.SGDClassifier
SVM (RBF) 'rbfsvm' svm.SVC
Gaussian Process 'gpc' gaussian_process.GPC
Multi Layer Perceptron 'mlp' neural_network.MLPClassifier
Ridge Classifier 'ridge' linear_model.RidgeClassifier
Random Forest 'rf' ensemble.RandomForestClassifier
Quadratic Disc. Analysis 'qda' discriminant_analysis.QDA
AdaBoost 'ada' ensemble.AdaBoostClassifier
Gradient Boosting 'gbc' ensemble.GradientBoostingClassifier
Linear Disc. Analysis 'lda' discriminant_analysis.LDA
Extra Trees Classifier 'et' ensemble.ExtraTreesClassifier
Extreme Gradient Boosting 'xgboost' xgboost.readthedocs.io
Light Gradient Boosting 'lightgbm' github.com/microsoft/LightGBM
CatBoost Classifier 'catboost' https://catboost.ai
# pyc.create_model?
# build the xgboost model
xgb = pyc.create_model('xgboost')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9991 | 0.9162 | 0.5000 | 1.0000 | 0.6667 | 0.6663 |
1 | 0.9996 | 0.9855 | 0.9000 | 0.9000 | 0.9000 | 0.8998 |
2 | 0.9993 | 0.9966 | 0.7778 | 0.7778 | 0.7778 | 0.7774 |
3 | 0.9989 | 0.9516 | 0.5556 | 0.7143 | 0.6250 | 0.6245 |
4 | 0.9996 | 0.9449 | 0.7778 | 1.0000 | 0.8750 | 0.8748 |
5 | 0.9993 | 0.9917 | 0.6667 | 0.8571 | 0.7500 | 0.7496 |
6 | 0.9993 | 0.9883 | 0.5556 | 1.0000 | 0.7143 | 0.7139 |
7 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
8 | 0.9998 | 1.0000 | 1.0000 | 0.9091 | 0.9524 | 0.9523 |
9 | 0.9998 | 0.9901 | 0.9000 | 1.0000 | 0.9474 | 0.9473 |
Mean | 0.9995 | 0.9765 | 0.7633 | 0.9158 | 0.8208 | 0.8206 |
SD | 0.0003 | 0.0272 | 0.1774 | 0.0994 | 0.1245 | 0.1246 |
[i for i in dir(xgb) if i[:2] != '__']  # list the trained model's public attributes
type(xgb)
xgboost.sklearn.XGBClassifier
mean_row = 'Mean 0.9994 0.9585 0.7345 0.9102 0.8047 0.8044'
df_eval = compare_new_models('xgb','default',mean_row)
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
cb = pyc.create_model('catboost')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9993 | 0.9201 | 0.6000 | 1.00 | 0.7500 | 0.7497 |
1 | 0.9996 | 0.9964 | 0.8000 | 1.00 | 0.8889 | 0.8887 |
2 | 0.9995 | 0.9871 | 0.8889 | 0.80 | 0.8421 | 0.8418 |
3 | 0.9995 | 0.9782 | 0.6667 | 1.00 | 0.8000 | 0.7997 |
4 | 0.9996 | 0.9437 | 0.7778 | 1.00 | 0.8750 | 0.8748 |
5 | 0.9995 | 0.9702 | 0.6667 | 1.00 | 0.8000 | 0.7997 |
6 | 0.9996 | 0.9767 | 0.7778 | 1.00 | 0.8750 | 0.8748 |
7 | 1.0000 | 1.0000 | 1.0000 | 1.00 | 1.0000 | 1.0000 |
8 | 1.0000 | 1.0000 | 1.0000 | 1.00 | 1.0000 | 1.0000 |
9 | 0.9998 | 0.9977 | 0.9000 | 1.00 | 0.9474 | 0.9473 |
Mean | 0.9996 | 0.9770 | 0.8078 | 0.98 | 0.8778 | 0.8777 |
SD | 0.0002 | 0.0252 | 0.1318 | 0.06 | 0.0803 | 0.0805 |
mean_row = 'Mean 0.9995 0.9554 0.7345 0.9548 0.8215 0.8212'
df_eval = compare_new_models('cb','default',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
1 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
lda = pyc.create_model('lda')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9995 | 0.9739 | 0.8000 | 0.8889 | 0.8421 | 0.8418 |
1 | 0.9996 | 0.9962 | 0.8000 | 1.0000 | 0.8889 | 0.8887 |
2 | 0.9993 | 0.9873 | 0.7778 | 0.7778 | 0.7778 | 0.7774 |
3 | 0.9993 | 0.9926 | 0.5556 | 1.0000 | 0.7143 | 0.7139 |
4 | 0.9996 | 0.9612 | 0.7778 | 1.0000 | 0.8750 | 0.8748 |
5 | 0.9991 | 0.9772 | 0.6667 | 0.7500 | 0.7059 | 0.7054 |
6 | 0.9991 | 0.9298 | 0.5556 | 0.8333 | 0.6667 | 0.6662 |
7 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
9 | 0.9996 | 0.9984 | 0.8000 | 1.0000 | 0.8889 | 0.8887 |
Mean | 0.9995 | 0.9817 | 0.7733 | 0.9250 | 0.8359 | 0.8357 |
SD | 0.0003 | 0.0212 | 0.1453 | 0.0979 | 0.1117 | 0.1118 |
mean_row = 'Mean 0.9992 0.9677 0.7255 0.8340 0.7661 0.7657'
df_eval = compare_new_models('lda','default',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
1 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
2 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
tune_model(estimator=None, fold=10, round=4, n_iter=10, optimize='Accuracy', ensemble=False, method=None, verbose=True)
n_iter: integer, default = 10
Number of iterations within the Random Grid Search. For every iteration,
the model randomly selects one value from the pre-defined grid of hyperparameters.
ensemble: Boolean, default = False
True enables ensembling of the model through the method defined in the 'method' param.
method: String, 'Bagging' or 'Boosting', default = None
method comes into effect only when ensemble = True. Default is set to None.
# pyc.tune_model?
xgb_tuned = pyc.tune_model('xgboost',fold=5,optimize='Recall')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9994 | 0.9067 | 0.6842 | 0.9286 | 0.7879 | 0.7876 |
1 | 0.9993 | 0.9354 | 0.6842 | 0.8667 | 0.7647 | 0.7643 |
2 | 0.9994 | 0.9431 | 0.7222 | 0.8667 | 0.7879 | 0.7876 |
3 | 0.9996 | 0.9888 | 0.7895 | 1.0000 | 0.8824 | 0.8822 |
4 | 0.9996 | 0.9826 | 0.7895 | 1.0000 | 0.8824 | 0.8822 |
Mean | 0.9995 | 0.9513 | 0.7339 | 0.9324 | 0.8210 | 0.8208 |
SD | 0.0002 | 0.0306 | 0.0474 | 0.0597 | 0.0508 | 0.0509 |
mean_row = 'Mean 0.9992 0.9677 0.7255 0.8340 0.7661 0.7657'
df_eval = compare_new_models('xgb_tuned','tuned',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
1 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
2 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
3 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
cb_tuned = pyc.tune_model('catboost',fold=5,optimize='Recall')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9995 | 0.9552 | 0.6842 | 1.0000 | 0.8125 | 0.8122 |
1 | 0.9994 | 0.9659 | 0.7895 | 0.8333 | 0.8108 | 0.8105 |
2 | 0.9995 | 0.9280 | 0.7222 | 1.0000 | 0.8387 | 0.8385 |
3 | 0.9998 | 0.9827 | 0.8947 | 1.0000 | 0.9444 | 0.9444 |
4 | 0.9997 | 0.9978 | 0.8421 | 1.0000 | 0.9143 | 0.9141 |
Mean | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
SD | 0.0002 | 0.0239 | 0.0767 | 0.0667 | 0.0550 | 0.0551 |
mean_row = 'Mean 0.9996 0.9659 0.7865 0.9667 0.8642 0.8639'
df_eval = compare_new_models('cb_tuned','fold=5',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
1 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
2 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
3 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
4 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
lda_tuned = pyc.tune_model('lda',fold=5,optimize='Recall')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9995 | 0.9835 | 0.7895 | 0.9375 | 0.8571 | 0.8569 |
1 | 0.9993 | 0.9906 | 0.6842 | 0.8667 | 0.7647 | 0.7643 |
2 | 0.9994 | 0.9700 | 0.7222 | 0.8667 | 0.7879 | 0.7876 |
3 | 0.9995 | 0.9733 | 0.7895 | 0.9375 | 0.8571 | 0.8569 |
4 | 0.9998 | 0.9991 | 0.8947 | 1.0000 | 0.9444 | 0.9444 |
Mean | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
SD | 0.0002 | 0.0108 | 0.0718 | 0.0504 | 0.0630 | 0.0631 |
mean_row = 'Mean 0.9995 0.9833 0.7760 0.9217 0.8423 0.8420'
df_eval = compare_new_models('lda_tuned','fold=5',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
1 | lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
2 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
3 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
4 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
5 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
Estimator: Linear Disc. Analysis
Abbreviation: 'lda'
Scikit-learn: discriminant_analysis.LDA
LinearDiscriminantAnalysis(n_components=None,
priors=None,
shrinkage=None,
solver='svd',
store_covariance=False,
tol=0.0001)
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))  # prints [1]
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda_sk = LinearDiscriminantAnalysis()  # new name so the pycaret 'lda' model above is not overwritten
# lda_sk?
# tune lda
lda_tuned = pyc.tune_model('lda',n_iter=100,fold=10,optimize='Recall')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9995 | 0.9739 | 0.8000 | 0.8889 | 0.8421 | 0.8418 |
1 | 0.9996 | 0.9962 | 0.8000 | 1.0000 | 0.8889 | 0.8887 |
2 | 0.9993 | 0.9873 | 0.7778 | 0.7778 | 0.7778 | 0.7774 |
3 | 0.9993 | 0.9926 | 0.5556 | 1.0000 | 0.7143 | 0.7139 |
4 | 0.9996 | 0.9612 | 0.7778 | 1.0000 | 0.8750 | 0.8748 |
5 | 0.9991 | 0.9772 | 0.6667 | 0.7500 | 0.7059 | 0.7054 |
6 | 0.9991 | 0.9298 | 0.5556 | 0.8333 | 0.6667 | 0.6662 |
7 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
9 | 0.9996 | 0.9984 | 0.8000 | 1.0000 | 0.8889 | 0.8887 |
Mean | 0.9995 | 0.9817 | 0.7733 | 0.9250 | 0.8359 | 0.8357 |
SD | 0.0003 | 0.0212 | 0.1453 | 0.0979 | 0.1117 | 0.1118 |
mean_row = 'Mean 0.9992 0.9677 0.7255 0.8340 0.7661 0.7657'
df_eval = compare_new_models('lda_tuned','n_iter=100,fold=10',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
1 | lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
2 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
3 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
4 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
5 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
6 | lda_tuned | n_iter=100,fold=10 | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
plot_model(estimator, plot='auc')
Name Abbreviated String
--------- ------------------
Area Under the Curve 'auc'
Discrimination Threshold 'threshold'
Precision Recall Curve 'pr'
Confusion Matrix 'confusion_matrix'
Class Prediction Error 'error'
Classification Report 'class_report'
Decision Boundary 'boundary'
Recursive Feat. Selection 'rfe'
Learning Curve 'learning'
Manifold Learning 'manifold'
Calibration Curve 'calibration'
Validation Curve 'vc'
Dimension Learning 'dimension'
Feature Importance 'feature'
Model Hyperparameter 'parameter'
# pyc.plot_model?
# AUC-ROC plot
pyc.plot_model(lda, plot = 'auc')
# confusion matrix
# pyc.plot_model(lda, plot = 'confusion_matrix')
# evaluate model
pyc.evaluate_model(lda)
Parameters | |
---|---|
n_components | None |
priors | None |
shrinkage | None |
solver | svd |
store_covariance | False |
tol | 0.0001 |
ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, verbose=True)
method: 'Bagging' or 'Boosting', default = 'Bagging'
# pyc.ensemble_model?
dt = pyc.create_model('dt')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9993 | 0.8000 | 0.6000 | 1.0000 | 0.7500 | 0.7497 |
1 | 0.9991 | 0.9496 | 0.9000 | 0.6923 | 0.7826 | 0.7822 |
2 | 0.9987 | 0.9439 | 0.8889 | 0.5714 | 0.6957 | 0.6950 |
3 | 0.9987 | 0.7775 | 0.5556 | 0.6250 | 0.5882 | 0.5876 |
4 | 0.9991 | 0.8332 | 0.6667 | 0.7500 | 0.7059 | 0.7054 |
5 | 0.9989 | 0.8331 | 0.6667 | 0.6667 | 0.6667 | 0.6661 |
6 | 0.9987 | 0.7775 | 0.5556 | 0.6250 | 0.5882 | 0.5876 |
7 | 0.9996 | 0.9998 | 1.0000 | 0.8182 | 0.9000 | 0.8998 |
8 | 0.9996 | 0.9499 | 0.9000 | 0.9000 | 0.9000 | 0.8998 |
9 | 0.9995 | 0.9498 | 0.9000 | 0.8182 | 0.8571 | 0.8569 |
Mean | 0.9991 | 0.8814 | 0.7633 | 0.7467 | 0.7434 | 0.7430 |
SD | 0.0003 | 0.0805 | 0.1611 | 0.1295 | 0.1101 | 0.1103 |
dt_bagged = pyc.ensemble_model(dt, n_estimators=50,method='Bagging')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9993 | 0.8997 | 0.6000 | 1.0000 | 0.7500 | 0.7497 |
1 | 0.9995 | 0.9497 | 0.9000 | 0.8182 | 0.8571 | 0.8569 |
2 | 0.9995 | 0.9994 | 0.8889 | 0.8000 | 0.8421 | 0.8418 |
3 | 0.9987 | 0.8326 | 0.5556 | 0.6250 | 0.5882 | 0.5876 |
4 | 0.9995 | 0.8885 | 0.7778 | 0.8750 | 0.8235 | 0.8233 |
5 | 0.9996 | 0.8881 | 0.7778 | 1.0000 | 0.8750 | 0.8748 |
6 | 0.9989 | 0.9437 | 0.4444 | 0.8000 | 0.5714 | 0.5709 |
7 | 0.9998 | 1.0000 | 1.0000 | 0.9000 | 0.9474 | 0.9473 |
8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
9 | 0.9996 | 0.9498 | 0.9000 | 0.9000 | 0.9000 | 0.8998 |
Mean | 0.9994 | 0.9352 | 0.7844 | 0.8718 | 0.8155 | 0.8152 |
SD | 0.0004 | 0.0540 | 0.1824 | 0.1118 | 0.1342 | 0.1344 |
# dt_bagged_tuned = pyc.tune_model('dt', ensemble=True,
# method='Bagging',fold=3, n_iter=10, optimize='Recall')
# dt_boosted_tuned = pyc.tune_model('dt', ensemble=True, method='Boosting')
blend_models(estimator_list='All', fold=10, round=4, method='hard', turbo=True, verbose=True)
# pyc.blend_models?
# blend_soft = pyc.blend_models(estimator_list = [dt, xgb,lda], method = 'soft')
# blend_hard = pyc.blend_models(estimator_list = [dt, xgb,lda], method = 'hard')
Stacking is another popular technique for ensembling but is less commonly implemented due to practical difficulties. Stacking is an ensemble learning technique that combines multiple models via a meta-model. Another way to think about stacking is that multiple models are trained to predict the outcome and a meta-model is created that uses the predictions from those models as an input along with the original features.
Selecting which method and models to use in stacking depends on the statistical properties of the dataset. Experimenting with different models and methods is the best way to find out which configuration works best. However, as a general rule of thumb, models with strong yet diverse performance tend to improve results when used in stacking. One way to measure diversity is the correlation of predictions between models, which you can analyze using the plot parameter of stack_models; a plain scikit-learn sketch of stacking follows the examples below.
stack_models(estimator_list, meta_model=None, fold=10, round=4, method='soft', restack=True, plot=False, finalize=False, verbose=True)
# pyc.stack_models?
# stack_soft = pyc.stack_models(estimator_list = [dt, xgb,lda], method = 'soft')
# stack_soft2 = pyc.stack_models(estimator_list = [xgb, lda],
# method = 'soft',
# meta_model=dt)
# stack_hard = pyc.stack_models(estimator_list = [dt, xgb,lda], method = 'hard')
# stack_soft_plot = pyc.stack_models([dt,xgb,lda], plot=True)
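For intuition about what stack_models does under the hood, here is a minimal scikit-learn sketch (StackingClassifier is available since scikit-learn 0.22, the version pinned by pycaret 1.0); the choice of base and meta models is illustrative, not the notebook's actual configuration.

# sketch: stack a decision tree and a logistic regression under an LDA
# meta-model; passthrough=True also feeds the original features to the
# meta-model, similar in spirit to pycaret's restack=True
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier()),
                ('lr', LogisticRegression(max_iter=1000))],
    final_estimator=LinearDiscriminantAnalysis(),
    passthrough=True)
stack.fit(df_train[features], df_train[target])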
When performing classification, you often want not only to predict the class label (an outcome such as 0 or 1) but also to obtain the probability of the respective outcome, which provides a level of confidence in the prediction. Some models give poor estimates of the class probabilities, and some do not support probability prediction at all. Well-calibrated classifiers are probabilistic and provide outputs in the form of probabilities that can be directly interpreted as a confidence level. PyCaret allows you to calibrate the probabilities of a given model through the calibrate_model() function.
calibrate_model(estimator, method='sigmoid', fold=10, round=4, verbose=True)
method : string, default = 'sigmoid'
The method to use for calibration. Can be 'sigmoid', which corresponds to Platt's
method, or 'isotonic', which is a non-parametric approach. It is not advised to use
isotonic calibration with too few calibration samples.
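In this pycaret version, calibrate_model wraps scikit-learn's CalibratedClassifierCV (the saved experiment later lists CalibratedClassifierCV objects). A minimal standalone sketch of sigmoid (Platt) calibration, assuming the df_train/df_test split defined above:

# sketch: 5-fold sigmoid (Platt) calibration of an LDA classifier,
# analogous to pyc.calibrate_model(lda, fold=5, method='sigmoid')
from sklearn.calibration import CalibratedClassifierCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

base = LinearDiscriminantAnalysis()
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(df_train[features], df_train[target])
proba_fraud = calibrated.predict_proba(df_test[features])[:, 1]  # calibrated P(fraud)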
# pyc.calibrate_model?
pyc.plot_model(lda, plot='calibration')
# note: this calibrates the default lda model (not lda_tuned)
lda_tuned_calibrated = pyc.calibrate_model(lda, fold=5,
                                           method='sigmoid')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9992 | 0.9839 | 0.5789 | 0.9167 | 0.7097 | 0.7093 |
1 | 0.9989 | 0.9868 | 0.4211 | 0.8889 | 0.5714 | 0.5709 |
2 | 0.9992 | 0.9697 | 0.6111 | 0.8462 | 0.7097 | 0.7093 |
3 | 0.9991 | 0.9731 | 0.5263 | 0.9091 | 0.6667 | 0.6662 |
4 | 0.9994 | 0.9991 | 0.6316 | 1.0000 | 0.7742 | 0.7739 |
Mean | 0.9991 | 0.9825 | 0.5538 | 0.9122 | 0.6863 | 0.6859 |
SD | 0.0001 | 0.0105 | 0.0753 | 0.0503 | 0.0669 | 0.0670 |
mean_row = 'Mean 0.9991 0.9825 0.5538 0.9122 0.6863 0.6859'
df_eval = compare_new_models('lda_tuned_calibrated','fold=5',mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
1 | lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
2 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
3 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
4 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
5 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
6 | lda_tuned | n_iter=100,fold=10 | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
7 | lda_tuned_calibrated | fold=5 | 0.9991 | 0.9825 | 0.5538 | 0.9122 | 0.6863 | 0.6859 |
lda_tuned_calibrated_iso = pyc.calibrate_model(lda_tuned,
fold=5, method='isotonic')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | |
---|---|---|---|---|---|---|
0 | 0.9995 | 0.9764 | 0.7895 | 0.9375 | 0.8571 | 0.8569 |
1 | 0.9991 | 0.9920 | 0.7368 | 0.7368 | 0.7368 | 0.7364 |
2 | 0.9995 | 0.9616 | 0.7778 | 0.8750 | 0.8235 | 0.8233 |
3 | 0.9995 | 0.9645 | 0.7895 | 0.9375 | 0.8571 | 0.8569 |
4 | 0.9999 | 0.9995 | 0.9474 | 1.0000 | 0.9730 | 0.9729 |
Mean | 0.9995 | 0.9788 | 0.8082 | 0.8974 | 0.8495 | 0.8493 |
SD | 0.0003 | 0.0149 | 0.0722 | 0.0895 | 0.0758 | 0.0759 |
mean_row = 'Mean 0.9995 0.9788 0.8082 0.8974 0.8495 0.8493'
df_eval = compare_new_models('lda_tuned_calibrated_iso','fold=5',
mean_row,df_eval=df_eval,sort='Recall')
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | lda_tuned_calibrated_iso | fold=5 | 0.9995 | 0.9788 | 0.8082 | 0.8974 | 0.8495 | 0.8493 |
1 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
2 | lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
3 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
4 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
5 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
6 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
7 | lda_tuned | n_iter=100,fold=10 | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
8 | lda_tuned_calibrated | fold=5 | 0.9991 | 0.9825 | 0.5538 | 0.9122 | 0.6863 | 0.6859 |
pyc.plot_model(lda_tuned_calibrated, plot='calibration')
pyc.plot_model(lda_tuned_calibrated_iso, plot='calibration')
interpret_model(estimator, plot='summary', feature=None, observation=None)
# pyc.interpret_model?
df_eval
Model | Description | Accuracy | AUC | Recall | Precision | F1 | Kappa | |
---|---|---|---|---|---|---|---|---|
0 | lda_tuned_calibrated_iso | fold=5 | 0.9995 | 0.9788 | 0.8082 | 0.8974 | 0.8495 | 0.8493 |
1 | cb_tuned | fold=5 | 0.9996 | 0.9659 | 0.7865 | 0.9667 | 0.8642 | 0.8639 |
2 | lda_tuned | fold=5 | 0.9995 | 0.9833 | 0.7760 | 0.9217 | 0.8423 | 0.8420 |
3 | xgb | default | 0.9994 | 0.9585 | 0.7345 | 0.9102 | 0.8047 | 0.8044 |
4 | cb | default | 0.9995 | 0.9554 | 0.7345 | 0.9548 | 0.8215 | 0.8212 |
5 | lda | default | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
6 | xgb_tuned | tuned | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
7 | lda_tuned | n_iter=100,fold=10 | 0.9992 | 0.9677 | 0.7255 | 0.8340 | 0.7661 | 0.7657 |
8 | lda_tuned_calibrated | fold=5 | 0.9991 | 0.9825 | 0.5538 | 0.9122 | 0.6863 | 0.6859 |
# interpret_model: SHAP
pyc.interpret_model(xgb)
# interpret model : Correlation
pyc.interpret_model(xgb,plot='correlation')
predict_model(estimator, data=None, probability_threshold=None,
platform=None, authentication=None)
# pyc.predict_model?
df_test.iloc[:5,-5:]
V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|
248750 | -0.775158 | 0.261012 | 0.058359 | 18.70 | 0 |
161573 | -0.235203 | -0.036910 | -0.227111 | 9.99 | 0 |
65893 | 0.011709 | 0.029830 | -0.080522 | 112.00 | 0 |
12836 | 0.974426 | -0.067625 | 0.007633 | 23.27 | 0 |
132224 | -0.252303 | 0.009928 | 0.015153 | 74.97 | 0 |
df_preds = pyc.predict_model(lda_tuned_calibrated,df_test)
df_preds.iloc[:5,-5:]
V28 | Amount | Class | Label | Score | |
---|---|---|---|---|---|
0 | 0.058359 | 18.70 | 0 | 0 | 0.0008 |
1 | -0.227111 | 9.99 | 0 | 0 | 0.0006 |
2 | -0.080522 | 112.00 | 0 | 0 | 0.0006 |
3 | 0.007633 | 23.27 | 0 | 0 | 0.0006 |
4 | 0.015153 | 74.97 | 0 | 0 | 0.0006 |
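Since recall on the fraud class is the priority here, the probability_threshold argument of predict_model (see the signature above) can shift the precision/recall trade-off; a sketch, where 0.3 is an arbitrary illustrative threshold rather than a tuned value:

# sketch: classify as fraud whenever Score >= 0.3 instead of the default 0.5;
# lowering the threshold trades precision for recall
df_preds_lowthr = pyc.predict_model(lda_tuned_calibrated, data=df_test,
                                    probability_threshold=0.3)
df_preds_lowthr[['Class', 'Label', 'Score']].head()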
if 'google.colab' in sys.modules:
h = ''
else:
h = "../models/"
# pyc.save_model?
# save the model (note: this saves lda_tuned under the 'calibrated' filename)
pyc.save_model(lda_tuned, h + 'lda_tuned_calibrated.pkl')
Transformation Pipeline and Model Succesfully Saved
# load model
lda_tuned_calibrated = pyc.load_model(model_name= h+ 'lda_tuned_calibrated.pkl')
Transformation Pipeline and Model Sucessfully Loaded
# save entire experiment
pyc.save_experiment( h + "experiment_may31_2020")
Experiment Succesfully Saved
saved_experiment = pyc.load_experiment( h+ 'experiment_may31_2020')
Object | |
---|---|
0 | Classification Setup Config |
1 | X_training Set |
2 | y_training Set |
3 | X_test Set |
4 | y_test Set |
5 | Transformation Pipeline |
6 | Compare Models Score Grid |
7 | Extreme Gradient Boosting |
8 | Extreme Gradient Boosting Score Grid |
9 | CatBoost Classifier |
10 | CatBoost Classifier Score Grid |
11 | Linear Discriminant Analysis |
12 | Linear Discriminant Analysis Score Grid |
13 | Tuned XGBClassifier |
14 | Tuned XGBClassifier Score Grid |
15 | Tuned <catboost.core.CatBoostClassifier object... |
16 | Tuned <catboost.core.CatBoostClassifier object... |
17 | Tuned LinearDiscriminantAnalysis |
18 | Tuned LinearDiscriminantAnalysis Score Grid |
19 | Tuned LinearDiscriminantAnalysis |
20 | Tuned LinearDiscriminantAnalysis Score Grid |
21 | Decision Tree |
22 | Decision Tree Score Grid |
23 | BaggingClassifier |
24 | BaggingClassifier Score Grid |
25 | CalibratedClassifierCV |
26 | CalibratedClassifierCV Score Grid |
27 | CalibratedClassifierCV |
28 | CalibratedClassifierCV Score Grid |
type(saved_experiment[0])
pandas.core.frame.DataFrame
final_lda = pyc.finalize_model(lda_tuned)
print(final_lda)
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None, solver='lsqr', store_covariance=False, tol=0.0001)
df_test.iloc[:5,-5:]
V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|
248750 | -0.775158 | 0.261012 | 0.058359 | 18.70 | 0 |
161573 | -0.235203 | -0.036910 | -0.227111 | 9.99 | 0 |
65893 | 0.011709 | 0.029830 | -0.080522 | 112.00 | 0 |
12836 | 0.974426 | -0.067625 | 0.007633 | 23.27 | 0 |
132224 | -0.252303 | 0.009928 | 0.015153 | 74.97 | 0 |
df_preds = pyc.predict_model(final_lda,df_test)
df_preds.head()
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | Label | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 154078.0 | 0.046622 | 1.529678 | -0.453615 | 1.282569 | 1.110333 | -0.882716 | 1.046420 | -0.117121 | -0.679897 | -0.923709 | 0.371519 | -0.000047 | 0.512255 | -2.091762 | 0.786796 | 0.159652 | 1.706939 | 0.458922 | 0.037665 | 0.240559 | -0.338472 | -0.839547 | 0.066527 | 0.836447 | 0.076790 | -0.775158 | 0.261012 | 0.058359 | 18.70 | 0 | 0 | 0.0 |
1 | 114332.0 | 0.145870 | 0.107484 | 0.755127 | -0.995936 | 1.159107 | 2.113961 | 0.036200 | 0.471777 | 0.627622 | -0.598398 | 0.713816 | 1.091294 | 0.663878 | -0.448057 | 0.146422 | -0.445603 | -0.462439 | -0.373996 | -0.966334 | -0.107332 | 0.297644 | 1.285809 | -0.140560 | -0.910706 | -0.449729 | -0.235203 | -0.036910 | -0.227111 | 9.99 | 0 | 0 | 0.0 |
2 | 51793.0 | -1.434413 | -0.469604 | 1.816518 | 0.650913 | -0.569395 | 0.851560 | -0.796770 | 0.760209 | -1.369018 | 0.086289 | -1.614272 | -0.210000 | 0.959659 | -0.169720 | 1.110324 | -1.636073 | 0.381255 | 1.726039 | -0.137539 | 0.032530 | -0.033991 | 0.017976 | -0.062151 | -0.769157 | 0.291469 | 0.011709 | 0.029830 | -0.080522 | 112.00 | 0 | 0 | 0.0 |
3 | 22542.0 | 1.216532 | -0.314522 | 1.134570 | 0.302071 | -1.047467 | -0.226341 | -0.808963 | 0.011571 | 2.484110 | -0.749128 | -0.113215 | -2.463177 | 1.217232 | 1.078202 | -0.353184 | 0.264467 | 0.560170 | 0.070299 | 0.162726 | -0.101807 | -0.289677 | -0.451358 | 0.021372 | 0.025676 | 0.112433 | 0.974426 | -0.067625 | 0.007633 | 23.27 | 0 | 0 | 0.0 |
4 | 79909.0 | 1.033697 | -0.059268 | 0.169109 | 1.067405 | -0.093840 | 0.106697 | -0.037664 | 0.151611 | -0.096594 | 0.190225 | 1.021172 | 0.253360 | -1.091876 | 0.767247 | 0.641822 | 0.210099 | -0.499796 | 0.233142 | -0.435061 | -0.077871 | 0.158194 | 0.292092 | -0.181584 | -0.318182 | 0.551666 | -0.252303 | 0.009928 | 0.015153 | 74.97 | 0 | 0 | 0.0 |
df_preds['Class'].value_counts()
0    56864
1       98
Name: Class, dtype: int64
ytest = df_preds['Class'].to_numpy().ravel()
ypreds = df_preds['Label'].to_numpy().ravel()
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: ', accuracy_score(ytest,ypreds))
print('Precision: ', precision_score(ytest,ypreds))
print('Recall: ', recall_score(ytest,ypreds))
print('F1-score: ', f1_score(ytest,ypreds))
Accuracy:  0.9992977774656788
Precision:  0.8452380952380952
Recall:  0.7244897959183674
F1-score:  0.7802197802197802
# confusion matrix and classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(classification_report(ytest,ypreds))
# look at the recall for class 1; ideally it should be close to 1
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.85      0.72      0.78        98

    accuracy                           1.00     56962
   macro avg       0.92      0.86      0.89     56962
weighted avg       1.00      1.00      1.00     56962
confusion_matrix(ytest,ypreds)
array([[56851,    13],
       [   27,    71]])
df_preds['Class'].value_counts(normalize=True)
0    0.99828
1    0.00172
Name: Class, dtype: float64
"""
There are total 98 frauds in test dataset (20% of full data with seed=100)
Out of which only 50 are correctly classified.
Which is 51% case. We may think that it is extremely bad classifier (like 50/50).
But, when we look at class distribution 99.8% are not frauds, the dumbest classifier
will classify everything as non-frauds.
""";
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
'{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 51 min 20 secs