Table of Contents

Introduction

When the data fits comfortably in the local computer's RAM we should not use Dask; we should use pandas instead. But when the data is larger than RAM (e.g. > 16 GB) we can use Dask. Pandas may need up to 10 times as much RAM as the on-disk data size, so even a 2 GB file can crash pandas on a 16 GB machine, and we may need to use Dask.
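
A minimal sketch of the difference, assuming a hypothetical transactions.csv file: pandas reads the whole file into memory at once, while Dask splits it into partitions and reads them lazily.

```python
import pandas as pd
import dask.dataframe as dd

# pandas loads the whole file into RAM at once; a multi-GB file may not fit.
# df = pd.read_csv("transactions.csv")

# Dask splits the same file into many pandas-sized partitions and reads them lazily.
ddf = dd.read_csv("transactions.csv", blocksize="64MB")
print(ddf.npartitions)   # how many chunks the file was split into
print(ddf.head())        # only the first partition is actually read here
```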

Dask does not have its own data type; it uses pandas under the hood (a Dask DataFrame is a collection of pandas DataFrames). However, many pandas operations are not available in Dask. Dask distributes the tasks among workers, makes the operations lazy, and uses a DAG (directed acyclic graph) to carry out the computations.
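
A minimal sketch of this, using a tiny in-memory table: the Dask DataFrame below is just two pandas DataFrames, and nothing is computed until .compute() is called.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"amount": [10.0, 250.0, 3.5, 99.9], "is_fraud": [0, 1, 0, 0]})
ddf = dd.from_pandas(pdf, npartitions=2)   # two pandas DataFrames under the hood

lazy_mean = ddf["amount"].mean()   # only builds a task graph, nothing computed yet
print(type(lazy_mean))             # a lazy Dask scalar, not a number
print(lazy_mean.compute())         # the DAG is executed here -> 90.85
```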

If the data reaches hundreds of GB, we need PySpark rather than Dask; Dask may struggle at that scale (~100 GB and beyond), whereas Spark can be used for data of essentially any size. But Spark is written in Scala, not Python. The Python wrapper PySpark, and the Koalas module, provide some of Spark's functionality with Python syntax, but not all of it. When the code fails we are shown long Java (JVM) stack traces, and they are genuinely difficult to debug.
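
For comparison, a minimal PySpark sketch, assuming a local Spark installation and the same hypothetical transactions.csv with an is_fraud column; any failure here surfaces as a Java (JVM) stack trace.

```python
from pyspark.sql import SparkSession

# Hypothetical file and column names, for illustration only.
spark = SparkSession.builder.appName("fraud-detection").getOrCreate()
sdf = spark.read.csv("transactions.csv", header=True, inferSchema=True)
sdf.groupBy("is_fraud").count().show()   # errors here appear as JVM stack traces
spark.stop()
```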

Business Problem

Task       : Detect fraudulent activities.
Metric     : Recall
Question   : How many of the actual frauds are correctly classified? (a short recall sketch follows below)
Method used: Dask (which scales to big data, ~1 billion rows)
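
A short sketch of the recall metric, using toy labels and scikit-learn purely for illustration: recall is the fraction of actual frauds that the model correctly flags.

```python
from sklearn.metrics import recall_score

y_true = [0, 1, 1, 0, 1, 0, 1]   # 4 actual frauds
y_pred = [0, 1, 0, 0, 1, 1, 1]   # 3 of the 4 frauds correctly flagged
print(recall_score(y_true, y_pred))   # 3 / 4 = 0.75
```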

In this notebook I use the big-data analysis tool called Dask.

Dask utilizes multiple cores and performs distributed operations. It uses lazy evaluation, meaning computation happens only when necessary, and it builds a DAG of tasks to perform the operations.
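
A minimal sketch of this lazy, DAG-based execution using dask.delayed (the function names are made up for illustration): the calls only build a task graph, and the graph runs when .compute() is called.

```python
import dask

@dask.delayed
def double(x):
    return 2 * x

@dask.delayed
def add(a, b):
    return a + b

graph = add(double(3), double(4))   # nothing has run yet, just a task graph
print(graph.compute())              # the DAG executes here -> 14
# graph.visualize("dag.png")        # optional: renders the DAG (needs graphviz)
```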

Imports

Useful Scripts

Dask API

Load the data

Data Processing

Train-test Split

Modelling: dask_ml xgboost

Model Evaluation

Improve the xgboost model