Description

In this project, we use the data from the Kaggle competition Toxic Comment Classification Challenge by Jigsaw, and only its training data. We split this raw training data into train and test sets and evaluate model performance on the test set.
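The split described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the function name `train_test_split_indices`, the 20% test fraction, and the seed are assumptions chosen for the example.

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) as disjoint, shuffled row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # random order of 0..n_rows-1
    n_test = int(round(n_rows * test_fraction))  # size of the held-out set
    return shuffled[n_test:], shuffled[:n_test]

# Toy usage: 10 rows, 20% held out for testing (assumed fraction).
train_idx, test_idx = train_test_split_indices(10, test_fraction=0.2)
```

The two index arrays can then be used to select rows from the raw training data, so that the test rows are never seen during fitting.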

The dataset is taken from Wikipedia edit text, and each comment is labelled for the following six categories:

  1. toxic
  2. severe_toxic
  3. obscene
  4. threat
  5. insult
  6. identity_hate

This is a multi-label (not multi-class) classification problem. Each text row has six binary labels, and any number of them can be 1: a row may have no positive label, exactly one, or several at once.
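To make the multi-label target concrete, here is a small sketch of how the label matrix looks. The three rows are invented toy values for illustration, not rows from the dataset.

```python
import numpy as np

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Each row is one comment, each column one binary label (toy values).
y = np.array([
    [0, 0, 0, 0, 0, 0],  # clean comment: no label set
    [1, 0, 1, 0, 1, 0],  # toxic, obscene and insult at the same time
    [1, 0, 0, 0, 0, 0],  # toxic only
])

# Number of positive labels per comment; in a multi-class problem
# this would always be exactly 1.
labels_per_row = y.sum(axis=1)
```

Because the labels are independent, a model for this task typically uses one sigmoid output per label rather than a single softmax over all six.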

Load the Libraries

Useful Functions

GPU Testing

Load Training Data

Data Processing: Training Data

Shuffle data

Lowercase

Tokenize using Bert Client

Get the data

Data Preparation for Modelling

Modelling: Keras

Focal loss

Multilabel focal loss
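As a reference for what a multilabel focal loss computes, here is a NumPy sketch. It is not the Keras loss used in this notebook; it applies the standard binary focal loss, -alpha_t * (1 - p_t)^gamma * log(p_t), independently to each of the labels and averages the result. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here.

```python
import numpy as np

def multilabel_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over all rows and labels (NumPy sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log() never receives 0.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma down-weights examples that are already easy.
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()

# Toy check: confident correct predictions cost less than confident wrong ones.
y_toy = np.array([[1, 0, 0], [0, 1, 0]])
good = multilabel_focal_loss(y_toy, np.array([[0.9, 0.1, 0.1], [0.1, 0.9, 0.1]]))
bad = multilabel_focal_loss(y_toy, np.array([[0.1, 0.9, 0.9], [0.9, 0.1, 0.9]]))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check for any framework-specific implementation.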

Modelling: Keras

Model Evaluation

Confusion Matrix
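For a multi-label problem, a confusion matrix is usually reported per label as a 2x2 table. The sketch below is an assumed NumPy helper, not the notebook's own code; the name `per_label_confusion` is hypothetical, and the layout [[TN, FP], [FN, TP]] follows the convention of scikit-learn's `multilabel_confusion_matrix`.

```python
import numpy as np

def per_label_confusion(y_true, y_pred):
    """Return an array of shape (n_labels, 2, 2) laid out as [[TN, FP], [FN, TP]]."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tn = (~y_true & ~y_pred).sum(axis=0)  # correctly predicted 0
    fp = (~y_true & y_pred).sum(axis=0)   # predicted 1, truth 0
    fn = (y_true & ~y_pred).sum(axis=0)   # predicted 0, truth 1
    tp = (y_true & y_pred).sum(axis=0)    # correctly predicted 1
    return np.stack([tn, fp, fn, tp], axis=1).reshape(-1, 2, 2)

# Toy example with two labels and three rows (invented values).
y_true_toy = [[1, 0], [0, 1], [1, 1]]
y_pred_toy = [[1, 0], [1, 1], [0, 1]]
cm = per_label_confusion(y_true_toy, y_pred_toy)
```

Predicted probabilities need to be thresholded (for example at 0.5) into 0/1 values before they are passed to such a helper.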

Classification Report

Co-occurrence Matrix
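A label co-occurrence matrix counts how often two labels are set on the same comment. A minimal sketch, assuming the labels are held in a binary NumPy array (the toy rows and the three-label subset below are invented for illustration):

```python
import numpy as np

labels_toy = ["toxic", "severe_toxic", "obscene"]

# Rows are comments, columns are binary labels (toy values).
y_labels = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])

# Entry (i, j) counts the rows in which labels i and j are both 1.
# The diagonal holds the total count of each individual label.
cooc = y_labels.T @ y_labels
```

Here `cooc[0, 2]` is the number of comments that are both toxic and obscene, and the matrix is symmetric by construction.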