Description

In this project, we use only the training data from the Kaggle competition Toxic Comment Classification Challenge by Jigsaw. We split this raw training data into train and test sets and evaluate model performance on the test set.

The dataset is taken from Wikipedia edit text, and each comment is labeled with zero or more of the following:

  1. toxic
  2. severe_toxic
  3. obscene
  4. threat
  5. insult
  6. identity_hate

This is a multi-label (not multiclass) classification problem. Each text row has six binary labels, and each label is independently 0 or 1, so a row can have no label, one label, or several labels set to 1.
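
To make the multi-label setup concrete, here is a minimal sketch of how label rows can look. The three rows are hypothetical values invented for illustration, not rows taken from the dataset, and `active_labels` is a helper name assumed here.

```python
# The six label names listed above.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical label rows, invented for illustration only.
rows = [
    [0, 0, 0, 0, 0, 0],  # clean comment: no label is set
    [1, 0, 1, 0, 1, 0],  # toxic, obscene and insult at the same time
    [1, 0, 0, 0, 0, 0],  # toxic only
]

def active_labels(row):
    """Return the names of the labels that are set to 1 in a row."""
    return [name for name, flag in zip(LABELS, row) if flag == 1]

# Number of labels set per row; in a multiclass problem this would always be 1.
label_counts = [sum(row) for row in rows]
```

Because the number of active labels per row varies, such a problem is usually modeled with one independent sigmoid output per label rather than a single softmax over classes.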

References:

Load the Libraries

Useful Functions

GPU Testing

Load Training Data

Data Processing: Training Data

Shuffle and create ohe column
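
A minimal sketch of this step, using plain Python dictionaries in place of a DataFrame so that it runs without extra libraries. The column name `one_hot_labels`, the `comment_text` key, the seed, and the helper names are assumptions made for illustration.

```python
import random

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def make_record(text, flags):
    """Build one row as a dict (hypothetical helper for this example)."""
    return {"comment_text": text, **dict(zip(LABEL_COLS, flags))}

def shuffle_and_add_ohe(records, seed=42):
    """Return a shuffled copy of the rows, each with a combined label vector."""
    shuffled = list(records)               # copy, so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return [
        {**rec, "one_hot_labels": [rec[col] for col in LABEL_COLS]}
        for rec in shuffled
    ]

# Hypothetical rows, invented for illustration.
records = [
    make_record("example a", [0, 0, 0, 0, 0, 0]),
    make_record("example b", [1, 0, 1, 0, 0, 0]),
    make_record("example c", [1, 0, 0, 0, 1, 0]),
]
prepared = shuffle_and_add_ohe(records)
```

Collecting the six label columns into one vector per row gives a single target that can later be turned into a tensor.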

Load pretrained tokenizer

Transformers pretrained tokenizers

BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) 

XLNet:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=False) 

RoBERTa:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)

Get Encodings from tokenizer

Find unique ohe indices
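
One plausible reading of this step, assumed here rather than stated in the text, is to locate rows whose label combination occurs only once. A stratified train/validation split needs at least two rows per stratum, so such rows are typically handled separately. The function name below is hypothetical.

```python
from collections import Counter

def find_singleton_indices(one_hot_labels):
    """Return the indices of rows whose label combination appears exactly once."""
    keys = [tuple(row) for row in one_hot_labels]  # lists are unhashable, tuples are not
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]

# Hypothetical two-label example: only the row at index 2 has a unique combination.
example_labels = [[1, 0], [1, 0], [0, 1], [0, 0], [0, 0]]
singleton_idx = find_singleton_indices(example_labels)
```

The rows at these indices could, for example, be added to the training set after the stratified split is made on the remaining rows.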

Get train validation tensors

Create DataLoader

Load the Model for Sequence Classification

Choose Optimizer

Set custom optimization parameters for the AdamW optimizer. See https://huggingface.co/transformers/main_classes/optimizer_schedules.html

Train Model

Load and Preprocess Test Data

Tokenize Test Data

Create Tensors for Test Data

Create DataLoader for Test Data

Get the Predictions from Test Data

Model Evaluation

Confusion Matrix

Classification Report

Co-occurrence Matrix
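
A minimal sketch of how a label co-occurrence matrix can be computed from binary label rows. Entry (i, j) counts the rows in which labels i and j are both 1, so the diagonal holds the per-label counts. The function name and the sample rows are assumptions for illustration.

```python
def cooccurrence_matrix(label_rows, n_labels):
    """Count, for each pair of labels, the rows in which both are set."""
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for row in label_rows:
        active = [i for i, flag in enumerate(row) if flag]
        for i in active:
            for j in active:
                matrix[i][j] += 1
    return matrix

# Hypothetical three-label rows, invented for illustration.
sample_rows = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
cooc = cooccurrence_matrix(sample_rows, n_labels=3)
```

The same function can be applied separately to the true labels and to the thresholded predictions to compare which label pairs the model tends to predict together.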

Optimize Micro and Macro F1 Scores
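
One common way to optimize these scores, sketched here under the assumption that the model outputs one probability per label, is to scan candidate decision thresholds and keep the one with the best score. The function names and sample values are hypothetical, and the metrics are written out in plain Python rather than taken from a library.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_prob, threshold):
    """Return (micro F1, macro F1) for predictions cut at one global threshold."""
    n_labels = len(y_true[0])
    per_label_f1 = []
    total_tp = total_fp = total_fn = 0
    for j in range(n_labels):
        tp = fp = fn = 0
        for true_row, prob_row in zip(y_true, y_prob):
            pred = prob_row[j] >= threshold
            if pred and true_row[j]:
                tp += 1
            elif pred:
                fp += 1
            elif true_row[j]:
                fn += 1
        per_label_f1.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1_from_counts(total_tp, total_fp, total_fn)  # pooled over all labels
    macro = sum(per_label_f1) / n_labels                  # unweighted mean over labels
    return micro, macro

def best_threshold(y_true, y_prob, candidates, metric="micro"):
    """Pick the candidate threshold that maximizes the chosen F1 variant."""
    idx = 0 if metric == "micro" else 1
    return max(candidates, key=lambda t: micro_macro_f1(y_true, y_prob, t)[idx])

# Hypothetical two-label example, invented for illustration.
y_true = [[1, 0], [0, 1], [1, 1]]
y_prob = [[0.9, 0.2], [0.4, 0.8], [0.6, 0.7]]
chosen = best_threshold(y_true, y_prob, candidates=[0.3, 0.5, 0.95])
```

Micro F1 is dominated by frequent labels such as toxic, while macro F1 weights rare labels such as threat equally, so the two can prefer different thresholds. Choosing the threshold on a validation split rather than on the test data keeps the test scores unbiased.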

Time Taken