Description

In this project, we use the data from the Kaggle competition Toxic Comment Classification Challenge by Jigsaw, taking only the training data. We then split this raw training data into train and test sets and evaluate model performance on the test set.

The dataset is taken from Wikipedia talk page comments, and each comment is labelled against the following toxicity categories:

  1. toxic
  2. severe_toxic
  3. obscene
  4. threat
  5. insult
  6. identity_hate

This is a multi-label (not multi-class) classification problem. Each text row carries six binary labels; any number of them can be 1 at the same time, and a row with all six labels set to 0 is a clean comment.

References:

Deep Learning NLP

Transformers Models

Notes:

  1. As we are not using an RNN, we have to limit the sequence length to the model's maximum input size.
  2. Most of the models require special tokens placed at the beginning and end of the sequences.
  3. Some models, like RoBERTa, require a space at the start of the input string. For those models, the encoding methods should be called with add_prefix_space set to True; see the sketch after the token layouts below.

bert:       [CLS] + tokens + [SEP] + padding

roberta:    [CLS] + prefix_space + tokens + [SEP] + padding

distilbert: [CLS] + tokens + [SEP] + padding

xlm:        [CLS] + tokens + [SEP] + padding

xlnet:      padding + tokens + [SEP] + [CLS]
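
For example, a minimal sketch of preparing a RoBERTa tokenizer with the prefix space enabled (the sample text and max_length are illustrative):

from transformers import RobertaTokenizer

# add_prefix_space=True makes RoBERTa tokenize the first word the same way as any other word
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
encoded = tokenizer.encode_plus('this comment is fine',
    add_special_tokens=True,
    max_length=16,
    padding='max_length',
    truncation=True)
print(encoded['input_ids'])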

Load the Libraries

Useful Functions

GPU Testing
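
A minimal sketch of the GPU check (no model-specific assumptions):

import torch

# run on the GPU when one is visible, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print('No GPU found, using CPU')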

Load Training Data
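
A minimal sketch, assuming the competition file is stored locally as train.csv:

import pandas as pd

# raw Jigsaw training file: one comment per row plus six binary label columns
df = pd.read_csv('train.csv')
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
print(df.shape)
print(df[label_cols].sum())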

Data Processing: Training Data

Shuffle and create ohe column
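
One possible way to do this, assuming the frame and label columns from the loading step (the column name ohe is illustrative):

# shuffle the rows and pack the six binary labels into a single list-valued column
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
df['ohe'] = df[label_cols].values.tolist()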

Choose Transformers Model

Load pretrained tokenizer

Transformers pretrained tokenizers

BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) 

XLNet:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=False) 

RoBERTa:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)

DistilBert:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased', do_lower_case=False)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased', do_lower_case=False)

transformers.<ModelName>Tokenizer.from_pretrained

from_pretrained(pretrained_model_name_or_path, *init_inputs, **kwargs)

DistilBERT on Hugging Face: https://huggingface.co/transformers/model_doc/distilbert.html

transformers.DistilBertConfig(
vocab_size=30522,
max_position_embeddings=512,
sinusoidal_pos_embds=False, 
n_layers=6, 
n_heads=12, 
dim=768, 
hidden_dim=3072, 
dropout=0.1, 
attention_dropout=0.1, 
activation='gelu', 
initializer_range=0.02, 
qa_dropout=0.1, 
seq_classif_dropout=0.2, 
pad_token_id=0, 
**kwargs)

transformers.DistilBertTokenizerFast(vocab_file,
tokenizer_file=None,
do_lower_case=True,
unk_token='[UNK]',
sep_token='[SEP]',
pad_token='[PAD]',
cls_token='[CLS]',
mask_token='[MASK]',
tokenize_chinese_chars=True,
strip_accents=None,
**kwargs)

Get Encodings from tokenizer

Tokenizer: https://huggingface.co/transformers/main_classes/tokenizer.html

batch_encode_plus(batch_text_or_text_pairs,
add_special_tokens=True,
padding=False,
truncation=False,
max_length=None,
stride=0,
is_split_into_words=False,
pad_to_multiple_of=None,
return_tensors=None,
return_token_type_ids=None,
return_attention_mask=None,
return_overflowing_tokens=False,
return_special_tokens_mask=False,
return_offsets_mapping=False,
return_length=False,
verbose=True,
**kwargs)
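
For example, a sketch of encoding the training comments in one call (the max_length of 128 and the variable names are assumptions):

# encode every training comment; the attention mask marks the non-padding positions
encodings = tokenizer.batch_encode_plus(df['comment_text'].tolist(),
    add_special_tokens=True,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt')
input_ids = encodings['input_ids']
attention_masks = encodings['attention_mask']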

Find One-freq rows to exclude from stratify split
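
A sketch of one way to find label combinations that occur only once, since a single occurrence cannot be stratified:

# represent each row's six labels as a string key and count how often each combination occurs
df['label_str'] = df[label_cols].astype(str).apply(''.join, axis=1)
combo_counts = df['label_str'].value_counts()
one_freq_idx = df[df['label_str'].isin(combo_counts[combo_counts == 1].index)].index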

Get train validation tensors
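
A sketch of the stratified split, assuming input_ids, attention_masks, label_cols and one_freq_idx from the previous steps; the validation fraction is an assumption:

import numpy as np
import torch
from sklearn.model_selection import train_test_split

labels = torch.tensor(df[label_cols].values, dtype=torch.float)

# hold the singleton combinations out of the stratified split (they can be appended to the training side afterwards)
keep_idx = df.index.difference(one_freq_idx).values
train_idx, val_idx = train_test_split(keep_idx,
    test_size=0.1,
    random_state=42,
    stratify=df.loc[keep_idx, 'label_str'].values)

train_inputs, val_inputs = input_ids[train_idx], input_ids[val_idx]
train_masks, val_masks = attention_masks[train_idx], attention_masks[val_idx]
train_labels, val_labels = labels[train_idx], labels[val_idx]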

Get TensorDataset, Sampler and DataLoader
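
A sketch of wrapping the tensors, with a batch size chosen as an assumption:

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32  # illustrative

# random order for training batches, fixed order for validation batches
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)

val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)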

Load the Model for Sequence Classification
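
A minimal sketch, assuming the cased DistilBERT checkpoint from above and six output labels:

from transformers import DistilBertForSequenceClassification

# one logit per toxicity label; the multi-label loss is applied manually in the training loop below
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=6)
model.to(device)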

Choose Optimizer

Optimizer AdamW

transformers.AdamW(params,
lr=1e-3,
betas=(0.9, 0.999),
eps=1e-6,
weight_decay=0.0,
correct_bias=True)
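
In practice the learning rate is set far lower than the default for fine-tuning; a sketch (the values are assumptions):

from transformers import AdamW

# typical fine-tuning settings for a pretrained transformer encoder
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)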

Train Model using Torch

BCE with Logit Loss
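
A condensed sketch of one possible training loop; the number of epochs and the gradient-clipping value are assumptions:

import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()
epochs = 3

model.train()
for epoch in range(epochs):
    for batch in train_dataloader:
        b_input_ids, b_masks, b_labels = (t.to(device) for t in batch)
        optimizer.zero_grad()
        # forward pass without labels so the multi-label loss can be applied to the raw logits
        logits = model(b_input_ids, attention_mask=b_masks)[0]
        loss = loss_fn(logits, b_labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()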

Load and Preprocess Test Data

Tokenize Test Data

Create Tensors for Test Data

Create DataLoader for Test Data

Get the Predictions from Test Data
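
A sketch of the prediction step, assuming a test_dataloader built the same way as the training loader (input ids and attention masks only) and a 0.5 decision threshold:

import numpy as np
import torch

model.eval()
pred_probs = []
with torch.no_grad():
    for batch in test_dataloader:
        b_input_ids, b_masks = (t.to(device) for t in batch[:2])
        logits = model(b_input_ids, attention_mask=b_masks)[0]
        # sigmoid gives an independent probability per label
        pred_probs.append(torch.sigmoid(logits).cpu().numpy())

pred_probs = np.vstack(pred_probs)
pred_labels = (pred_probs >= 0.5).astype(int)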

Model Evaluation

Confusion Matrix
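
Because this is multi-label, one 2x2 confusion matrix per label is the natural view; a sketch, assuming true_labels holds the held-out test labels as a 0/1 matrix:

from sklearn.metrics import multilabel_confusion_matrix

# one confusion matrix per toxicity label
for name, cm in zip(label_cols, multilabel_confusion_matrix(true_labels, pred_labels)):
    print(name)
    print(cm)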

Classification Report
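
A sketch of the per-label report under the same assumptions:

from sklearn.metrics import classification_report

# precision, recall and F1 for each of the six labels
print(classification_report(true_labels, pred_labels, target_names=label_cols))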

Co-occurrence Matrix
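
A sketch of counting how often pairs of labels are predicted together (the same can be computed for the true labels):

import pandas as pd

# co_occurrence[i, j] = number of comments predicted to carry both label i and label j
co_occurrence = pred_labels.T @ pred_labels
print(pd.DataFrame(co_occurrence, index=label_cols, columns=label_cols))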

Time Taken