Description

In this project, we use data from the Kaggle competition Toxic Comment Classification Challenge by Jigsaw, and only its training data. We split this raw training data into train and test sets and evaluate model performance on the test set.
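A minimal sketch of such a split, assuming scikit-learn's `train_test_split`. The toy frame stands in for the competition's training file, and the 20% test size and the seed are illustrative assumptions, not values taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled training data (hypothetical rows).
df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Hold out part of the labelled rows as a test set.
# test_size and random_state are assumptions chosen for illustration.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=100)

print(df_train.shape, df_test.shape)
```

Fixing `random_state` keeps the split reproducible, so model scores on the test set stay comparable across runs.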

The dataset is taken from Wikipedia edit text, and each comment is labelled for the following six toxicity types:

  1. toxic
  2. severe_toxic
  3. obscene
  4. threat
  5. insult
  6. identity_hate

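This is a multi-label (not multiclass) classification problem. Each text row has six binary labels, and any number of them can be 1: a clean comment has all six set to 0, while a single comment can be, for example, both toxic and obscene.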
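To make the multi-label layout concrete, here is a small sketch with made-up rows (the row names and flag values are hypothetical, only the six label names come from the dataset). Each row carries one 0/1 flag per label, and the flags are independent of each other.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical example rows: one 0/1 flag per label, in LABELS order.
rows = {
    "clean comment": [0, 0, 0, 0, 0, 0],
    "single label":  [1, 0, 0, 0, 0, 0],
    "three labels":  [1, 0, 1, 0, 1, 0],
}

# Collect the names of the labels that are switched on for each row.
active = {
    name: [label for label, flag in zip(LABELS, flags) if flag]
    for name, flags in rows.items()
}

for name, labels in active.items():
    print(name, "->", labels)
```

In a multiclass problem every row would have exactly one active label; here a row can have none, one, or several.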

Load the libraries

Parameters

Load the Data

Multilabel Visualization

Reference: https://www.kaggle.com/loganathanspr/toxic-comments-insight-into-datasets

Top 30 words per comment type

Now comes the meaty part. What kind of vocabulary is used in the different types of comments? We are especially interested in bad comments in general. Let's find the top 30 words for each comment type in the training data. We do this by computing the TF-IDF of the training set and then finding the most important words for each comment category.