Table of Contents

Introduction

Data Description

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

toxic
severe_toxic
obscene
threat
insult
identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.
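
As a minimal sketch of what "a probability of each type of toxicity for each comment" can look like in code, the example below fits one independent logistic regression per label on TF-IDF features and collects the positive-class probabilities into a comments-by-labels array. The toy comments, their labels, and the choice of logistic regression are illustrative assumptions, not the data or model used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up toy comments and labels, for illustration only.
texts = [
    "thanks for the helpful edit",
    "you are an idiot",
    "i will hurt you",
    "this is damn garbage",
    "great article, well sourced",
    "you people are all stupid scum",
]
y = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
])

# Turn the raw text into a sparse TF-IDF feature matrix.
X = TfidfVectorizer().fit_transform(texts)

# One binary classifier per label; column j holds P(label j = 1) per comment.
probs = np.zeros((len(texts), len(LABELS)))
for j, label in enumerate(LABELS):
    clf = LogisticRegression()
    clf.fit(X, y[:, j])
    probs[:, j] = clf.predict_proba(X)[:, 1]

print(probs.shape)  # one row per comment, one column per toxicity type
```

Treating the labels independently is the simplest option because a comment can carry several labels at once; the six probabilities in a row are not expected to sum to 1.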

Imports

Useful Scripts

Load the Data

Class distribution

Select only text column

Check nans and duplicates
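
A minimal sketch of this check with pandas, assuming the comments live in a column named `comment_text` (the column name and the toy rows are assumptions made for illustration):

```python
import pandas as pd

# Toy frame standing in for the loaded data; `comment_text` is an assumed column name.
df = pd.DataFrame({"comment_text": ["hello", "hello", None, "bye"]})

# Count missing values and repeated comments (the first occurrence is not counted).
n_missing = int(df["comment_text"].isna().sum())
n_duplicates = int(df["comment_text"].duplicated().sum())

# Drop rows with missing text, then keep only the first copy of each comment.
clean = (
    df.dropna(subset=["comment_text"])
      .drop_duplicates(subset=["comment_text"])
      .reset_index(drop=True)
)

print(n_missing, n_duplicates, len(clean))
```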

Separate the classes

Word clouds

Treat the apostrophes
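
One common way to treat apostrophes is to normalize curly quotes and expand contractions before tokenizing. The sketch below is an assumption about what this step might do, not the notebook's own code; the helper name `expand_contractions` and the short rule list are hypothetical and deliberately incomplete (for example, the ambiguous `'s` and `'d` are left alone).

```python
import re

# Ordered rules: irregular forms first, then the generic suffixes.
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'m", " am"),
    (r"'ll", " will"),
    (r"'ve", " have"),
]

def expand_contractions(text):
    """Normalize curly apostrophes, then expand a few common contractions."""
    text = text.replace("\u2019", "'")
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("I won't go, they don\u2019t care"))
```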

Frequency Distribution
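
A word-frequency distribution can be sketched with the standard library alone. The tokenizer here (lowercase, runs of ASCII letters) and the helper name `word_frequencies` are simplifying assumptions; the toy comments are made up.

```python
import re
from collections import Counter

def word_frequencies(comments):
    """Count how often each lowercase alphabetic token appears across comments."""
    counts = Counter()
    for comment in comments:
        counts.update(re.findall(r"[a-z]+", comment.lower()))
    return counts

freq = word_frequencies(["Stop the edit war", "the edit was fine", "Stop it"])

# Show the most frequent tokens with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

Running the same function separately on the comments of each class makes it possible to compare which words dominate each type of toxicity.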

Tf-idf Vectorization
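
A minimal sketch of TF-IDF vectorization using scikit-learn's `TfidfVectorizer` with its default settings; the three toy comments are made up, and the real notebook may well use different parameters (n-grams, stop words, vocabulary limits).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "you are so toxic",
    "thanks for the edit",
    "toxic toxic comment",
]

# Learn the vocabulary and IDF weights, then transform the text to a sparse matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

# Each row is one comment, each column one vocabulary term.
print(X.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm="l2", every non-empty row has unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

In practice the vectorizer is fitted on the training comments only and then applied to the test comments with `transform`, so that both share one vocabulary.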