Table of Contents

Introduction

Data Description

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

toxic
severe_toxic
obscene
threat
insult
identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

Imports

Load the Data

Creating text features

Reference

Basic Text Processing

You are annoying!!! goJumpOff4Cliff pleaseeeeeeee

Step 1 - Remove Punctuation

Step 2 - Remove Digits

Step 3 - Split combined words

For instance, converting whyAreYou to why Are You

Step 4 - Convert to lowercase

Step 5 - Split each sentence into words

Step 6 - Remove Stop Words

Step 7 - Convert Word to Base Form or Lematize

Visualization