Table of Contents

Description

Twitter sentiment analysis.

Model Evaluation Metric: Weighted F-1 score.
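
As a quick reference, the weighted F-1 score can be computed with scikit-learn's `f1_score`; the labels below are made-up placeholders:

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted sentiment labels (0 = negative, 1 = positive).
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# average='weighted' computes the F-1 for each class and averages them,
# weighting each class by its number of true instances, which keeps the
# score honest when the sentiment classes are imbalanced.
print(f1_score(y_true, y_pred, average='weighted'))
```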

Load the libraries
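
A minimal sketch of the imports used in the rest of the walkthrough (an assumed list, since the notebook's exact imports are not shown):

```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, classification_report
```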

Load the data
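
A sketch of the loading step; the file name (`tweets.csv`) and the column names (`tweet`, `sentiment`) are assumptions to be adjusted for the actual dataset:

```python
import pandas as pd

# Assumed file name and column names; adjust to match the real tweet dataset.
df = pd.read_csv('tweets.csv')

X = df['tweet']       # raw tweet text (assumed column name)
y = df['sentiment']   # sentiment label (assumed column name)

print(df.shape)
print(df.head())
```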

Train test split
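
A standard stratified split; the 80/20 ratio and the random seed are assumed values:

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions the same in both splits, which
# matters when evaluating with a weighted F-1 score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```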

Modelling

TF-IDF Vectorizing

Term Frequency (TF): This measures how often a given term appears within a document.

$\mathrm{TF}=\frac{\text{Number of times the term appears in the doc}}{\text{Total number of words in the doc}}$

Inverse Document Frequency (IDF): This measures how often the term appears across the documents. If a term is very common among documents (e.g., “the”, “a”, “is”), its IDF score is low.

$\mathrm{IDF}=\ln\left(\frac{\text{Number of docs}}{\text{Number of docs the term appears in}}\right)$

Term Frequency – Inverse Document Frequency (TF-IDF): TF-IDF is the product of a term's TF and IDF scores.

$\mathrm{TF\text{-}IDF}=\mathrm{TF} \times \mathrm{IDF}$
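
As an illustration (the numbers are made up): suppose a term appears 3 times in a 100-word tweet and occurs in 10 of the 1,000 documents in the corpus. Then

$\mathrm{TF}=\frac{3}{100}=0.03, \quad \mathrm{IDF}=\ln\left(\frac{1000}{10}\right)\approx 4.61, \quad \mathrm{TF\text{-}IDF}=0.03 \times 4.61 \approx 0.14$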

In scikit-learn, TF-IDF features are produced by the TfidfVectorizer class; its parameters and their default values are listed in the hyperparameter tuning section below.

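A minimal usage sketch, assuming the X_train/X_test text series from the split above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the training tweets only, then reuse the fitted vocabulary and
# IDF weights to transform the test tweets.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape)  # (n_train_tweets, vocabulary_size), sparse matrix
```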

LogisticRegressionCV
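
A sketch of this step; cv=5 and max_iter=1000 are assumed settings, and X_train_tfidf/y_train come from the vectorizing step above:

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# LogisticRegressionCV cross-validates the regularisation strength C internally.
log_reg = LogisticRegressionCV(cv=5, max_iter=1000)
log_reg.fit(X_train_tfidf, y_train)

print(f1_score(y_test, log_reg.predict(X_test_tfidf), average='weighted'))
```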

Linear SVC
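
A sketch with the default regularisation (C=1.0 is an assumed, untuned value):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# LinearSVC is a linear support vector classifier that scales well to
# high-dimensional sparse TF-IDF features.
svc = LinearSVC(C=1.0)
svc.fit(X_train_tfidf, y_train)

print(f1_score(y_test, svc.predict(X_test_tfidf), average='weighted'))
```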

GaussianNB
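
GaussianNB is the one model here that cannot consume sparse matrices, so the TF-IDF features must be densified first; a sketch:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

# .toarray() converts the sparse TF-IDF matrices to dense arrays, which
# can be memory-heavy for a large vocabulary.
gnb = GaussianNB()
gnb.fit(X_train_tfidf.toarray(), y_train)

print(f1_score(y_test, gnb.predict(X_test_tfidf.toarray()), average='weighted'))
```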

Random Forest
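
A sketch with assumed settings (n_estimators=100 and a fixed seed, not the notebook's tuned values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Random forests accept sparse input directly, so no densification is needed.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)

print(f1_score(y_test, rf.predict(X_test_tfidf), average='weighted'))
```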

SGD Classifier
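
A sketch using the default hinge loss (an assumption, matching sklearn's default):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

# With loss='hinge', SGDClassifier fits a linear SVM by stochastic
# gradient descent; the seed is fixed for reproducibility.
sgd = SGDClassifier(loss='hinge', random_state=42)
sgd.fit(X_train_tfidf, y_train)

print(f1_score(y_test, sgd.predict(X_test_tfidf), average='weighted'))
```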

Hyperparameter Tuning

Tuning TF-IDF Vectorizer using Pipeline

For reference, the full set of TfidfVectorizer parameters with their default values:

```python
TfidfVectorizer(
    input='content', encoding='utf-8', decode_error='strict',
    strip_accents=None, lowercase=True, preprocessor=None,
    tokenizer=None, analyzer='word', stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1),
    max_df=1.0, min_df=1, max_features=None, vocabulary=None,
    binary=False, dtype=<class 'numpy.float64'>, norm='l2',
    use_idf=True, smooth_idf=True, sublinear_tf=False,
)
```
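
A sketch of tuning the vectorizer jointly with a classifier via Pipeline and GridSearchCV; the grid values and the choice of LinearSVC as the final estimator are illustrative assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# Step-name prefixes ('tfidf__', 'clf__') route each parameter to the
# right pipeline stage; the values below are illustrative choices.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.9, 1.0],
    'tfidf__min_df': [1, 5],
    'tfidf__sublinear_tf': [False, True],
    'clf__C': [0.1, 1.0, 10.0],
}

# Optimise the weighted F-1 score, matching the evaluation metric above.
grid = GridSearchCV(pipe, param_grid, scoring='f1_weighted', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)  # raw text; the pipeline vectorizes internally

print(grid.best_params_)
print(grid.best_score_)
```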

Best Model
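
Assuming the grid search above, a sketch of evaluating the refit best pipeline on the held-out test set:

```python
from sklearn.metrics import classification_report, f1_score

# GridSearchCV refits the best pipeline on the full training set
# (refit=True by default), so best_estimator_ can be used directly.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print(f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))
```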