Data Description

Reference: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data

Original data: train_1.csv
-----------------------------
rows = 145,063
columns = 551
first column = Page
date columns = 2015-07-01, 2015-07-02, ..., 2016-12-31 (550 columns)
file size: 284.6 MB


Data for modelling: Prince Musician
-------------------------------------------------------
timeseries  : 2016 page visits for Prince 

lag columns : lag1 to lag7
bias        : bias column

For ARIMA   : a single univariate timeseries (one column)
For sklearn : linear regressors and ensemble learners can use many feature columns

Colab

Load the Libraries

Useful Scripts

MAPE - Mean Absolute Percentage Error: $$ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left|y_{i}-\hat{y}_{i}\right|}{y_{i}} $$

SMAPE - Symmetric Mean Absolute Percentage Error:

$$ \begin{aligned} \mathrm{SMAPE} &= \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left|y_{i} - \hat{y}_{i}\right|}{\left(\left|y_{i}\right| + \left|\hat{y}_{i}\right|\right) / 2} \\ &= \frac{200\%}{n} \sum_{i=1}^{n} \frac{\left|y_{i} - \hat{y}_{i}\right|}{\left|y_{i}\right| + \left|\hat{y}_{i}\right|} \end{aligned} $$
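
The two metrics above can be sketched in NumPy as follows. The function names `mape` and `smape` are illustrative, and treating a 0/0 term in SMAPE as zero error is an assumed convention, not something stated in this notebook.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes every y_true value is strictly positive (true for non-zero page visits).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (range 0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Assumed convention: when both actual and forecast are 0, count the term as 0.
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 200.0 * np.mean(ratio)

mape_demo = mape([100, 200], [110, 180])
smape_demo = smape([100, 200], [110, 180])
```

SMAPE stays bounded when an actual value is zero, which is one reason it is often preferred over MAPE for page-visit counts.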

Load the data

Choose Prince Musician data as timeseries
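
A minimal sketch of this step, assuming the wide layout described in the data description (a `Page` column followed by one column per date). The helper `select_page_series` and the toy page keys are hypothetical; the exact key of the Prince row in `train_1.csv` is not shown in this notebook, so look it up before using it.

```python
import pandas as pd

def select_page_series(df, page, year=None):
    """Return one page's visits as a float Series indexed by date."""
    row = df.loc[df["Page"] == page].drop(columns="Page")
    if row.empty:
        raise KeyError(f"page not found: {page}")
    ts = row.iloc[0].astype(float)
    ts.index = pd.to_datetime(ts.index)  # column labels are date strings
    ts.name = "visits"
    if year is not None:
        ts = ts[ts.index.year == year]   # keep a single calendar year
    return ts

# Toy frame standing in for train_1.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Page": ["PageA_en.wikipedia.org_all-access_all-agents",
             "PageB_en.wikipedia.org_all-access_all-agents"],
    "2015-12-31": [5, 1],
    "2016-01-01": [7, 2],
    "2016-01-02": [9, 3],
})
toy_ts = select_page_series(toy, "PageA_en.wikipedia.org_all-access_all-agents", year=2016)
```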

Data Preprocessing

Add lag columns
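
One way to build the `lag1` to `lag7` columns is with `Series.shift`. This is a sketch, not necessarily the notebook's own code; the column name `visits` and the helper `add_lags` are assumptions.

```python
import pandas as pd

def add_lags(ts, n_lags=7, target="visits"):
    """Return a frame with the target and lag1..lag{n_lags} columns.

    The first n_lags rows have incomplete history and are dropped.
    """
    df = pd.DataFrame({target: ts})
    for k in range(1, n_lags + 1):
        df[f"lag{k}"] = df[target].shift(k)  # value observed k days earlier
    return df.dropna()

# Toy series: visits 0, 1, ..., 9 on ten consecutive days.
lag_demo_ts = pd.Series(range(10), dtype=float,
                        index=pd.date_range("2016-01-01", periods=10, freq="D"))
lagged = add_lags(lag_demo_ts, n_lags=7)
```

Each lag column only contains values from earlier dates, so a model trained on them does not see the future.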

Add bias term
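
The bias term is a constant column of ones. The sketch below uses a hypothetical helper `add_bias`. Note that scikit-learn's `LinearRegression` already fits an intercept by default, so an explicit bias column mainly matters for estimators that do not, such as statsmodels' `OLS`.

```python
import pandas as pd

def add_bias(df, name="bias"):
    """Return a copy of df with a constant column of ones."""
    out = df.copy()
    out[name] = 1.0
    return out

bias_demo = add_bias(pd.DataFrame({"lag1": [3.0, 4.0, 5.0]}))
```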

Add timeseries features
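
The notebook does not list which timeseries features it adds. A common choice, shown here purely as an assumption, is calendar features derived from a `DatetimeIndex`; the helper `add_calendar_features` and its column names are illustrative.

```python
import pandas as pd

def add_calendar_features(df):
    """Return a copy of df with calendar features taken from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    out["dayofweek"] = idx.dayofweek                      # Monday=0 ... Sunday=6
    out["month"] = idx.month
    out["dayofyear"] = idx.dayofyear
    out["is_weekend"] = (idx.dayofweek >= 5).astype(int)  # Saturday or Sunday
    return out

cal_demo = add_calendar_features(
    pd.DataFrame({"visits": [10.0, 12.0, 11.0]},
                 index=pd.date_range("2016-01-01", periods=3, freq="D"))
)
```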

Modelling

Train Test split
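
For a timeseries the split should be chronological rather than shuffled, so the test set lies strictly after the training set. A minimal sketch follows; the 80/20 ratio and the helper name `chronological_split` are assumptions.

```python
import pandas as pd

def chronological_split(df, test_size=0.2):
    """Split df into (train, test), keeping the last test_size fraction as test."""
    n_test = int(round(len(df) * test_size))
    split = len(df) - n_test
    return df.iloc[:split], df.iloc[split:]

split_demo = pd.DataFrame({"visits": range(10)},
                          index=pd.date_range("2016-01-01", periods=10, freq="D"))
train_demo, test_demo = chronological_split(split_demo, test_size=0.2)
```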

Modelling: XGBoost

Adding timeseries features using tsfresh

Ref: https://github.com/blue-yonder/tsfresh/blob/main/notebooks/examples/01%20Feature%20Extraction%20and%20Selection.ipynb

Using Pipeline for tsfresh relevant feature augmenter

Ref: https://github.com/blue-yonder/tsfresh/blob/main/notebooks/examples/02%20sklearn%20Pipeline.ipynb

Cross validation for timeseries

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
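
The expanding-window scheme above corresponds to scikit-learn's `TimeSeriesSplit`. The sketch below uses six toy blocks labelled 1 to 6 to reproduce the fold layout; in practice each block would be a contiguous stretch of rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

blocks = np.arange(1, 7)            # six chronological blocks, labelled 1..6
tscv = TimeSeriesSplit(n_splits=5)  # training window grows by one block per fold

folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(blocks), start=1):
    train_blocks = blocks[train_idx].tolist()
    test_blocks = blocks[test_idx].tolist()
    folds.append((train_blocks, test_blocks))
    print(f"fold {fold} : training {train_blocks}, test {test_blocks}")
```

The same `tscv` object can be passed as the `cv` argument of `cross_val_score` or `GridSearchCV`, so every validation fold lies after its training data.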

Time Taken