Data Description

Reference: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data

Original data: train_1.csv
-----------------------------
rows = 145,063
columns = 551
first column = Page
date columns = 2015-07-01, 2015-07-02, ..., 2016-12-31 (550 columns)
file size: 284.6 MB


Data for time series
----------------------------------------------
selected year: 2016 (leap year, 366 days)

selected time series:

The most visited page.
df['Page'] == """Special:Search_en.wikipedia.org_desktop_all-agents"""
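
The selection described above can be sketched in pandas as follows. This is only an illustration: the helper name `select_page_year`, the tiny `demo` frame and its numbers are made up, and "most visited" is assumed here to mean the largest total over all date columns.

```python
import pandas as pd

def select_page_year(df, page, year):
    """Return one page's daily visits for a given year as a date-indexed Series."""
    row = df.loc[df["Page"] == page].drop(columns="Page")
    if row.empty:
        raise KeyError(f"page not found: {page}")
    ts = row.iloc[0]
    ts.index = pd.to_datetime(ts.index)          # column labels -> dates
    ts = ts[ts.index.year == year].astype(float)  # keep only the selected year
    ts.name = "visits"
    return ts

# Synthetic frame with the same layout as train_1.csv (hypothetical values).
dates = pd.date_range("2015-12-30", "2016-01-02").strftime("%Y-%m-%d")
demo = pd.DataFrame(
    [["page_a", 1, 2, 3, 4], ["page_b", 10, 20, 30, 40]],
    columns=["Page", *dates],
)

# Assumption: "most visited" = largest row sum across all dates.
most_visited = demo.set_index("Page").sum(axis=1).idxmax()
ts = select_page_year(demo, most_visited, 2016)
```

With the real file, `demo` would be replaced by `pd.read_csv("train_1.csv")` and the page name by the string shown above.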

Notes

Useful Functions

MAPE - Mean Absolute Percentage Error:

$$ \mathrm{MAPE}=\frac{100}{n} \sum_{i=1}^{n} \frac{\left|y_{i}-\hat{y}_{i}\right|}{\left|y_{i}\right|} $$

SMAPE - Symmetric Mean Absolute Percentage Error:

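$$ \mathrm{SMAPE}=\frac{100 \%}{n} \sum_{t=1}^{n} \frac{\left|F_{t}-A_{t}\right|}{\left(\left|A_{t}\right|+\left|F_{t}\right|\right) / 2} $$

Here $A_t$ is the actual value and $F_t$ the forecast. The same formula in the notation used for MAPE ($y_i$ actual, $\hat{y}_i$ forecast):

$$ \mathrm{SMAPE}=\frac{100 \%}{n} \sum_{i=1}^{n} \frac{\left|y_{i}-\hat{y}_{i}\right|}{\left(\left|y_{i}\right|+\left|\hat{y}_{i}\right|\right) / 2} $$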
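
A minimal NumPy sketch of the two metrics above. The function names `mape` and `smape` and the sample numbers are illustrative. Zero denominators are not handled here, so a zero actual value (MAPE) or a zero actual and forecast pair (SMAPE) would need special treatment.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Made-up actuals and forecasts, for illustration only.
y_true = [100, 200, 400]
y_pred = [110, 180, 400]
mape_value = mape(y_true, y_pred)
smape_value = smape(y_true, y_pred)
```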

Load the data

Modelling Timeseries

Modelling: ARIMA for non-seasonal timeseries

ARIMA(p, d, q)

AR - autoregressive (p = number of past observations used, i.e. the past-day effect)
I  - integrated (d = order of differencing)
MA - moving average (q = number of past forecast errors used)

stationary = the mean and variance do not change over time

To make a timeseries stationary we can use differencing:

differencing: ydiff(t)  = y(t) - y(t-1)
first order : ydiff1(t) = y(t) - y(t-1)
second order: ydiff2(t) = ydiff1(t) - ydiff1(t-1)
                        = [y(t) - y(t-1)] - [y(t-1) - y(t-2)]
                        = y(t) - 2y(t-1) + y(t-2)
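
The differencing identities above can be checked numerically. The sketch below uses `numpy.diff` on a small made-up series and compares the second-order difference with the expanded form y(t) - 2y(t-1) + y(t-2).

```python
import numpy as np

# Made-up series, for illustration only.
y = np.array([3.0, 5.0, 10.0, 18.0, 29.0])

d1 = np.diff(y, n=1)   # first order:  y(t) - y(t-1)
d2 = np.diff(y, n=2)   # second order: difference of the first differences

# Expanded form of the second-order difference.
expanded = y[2:] - 2.0 * y[1:-1] + y[:-2]
```

Each order of differencing shortens the series by one point. On a pandas Series, `Series.diff()` keeps the length and puts NaN in the first position instead.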

Models:

T = trend 
C = cycle
S = seasonality
R = residual

Additive model: Y = T + C + S + R 
Multiplicative model: Y = T * C * S * R
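
One way to relate the two models: when every component is positive, taking logarithms turns the multiplicative model into an additive one, since log(Y) = log(T) + log(C) + log(S) + log(R). The sketch below checks this on synthetic components. All shapes and constants are made up for illustration.

```python
import numpy as np

t = np.arange(1, 366 + 1)                    # day index for a 366-day year
T = 100.0 + 0.5 * t                          # trend
C = 1.0 + 0.1 * np.sin(2 * np.pi * t / 180)  # slow cycle
S = 1.0 + 0.2 * np.sin(2 * np.pi * t / 7)    # weekly seasonality
rng = np.random.default_rng(0)
R = np.exp(rng.normal(0.0, 0.05, size=t.size))  # positive residual factor

Y = T * C * S * R                            # multiplicative model

# In log space the same series is a sum of components.
log_sum = np.log(T) + np.log(C) + np.log(S) + np.log(R)
is_additive_in_logs = np.allclose(np.log(Y), log_sum)
```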

References

Find Auto Regressive AR term p (partial correlation)

We can use Partial Autocorrelation (PACF) plot to estimate the autoregressive term p.
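
To make the idea concrete, here is a NumPy-only sketch, not necessarily what this notebook used (statsmodels provides `plot_pacf` for the plot itself). It estimates the partial autocorrelation at lag k as the last coefficient of an AR(k) least-squares fit, and tests it on a simulated AR(1) series. The helper name `pacf_ols`, the coefficient 0.7 and the sample size are assumptions for illustration.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Partial autocorrelations for lags 0..max_lag via successive AR(k) OLS fits."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    out = [1.0]
    for k in range(1, max_lag + 1):
        target = x[k:]
        # Column j holds the series lagged by j steps, j = 1..k.
        lags = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
        out.append(coef[-1])   # coefficient on the longest lag
    return np.array(out)

# Simulated AR(1) process: x(t) = 0.7 * x(t-1) + noise.
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.7 * x[i - 1] + noise[i]

pacf = pacf_ols(x, max_lag=5)
bound = 1.96 / np.sqrt(n)   # approximate 95% significance band
significant_lags = [k for k in range(1, 6) if abs(pacf[k]) > bound]
```

For an AR(1) series the partial autocorrelation should be large at lag 1 and close to zero afterwards. A common rule of thumb takes p as the last lag that clearly leaves the significance band.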

Find Integrated I term d (differencing, autocorrelation)

Find Moving Average MA term q

ARIMA Modelling

Model Evaluation

Auto ARIMA

Interpret the residual plots in ARIMA model

Plot diagnostics:

Fig1: The residuals over time (top left plot) don't display any obvious seasonality and appear to be white noise. There are two peaks, but they might be outliers.

Fig2: In the top right plot, the red KDE line does NOT closely follow the normal distribution N(0,1) line. So the residuals are NOT normally distributed. NOT good.

Fig3: The Q-Q plot on the bottom left shows that the ordered residuals (blue dots) do NOT follow the straight line expected for samples from a standard normal distribution N(0, 1). Again, this is a strong indication that the residuals are NOT normally distributed.

Fig4: (bottom right) This is the autocorrelation plot (correlogram) of the residuals against lagged versions of themselves. It shows little correlation: almost all of the correlations lie within the significance region, so little seasonality is left in the residuals.
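
The correlogram check can also be done numerically. The sketch below computes the sample autocorrelation of a residual series and counts how many lags fall inside the approximate 95% band of ±1.96/sqrt(n). The residuals here are simulated white noise standing in for the model residuals, and the helper name `acf` is illustrative.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(max_lag + 1)])

# Stand-in residuals: simulated white noise (assumption, not the notebook's data).
rng = np.random.default_rng(0)
residuals = rng.normal(size=1000)

r = acf(residuals, max_lag=20)
bound = 1.96 / np.sqrt(len(residuals))        # approximate 95% band
inside = int(np.sum(np.abs(r[1:]) <= bound))  # lags 1..20 inside the band
```

If most of the lags are inside the band, as expected for white noise, the residuals carry little remaining autocorrelation.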

Meaning: The fit is not very good: the residuals are roughly uncorrelated, but they are not normally distributed.
Further work: We may want to normalize the data. We may want to deal with outliers.

Future Predictions