notebooks
The notebooks directory has all the notebooks for exploratory data analysis, visualization, modelling and model interpretation.
Data source: Kaggle. The first column is the name of the page and the remaining 550 columns are the visit dates.
Original data: train_1.csv
-----------------------------
rows = 145,063
columns = 551
first column = Page
date columns = 2015-07-01, 2015-07-02, ..., 2016-12-31 (550 columns)
file size: 284.6 MB
Date columns:
------------------
Jul/2015 - 31 days
Aug/2015 - 31 days
Sep/2015 - 30 days
Oct/2015 - 31 days
Nov/2015 - 30 days
Dec/2015 - 31 days
Total : 184 days
Year 2016 : 366 days (leap year)
Total : 550 days
NOTE:
For this dataset, missing data is represented by 0.
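For reference, a minimal sketch of loading the file and applying that convention (the data/ path is an assumption about where train_1.csv lives):

import pandas as pd

wide = pd.read_csv("data/train_1.csv")   # path is an assumption
print(wide.shape)                        # (145063, 551)
print(wide.columns[0])                   # 'Page'
print(wide.columns[1], wide.columns[-1]) # '2015-07-01', '2016-12-31'

# Per the note above, missing data is treated as 0.
wide = wide.fillna(0)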
Time series selected for modelling:
ARIMA: most visited page ==> Special:Search_en.wikipedia.org_desktop_all-agents
       visits = 675,606,021
VAR: VAR needs correlated time series, like the opening and closing prices of a stock.
     Here I took the top page per language just to see how VAR models work on the
     Wikipedia dataset.
Scikit-learn: For the usual regressors (linear, lasso, ridge) and also for the ensemble
     method XGBRegressor, I used the most visited page.
fbprophet: For the Facebook Prophet time series modelling module, I used a random
     time series: the "Now You See Me" page in Spanish.
deep-learning: For deep learning algorithms like LSTM and GRU, I used the same
     time series as for fbprophet.
The best SMAPE for the given time series was obtained by XGBoost using features extracted with tsfresh (a sketch of that setup follows the results table below).
Model | Description | MAPE | SMAPE | RMSE | ME | MAE | MPE | CORR | MINMAX | ACF1 |
---|---|---|---|---|---|---|---|---|---|---|
xgb | tsfresh | 1 | 0.6356 | 337 | 43 | 115 | 0 | 0.9991 | 0.0063 | -0.2886 |
xgb | default | 1 | 1.4308 | 453 | 9 | 224 | 0 | 0.9978 | 0.0141 | -0.3971 |
XGBRegressor | default | 18 | 18.2580 | 4,513 | 687 | 2,331 | 0 | 0.6643 | 0.1565 | 0.1207 |
LassoCV | ts_split=3 | 266 | 110.8415 | 25,829 | -25,336 | 25,537 | -3 | 0.5769 | 0.7062 | -0.4231 |
RidgeCV | ts_split=3 | 261 | 118.8720 | 31,289 | -15,694 | 25,228 | -2 | -0.0255 | 0.8816 | 0.6251 |
LinearRegression | default | 365 | 135.3122 | 43,579 | -17,255 | 35,357 | -2 | -0.1236 | 1.2735 | 0.6457 |
LinearRegression | scaled | 33,841,890 | 199.9984 | 4,378,715,364 | -3,640,663,624 | 3,640,663,624 | -338,419 | 0.5725 | 1.0000 | 0.0784 |
lstm | lags=2,minmax-scaler | 25 | 24.6649 | 6,524 | 353 | 3,482 | -0 | 0.6507 | 0.2056 | 0.6702 |
gru | lags=2 | 40 | 53.1378 | 8,700 | 5,739 | 5,739 | 0 | nan | 0.4031 | 0.6727 |
gru | lags=2,minmax-scaling | 58 | 83.9143 | 8,932 | 7,146 | 7,192 | 1 | 0.5818 | 0.5815 | 0.0733 |
lstm | lags=2 | 99 | 197.2470 | 13,684 | 12,021 | 12,021 | 1 | 0.0502 | 0.9931 | 0.6727 |
fbprophet | seasonality_after_cap_floor | 65 | 100.4473 | 9,009 | 3,603 | 7,426 | 0 | 0.2990 | 0.5764 | 0.4837 |
fbprophet | seasonality_before_cap_floor | 423 | 139.6339 | 54,487 | -3,904 | 44,547 | -0 | 0.1662 | 2.3876 | 0.5637 |
fbprophet | after_cap_floor | 82 | 147.0780 | 12,655 | 7,658 | 10,089 | 1 | -0.0811 | 0.7741 | 0.4794 |
fbprophet | default | 437 | 171.8289 | 54,429 | 25,011 | 48,699 | 2 | -0.2529 | 3.3718 | 0.4491 |
fbprophet | before_cap_floor | 437 | 171.8289 | 54,429 | 25,011 | 48,699 | 2 | -0.2529 | 3.3718 | 0.4491 |
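Since the best result above comes from XGBoost on tsfresh features, here is a minimal sketch of that idea: roll a window over a single page's series, extract tsfresh features per window, and regress the next day's visits with XGBRegressor. The window length, feature settings and variable names are illustrative assumptions, not the exact setup from the notebooks.

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from xgboost import XGBRegressor

# series: pd.Series of daily visits for one page (assumed prepared elsewhere)
window = 30  # assumed window length

# Build a long frame of rolling windows: one window id per target day.
rows = []
for i in range(window, len(series)):
    chunk = series.iloc[i - window:i]
    rows.append(pd.DataFrame({"id": i, "time": range(window), "value": chunk.values}))
long_df = pd.concat(rows, ignore_index=True)

# Extract a small, fast feature set for each window.
X = extract_features(long_df, column_id="id", column_sort="time",
                     default_fc_parameters=MinimalFCParameters())
y = series.iloc[window:].values  # next-day visits to predict

model = XGBRegressor(n_estimators=200)
model.fit(X, y)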
The dataset is quite clean; I did not have to do much preprocessing. One thing to note is that NaNs are represented by 0. This means that if a page shows 0 visits, either actually 0 people visited the page or the data is simply not available for that day. The first column is Page and the remaining 550 columns are dates. From the dates we can create datetime features for visualization and for the linear regression models.
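A minimal sketch of that reshaping step, melting the wide frame into one row per page per day (the column names 'date' and 'visits' are my choice; 'wide' is the frame loaded in the sketch near the top of this README):

import pandas as pd

# One row per (Page, date) pair instead of 550 date columns.
df = wide.melt(id_vars="Page", var_name="date", value_name="visits")
df["date"] = pd.to_datetime(df["date"])

With the datetime column in place, the calendar features below can be derived: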
df['year'] = df['date'].dt.year # yyyy
df['month'] = df['date'].dt.month # 1 to 12
df['day'] = df['date'].dt.day # 1 to 31
df['quarter'] = df['date'].dt.quarter # 1 to 4
df['dayofweek'] = df['date'].dt.dayofweek # 0 to 6
df['dayofyear'] = df['date'].dt.dayofyear # 1 to 366 (leap year)
df['weekend'] = ((df['date'].dt.dayofweek) // 5 == 1)
df['weekday'] = ((df['date'].dt.dayofweek) // 5 != 1)
df['day_name'] = df['date'].dt.day_name() # Monday
df['month_name'] = df['date'].dt.month_name() # January
For time series visualization, plotly is a better tool. For visualization purposes, I looked only at the data from 2016.
# of unique pages visited in 2016: 14,506
Top visited page: Special:Search_en.wikipedia.org_desktop_all-agents (675,606,021 visits)
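For example, a line plot of one page's 2016 traffic with plotly express might look like the sketch below (df is the long frame built above; the page name is the top-visited page listed here):

import plotly.express as px

page = "Special:Search_en.wikipedia.org_desktop_all-agents"
one_page_2016 = df[(df["Page"] == page) & (df["date"].dt.year == 2016)]

fig = px.line(one_page_2016, x="date", y="visits",
              title=f"Daily visits in 2016: {page}")
fig.show()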
To fit a linear regression to a given dataset, the data needs to satisfy certain conditions, called the assumptions of linear regression. Since ARIMA fits a linear regression that takes into account the autocorrelation of the series with its own past, it is still a linear regression, so we can check some of these assumptions.
Test of normality: Using the Shapiro-Wilk normality test, I found the time series is NOT normally distributed.
Test of stationarity: I used the Augmented Dickey-Fuller test to determine whether the given time series is stationary. For this particular page, I found that the time series is stationary.
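A minimal sketch of both tests on a single page's series (scipy's Shapiro-Wilk and statsmodels' ADF; the 0.05 threshold is an assumption):

from scipy.stats import shapiro
from statsmodels.tsa.stattools import adfuller

page = "Special:Search_en.wikipedia.org_desktop_all-agents"
series = df[df["Page"] == page].sort_values("date")["visits"].reset_index(drop=True)

# Shapiro-Wilk: null hypothesis = the data is normally distributed.
stat, p_norm = shapiro(series)
print("normally distributed" if p_norm > 0.05 else "NOT normally distributed")

# Augmented Dickey-Fuller: null hypothesis = the series has a unit root (non-stationary).
adf_stat, p_adf, *rest = adfuller(series)
print("stationary" if p_adf < 0.05 else "non-stationary")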
For time series, ARIMA (or SARIMAX) is probably the most popular algorithm to try. I used both the usual ARIMA model from statsmodels and also the dedicated library pmdarima to fit the ARIMA model. The details are explained in the notebook.
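A minimal sketch of the two routes (manual order with statsmodels vs. automatic order search with pmdarima); the (p, d, q) order and the 30-day holdout are illustrative assumptions, not the tuned values from the notebook:

from statsmodels.tsa.arima.model import ARIMA
import pmdarima as pm

# series: the single-page pd.Series built in the test sketch above
train, test = series[:-30], series[-30:]   # assumed 30-day holdout

# statsmodels: you pick the (p, d, q) order yourself.
sm_fit = ARIMA(train, order=(2, 1, 2)).fit()
sm_forecast = sm_fit.forecast(steps=len(test))

# pmdarima: auto_arima searches over orders for you.
auto_fit = pm.auto_arima(train, seasonal=False, suppress_warnings=True)
pm_forecast = auto_fit.predict(n_periods=len(test))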
After doing the ARIMA modelling, I was curious what a VAR model would do with this Wikipedia time series. For the VAR method to be useful, the columns of the dataset must be related to each other, like the opening and closing prices of a stock. However, just to implement the algorithm and fiddle with the model, I took the top 5 pages per language and fitted the model.
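A minimal sketch of fitting VAR to a few related series with statsmodels (the wide frame of top pages and the lag order of 7 are assumptions):

from statsmodels.tsa.api import VAR

# top_pages: DataFrame with a datetime index and one column of daily visits
# per top page (an assumed wide frame, prepared elsewhere)
model = VAR(top_pages)
fit = model.fit(7)                                   # assumed lag order of 7 (one week)
forecast = fit.forecast(top_pages.values[-7:], steps=30)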
Then I went back in time and wanted to see how the usual sklearn models like linear regression, lasso and ridge would do with the time series data. I also tried ensemble models like XGBRegressor. XGBRegressor did pretty well and gave me a SMAPE of 6.65 on the training data. For a random page (the Spanish "Now You See Me" page), I got a SMAPE of 21.68 on the training data.
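A minimal sketch of that setup, using the calendar features built earlier as the design matrix (the feature list, the 30-day holdout and the hyperparameters are assumptions):

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from xgboost import XGBRegressor

page = "Special:Search_en.wikipedia.org_desktop_all-agents"
one_page = df[df["Page"] == page].sort_values("date")   # df: long frame built above

features = ["year", "month", "day", "quarter", "dayofweek", "dayofyear"]
X, y = one_page[features], one_page["visits"]
X_train, X_test = X[:-30], X[-30:]                      # assumed 30-day holdout
y_train, y_test = y[:-30], y[-30:]

for model in (LinearRegression(), LassoCV(), RidgeCV(), XGBRegressor(n_estimators=200)):
    model.fit(X_train, y_train)
    # smape() as defined at the bottom of this README
    print(type(model).__name__, smape(y_test.values, model.predict(X_test)))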
For time series forecasting, one of the most popular models is Prophet, open-sourced by Facebook. It is a pretty powerful and useful library for time series modelling.
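A minimal sketch of fitting Prophet to one page; Prophet expects exactly two columns named ds and y, and the 30-day horizon is an assumption:

from fbprophet import Prophet

# one_page: single-page frame from the scikit-learn sketch above
prophet_df = one_page[["date", "visits"]].rename(columns={"date": "ds", "visits": "y"})

m = Prophet()                     # add cap/floor columns and growth="logistic" for the
m.fit(prophet_df)                 # "cap_floor" variants in the results table
future = m.make_future_dataframe(periods=30)   # assumed 30-day horizon
forecast = m.predict(future)      # yhat, yhat_lower, yhat_upper per day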
Then I wanted to see how deep learning performs on time series modelling. In particular, I looked at models like LSTM and GRU, which can remember past data. I could not use a usual CNN since it does not remember past data points. LSTM did pretty well and gave me a SMAPE of 20.34 on the test dataset.
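A minimal sketch of the LSTM setup with lags=2, as in the results table (the framework, layer size and training settings are assumptions; the notebook may differ):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# series: pd.Series of daily visits for the chosen page (as built earlier)
lags = 2
scaled = MinMaxScaler().fit_transform(series.values.reshape(-1, 1))

# Supervised framing: predict day t from days t-2 and t-1.
X = np.array([scaled[i - lags:i, 0] for i in range(lags, len(scaled))])
y = scaled[lags:, 0]
X = X.reshape(-1, lags, 1)          # (samples, timesteps, features)

model = Sequential([LSTM(32, input_shape=(lags, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

The predictions would then be inverse-transformed with the same scaler before computing SMAPE.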
One of the most popular metrics for evaluating the performance of a time series model is SMAPE (Symmetric Mean Absolute Percentage Error). Its formula is given below:
$$
\mathrm{SMAPE}=\frac{100 \%}{n} \sum_{t=1}^{n} \frac{\left|F_{t}-A_{t}\right|}{\left(\left|F_{t}\right|+\left|A_{t}\right|\right) / 2}
$$
SMAPE lies between 0 and 200, where 0 is best and 200 is worst.
where F_t is the forecast and A_t is the actual value of the time series at time t.
Python implementation:
import numpy as np

def smape(A, F):
    """A = actual values, F = forecast values."""
    A, F = np.asarray(A, dtype=float), np.asarray(F, dtype=float)
    F = F[:len(A)]  # align the forecast length with the actuals
    return (200.0 / len(A) * np.sum(np.abs(F - A) /
            (np.abs(A) + np.abs(F) + np.finfo(float).eps)))
Despite the name Symmetric, SMAPE is not actually symmetric: over- and under-forecasts are not treated equally, as the following example from Wikipedia illustrates by applying the SMAPE formula:
Over-forecasting : At = 100 and Ft = 110 gives SMAPE = 4.76%
Under-forecasting: At = 100 and Ft = 90 gives SMAPE = 5.26%.
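Note that these percentages follow the SMAPE variant whose denominator is |A_t| + |F_t| without the division by 2 (the variant used in the Wikipedia example). Either way, the absolute error is the same (10), but the over-forecast case has the larger denominator and therefore the smaller score, which is the asymmetry being illustrated:
$$
\frac{100 \cdot|110-100|}{100+110} \approx 4.76\%, \qquad \frac{100 \cdot|90-100|}{100+90} \approx 5.26\%
$$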