Reference: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data
Original data: train_1.csv
-----------------------------
rows = 145,063
columns = 551
first column = Page
date columns = 2015-07-01, 2015-07-02, ..., 2016-12-31 (550 columns)
file size: 284.6 MB
In a recurrent neural network we store the output activations from one or more layers of the network. Often these are hidden-layer activations. Then, the next time we feed an input example to the network, we include the previously stored outputs as additional inputs. You can think of the additional inputs as being concatenated to the end of the "normal" inputs to the previous layer. For example, if a hidden layer had 10 regular input nodes and 128 hidden nodes, it would actually have 138 total inputs (assuming you are feeding the layer's outputs back into itself à la Elman, rather than into another layer). Of course, the very first time you compute the output of the network you need to fill in those extra 128 inputs with zeros or something similar.
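To make the bookkeeping concrete, here is a minimal numpy sketch (an illustration of the idea, not part of this notebook's pipeline) of one such recurrent step, where the 128 stored activations are concatenated with the 10 regular inputs:

```python
import numpy as np

n_in, n_hidden = 10, 128                                # 10 "normal" inputs, 128 recurrent inputs
W = np.random.randn(n_hidden, n_in + n_hidden) * 0.01   # weights over all 138 inputs
b = np.zeros(n_hidden)

def elman_step(x_t, h_prev):
    """One Elman-style step: concatenate the input with the previous hidden state."""
    z = np.concatenate([x_t, h_prev])                   # shape (138,)
    return np.tanh(W @ z + b)                           # new hidden state, shape (128,)

h = np.zeros(n_hidden)                                  # first step: the extra inputs are filled with 0s
for x_t in np.random.randn(5, n_in):                    # a toy sequence of 5 time steps
    h = elman_step(x_t, h)
```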
Now, even though RNNs are quite powerful, they suffer from the vanishing gradient problem, which hinders them from using long-term information: they are good at remembering the past 3-4 time steps, but a larger number of steps does not give good results. So we don't just use regular RNNs. Instead, we use a better variation of RNNs: Long Short-Term Memory (LSTM) networks.
What is the vanishing gradient problem?
The vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent have gradients in the range (0, 1), and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute the gradients of the "front" layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly.
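A tiny numerical sketch (my own illustration, not from any reference) makes this concrete: multiplying a chain of tanh derivatives, each below 1, shrinks the gradient roughly exponentially with the number of layers or time steps the error has to travel back through:

```python
import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2        # derivative of tanh, always in (0, 1]

rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=1.5, scale=0.5, size=50)   # 50 layers / time steps
local_grads = tanh_grad(pre_activations)

for n in (5, 10, 25, 50):
    # gradient factor that survives after travelling back through n layers
    print(n, np.prod(local_grads[:n]))
# the product collapses towards 0 as n grows, so the "front" layers barely receive an error signal
```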
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron, as in a multi-layer (or feedforward) neural network: that is, they compute an activation (using an activation function) of a weighted sum. Intuitively, they can be thought of as regulators of the flow of values that goes through the connections of the LSTM; hence the designation "gate". There are connections between these gates and the cell.
The expression long short-term refers to the fact that LSTM is a model for short-term memory which can last for a long period of time. An LSTM is well-suited to classify, process and predict time series given time lags of unknown size and duration between important events. LSTMs were developed to deal with the exploding and vanishing gradient problems encountered when training traditional RNNs.
Components of LSTMs
The LSTM cell contains the following components:
Forget gate f, input gate I and output gate O (each gives values between 0 and 1)
Candidate state C~ and memory state C (vectors)
Inputs to the LSTM cell at any step are Xt (current input), Ht-1 (previous hidden state) and Ct-1 (previous memory state)
Outputs from the LSTM cell are Ht (current hidden state) and Ct (current memory state)
Working of gates in LSTMs
First, the LSTM cell takes the previous memory state Ct-1 and does element-wise multiplication with the forget gate (ft) to decide how much of the previous memory state to keep in the present memory state Ct. If the forget-gate value is 0, the previous memory state is completely forgotten; if it is 1, the previous memory state is passed through completely (remember, the forget gate gives values between 0 and 1).
Ct = Ct-1 * ft
Next, the candidate state C~t, scaled by the input gate It, is added to give the new memory state:
Ct = Ct + (It * C~t)
Finally, the output gate Ot decides how much of the memory state is exposed as the hidden state (the output):
Ht = Ot * tanh(Ct)
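Putting the gate equations together, a single forward step of an LSTM cell can be sketched in plain numpy (a toy illustration with random weights; the Keras LSTM layer used below handles the weights, biases and batching internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the stacked weights of the f, i, c~, o gates."""
    z = W @ x_t + U @ h_prev + b                 # shape (4*units,)
    f, i, c_tilde, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c_t = f * c_prev + i * np.tanh(c_tilde)      # forget old memory, add scaled candidate
    h_t = o * np.tanh(c_t)                       # expose part of the memory as the output
    return h_t, c_t

units, n_in = 4, 1
rng = np.random.default_rng(0)
W, U = rng.normal(size=(4 * units, n_in)), rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)          # initial hidden and memory states
h, c = lstm_step(np.array([0.5]), h, c, W, U, b)
```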
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (12,8)})
plt.style.use('fivethirtyeight') # better than sns styles.
matplotlib.rcParams['figure.figsize'] = 12,8
import os
import time
# random state
SEED=100
np.random.seed(SEED)
# Jupyter notebook settings for pandas
#pd.set_option('display.float_format', '{:,.2g}'.format) # numbers sep by comma
from pandas.api.types import CategoricalDtype
np.set_printoptions(precision=3)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 200)
import IPython
from IPython.display import display, HTML, Image, Markdown
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import tensorflow
import keras
from keras import optimizers
from keras.optimizers import SGD
from keras.models import Model, Sequential
from keras.layers import Dense, LSTM, Dropout, GRU, Bidirectional
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.layers import RepeatVector, TimeDistributed, Flatten
from keras.utils import plot_model
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
# versions
import watermark
%load_ext watermark
%watermark -a "Bhishan Poudel" -d -v -m
print()
%watermark -iv
The watermark extension is already loaded. To reload it, use: %reload_ext watermark
Bhishan Poudel 2020-10-17
CPython 3.7.7
IPython 7.13.0
compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.6.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
matplotlib 3.2.1
tensorflow 2.3.0
pandas 1.1.1
watermark 2.0.2
seaborn 0.10.1
keras 2.4.3
numpy 1.19.1
IPython 7.13.0
import sys
sys.path.append('../models')
from util_prophet import get_smape
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
def show_method_attributes(obj, ncols=7,start=None, inside=None):
""" Show all the attributes of a given method.
Example:
========
show_method_attributes(list)
"""
lst = [elem for elem in dir(obj) if elem[0]!='_' ]
lst = [elem for elem in lst
if elem not in 'os np pd sys time psycopg2'.split() ]
if isinstance(start,str):
lst = [elem for elem in lst if elem.startswith(start)]
if isinstance(start,tuple) or isinstance(start,list):
lst = [elem for elem in lst for start_elem in start
if elem.startswith(start_elem)]
if isinstance(inside,str):
lst = [elem for elem in lst if inside in elem]
if isinstance(inside,tuple) or isinstance(inside,list):
lst = [elem for elem in lst for inside_elem in inside
if inside_elem in elem]
return pd.DataFrame(np.array_split(lst,ncols)).T.fillna('')
def set_seed(seed):
import os
import random
import numpy as np
import tensorflow as tf
os.environ['PYTHONHASHSEED']=str(seed)
tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)
def get_mape(y_true, y_pred):
"Mean Absolute Percentage Error"
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def get_smape(y_true, y_pred):
"""Symmetric Mean Absolute Percentage Error
SMAPE = 200 * mean abs(A-F) A = actual
---------------- F = forecast
abs(A) + abs(F)
This ranges from 0 to 200 with 0 is best and 200 is worst.
"""
denom = (np.abs(y_true) + np.abs(y_pred))
ratio = np.abs(y_true - y_pred) / denom
ratio[denom == 0] = 0.0
return 200 * np.mean(ratio)
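A quick illustrative use of get_smape (my own toy numbers): days where both the actual and the forecast are 0 contribute 0 to the mean, which matters when a series contains zero-visit days.

```python
import numpy as np

y_true = np.array([20947.0, 0.0, 8587.0])
y_pred = np.array([21000.0, 0.0, 8000.0])
# the middle term has denom == 0; get_smape sets that ratio to 0 instead of nan
print(get_smape(y_true, y_pred))
```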
def timeseries_evaluation(model_name, desc, ytest, ypreds, df_eval=None,
show=False,sort_col='SMAPE'):
from statsmodels.tsa.stattools import acf
if df_eval is None:
df_eval = pd.DataFrame({'Model': [],
'Description':[],
'MAPE': [],
'SMAPE': [],
'RMSE': [],
'ME': [],
'MAE': [],
'MPE': [],
'CORR': [],
'MINMAX': [],
'ACF1': [],
})
mape = get_mape(ytest,ypreds) # MAPE
smape = get_smape(ytest,ypreds) # SMAPE
rmse = np.mean((ytest - ypreds)**2)**.5 # RMSE
me = np.mean(ytest - ypreds) # ME
mae = np.mean(np.abs(ytest - ypreds)) # MAE
mpe = np.mean((ytest - ypreds)/ytest) # MPE
corr = np.corrcoef(ytest, ypreds)[0, 1] # corr
mins = np.amin(np.hstack([ytest[:, None],
ypreds[:, None]]), axis=1)
maxs = np.amax(np.hstack([ytest[:, None],
ypreds[:, None]]), axis=1)
minmax = 1 - np.mean(mins/maxs) # minmax
acf1 = acf(ytest-ypreds, fft=False)[1] # ACF1 (autocorrelation function)
row = [model_name, desc, mape, smape,rmse, me,mae,mpe,corr,minmax,acf1]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
asc, cmap = True, 'Greens_r'
if sort_col == 'RMSE':
asc = False
cmap = 'Greens'
if sort_col == 'SMAPE':
asc = True
cmap = 'Greens_r'
df_eval = df_eval.sort_values(sort_col,ascending=asc)
df_eval = df_eval.reset_index(drop=True)
if show:
df_eval_style = (df_eval.style
.format({'MAPE': "{:,.0f}",
'SMAPE': "{:,.4f}",
'RMSE': "{:,.0f}",
'ME': "{:,.0f}",
'MAE': "{:,.0f}",
'MPE': "{:,.0f}",
'CORR': "{:,.4f}",
'MINMAX': "{:,.4f}",
'ACF1': "{:,.4f}"
})
.background_gradient(subset=[sort_col],cmap=cmap)
)
display(df_eval_style)
return df_eval
df_eval = None
%%time
df = pd.read_csv('../data/train_1.csv.zip',compression='zip')
cond = df['Page'] == "Prince_(musician)_en.wikipedia.org_all-access_all-agents"
df = df.loc[cond]
df = df.filter(regex="Page|2016")
df = df.melt(id_vars=['Page'],var_name='date',value_name='visits').drop('Page',axis=1)
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.shape)
df.head()
(366, 1)
CPU times: user 8.96 s, sys: 893 ms, total: 9.85 s
Wall time: 11.1 s
| date | visits |
|---|---|
| 2016-01-01 | 20947.0 |
| 2016-01-02 | 19466.0 |
| 2016-01-03 | 8587.0 |
| 2016-01-04 | 7386.0 |
| 2016-01-05 | 7719.0 |
ts = df['visits'].astype(np.float32)
ts.index = pd.to_datetime(ts.index,format='%Y-%m-%d')
ts.head()
date
2016-01-01    20947.0
2016-01-02    19466.0
2016-01-03     8587.0
2016-01-04     7386.0
2016-01-05     7719.0
Name: visits, dtype: float32
def create_Xlags_y(x, lags=1):
"""Create Xlag and y array from 1d timeseries x.
Parameters
----------
x : numpy array of shape (n, 1)
lags : integer
Returns
--------
Xlag -- numpy array with columns = lag
y -- 1d array (values after lag from x)
Example
-------
x = np.arange(6).reshape(-1,1)
lag = 2
y = np.array([2,3,4]) # two values 0 and 1 are excluded due to lag
Xlag = numpy array with 2 columns
np.array([[0, 1],
[1, 2],
[2, 3]])
"""
lst_Xlags, lst_y = [], []
for i in range(len(x)-lags-1):
a = x[i:(i+lags), 0]
lst_Xlags.append(a)
lst_y.append(x[i + lags, 0])
Xlags = np.array(lst_Xlags)
y = np.array(lst_y)
return Xlags, y
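As a quick sanity check (my own usage example), running the docstring example reproduces the arrays shown above; note that the extra -1 in the loop range also drops the last possible sample, so x[5] is never used as a target:

```python
import numpy as np

x = np.arange(6).reshape(-1, 1)
Xlags, y = create_Xlags_y(x, lags=2)
print(Xlags)    # [[0 1]
                #  [1 2]
                #  [2 3]]
print(y)        # [2 3 4]
```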
def train_test_split_with_lags(x,train_size=0.8,lags=1):
"""Train-test split for time series with lags.
If x is x_scaled then Xtrain and ytrain will also be scaled.
Returns
-------
Xtrain,ytrain,Xtest,ytest
"""
# split into train and test sets
num = int(len(x) * train_size)
train = x[0:num, :]
test = x[num:len(x), :]
# reshape into X=t and Y=t+1
Xtrain, ytrain = create_Xlags_y(train, lags)
Xtest, ytest = create_Xlags_y(test , lags)
return Xtrain,ytrain,Xtest,ytest
def model_lstm(Xtrain,ytrain,params):
# parameters
loss = params['loss']
optimizer = params['optimizer']
batch_size = params['batch_size']
dropout = params['dropout']
lags = params['lags']
units = params['units']
epochs = params['epochs']
input_shape= (1, lags)
# model
model = Sequential()
model.add(LSTM(units=units, return_sequences=True, input_shape=input_shape))
model.add(Dropout(dropout))
# # Second LSTM layer
# model.add(LSTM(units=units, return_sequences=True))
# model.add(Dropout(dropout))
# # Third LSTM layer
# model.add(LSTM(units=units, return_sequences=True))
# model.add(Dropout(dropout))
# 2nd last layer
model.add(LSTM(units=units))
model.add(Dropout(dropout))
# The output layer
model.add(Dense(units=1))
# Compiling the model
model.compile(loss=loss, optimizer=optimizer)
# Fitting to the training set
model.fit(Xtrain,ytrain,epochs=epochs,batch_size=batch_size,
shuffle=False,verbose=0,use_multiprocessing=True)
return model
def model_gru(Xtrain,ytrain,params):
batch_size = params['batch_size']
dropout = params['dropout']
optimizer = params['optimizer']
units = params['units']
activation = params['activation']
lags = params['lags']
loss = params['loss']
epochs = params['epochs']
input_shape=(1, lags)
# The GRU architecture
model = Sequential()
# First GRU layer with Dropout regularisation
model.add(GRU(units=units, return_sequences=True,
input_shape=input_shape, activation=activation,))
model.add(Dropout(dropout))
# # Second GRU layer
# model.add(GRU(units=units, return_sequences=True,
# input_shape=input_shape, activation=activation))
# model.add(Dropout(dropout))
# # Third GRU layer
# model.add(GRU(units=units, return_sequences=True,
# input_shape=input_shape, activation=activation))
# model.add(Dropout(dropout))
# 2nd last layer
model.add(GRU(units=units, activation=activation))
model.add(Dropout(dropout))
# The output layer
model.add(Dense(units=1))
# Compiling the RNN
model.compile(optimizer=optimizer,loss=loss)
# Fitting to the training set
model.fit(Xtrain,ytrain,epochs=epochs,
batch_size=batch_size,shuffle=False,verbose=0)
return model
def predict(model,Xtrain,Xtest,scaler=None):
# make predictions
ypreds_train = model.predict(Xtrain)
ypreds_test = model.predict(Xtest)
# invert predictions
if scaler is not None:
ypreds_train = scaler.inverse_transform(ypreds_train)
ypreds_test = scaler.inverse_transform(ypreds_test)
return ypreds_train, ypreds_test
def prediction_plot(x,lags,ypreds_train,ypreds_test):
# shift train predictions for plotting
ypred_train_plot = np.empty_like(x)
ypred_train_plot[:, :] = np.nan
ypred_train_plot[lags:len(ypreds_train)+lags, :] = ypreds_train
# shift test predictions for plotting
ypreds_test_plot = np.empty_like(x)
ypreds_test_plot[:, :] = np.nan
ypreds_test_plot[len(ypreds_train)+(lags*2) +
1:len(x)-1, :] = ypreds_test
# plot baseline and predictions
plt.plot(x, label='data')
plt.plot(ypred_train_plot, label='ypreds train')
plt.plot(ypreds_test_plot, label='ypreds test')
plt.legend()
plt.show()
def run_model(mymodel,model_name,desc,params,scaler=None,ts=ts,
SEED=SEED,
df_eval=df_eval,show=True):
# seed
set_seed(SEED)
# timeseries data
x = ts.to_numpy().reshape(-1,1)
x = x.astype('float32')
x_scaled = x
# scaling
if scaler is not None:
x_scaled = scaler.fit_transform(x)
# train test split
Xtrain,ytrain,Xtest,ytest = train_test_split_with_lags(x_scaled,
lags=params['lags'])
# reshape input to be [samples, time steps, features]
Xtrain = np.reshape(Xtrain, (Xtrain.shape[0], 1, Xtrain.shape[1]))
Xtest = np.reshape(Xtest , (Xtest.shape[0] , 1, Xtest.shape[1]))
# fit the model
model = mymodel(Xtrain,ytrain,params)
# preds
ypreds_train, ypreds_test = predict(model,Xtrain,Xtest,scaler=scaler)
if scaler is not None:
ytrain = scaler.inverse_transform(ytrain.reshape(-1,1))
ytest = scaler.inverse_transform(ytest.reshape(-1,1))
df_eval = timeseries_evaluation(model_name, desc, ytest.ravel(),
ypreds_test.ravel(),
df_eval=df_eval,show=show)
return dict(df_eval=df_eval,
x=x,
ytrain=ytrain,
ypreds_train=ypreds_train,
ytest=ytest,
ypreds_test=ypreds_test)
# parameters
params_lstm = dict(loss='mean_squared_error',
optimizer='rmsprop',
batch_size=1,
dropout=0.2,
epochs=100,
units=4,
lags=2,
)
params_gru = dict(loss='mean_squared_error',
optimizer=SGD(
lr=0.01, decay=1e-7, momentum=0.9,nesterov=False),
activation='tanh',
batch_size=1,
dropout=0.2,
epochs=100,
units=10,
lags=2,
)
df_eval = None
scaler = MinMaxScaler(feature_range=(0, 1))
res = run_model(model_lstm,'lstm','lags=2',params_lstm,scaler=None,ts=ts,
SEED=SEED,
df_eval=df_eval,show=True)
df_eval = res['df_eval']
| | Model | Description | MAPE | SMAPE | RMSE | ME | MAE | MPE | CORR | MINMAX | ACF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lstm | lags=2 | 99 | 197.2470 | 13,684 | 12,021 | 12,021 | 1 | 0.0502 | 0.9931 | 0.6727 |
%%time
res = run_model(model_lstm,'lstm','lags=2,minmax-scaler',params_lstm,scaler=scaler,
ts=ts,
df_eval=df_eval,show=True)
df_eval = res['df_eval']
| | Model | Description | MAPE | SMAPE | RMSE | ME | MAE | MPE | CORR | MINMAX | ACF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lstm | lags=2,minmax-scaler | 25 | 24.6649 | 6,524 | 353 | 3,482 | -0 | 0.6507 | 0.2056 | 0.6702 |
| 1 | lstm | lags=2 | 99 | 197.2470 | 13,684 | 12,021 | 12,021 | 1 | 0.0502 | 0.9931 | 0.6727 |
CPU times: user 1min 4s, sys: 5.48 s, total: 1min 9s Wall time: 42.5 s
%%time
res = run_model(model_gru,'gru','lags=2',params_gru,scaler=None,
ts=ts,
df_eval=df_eval,show=True)
df_eval = res['df_eval']
/Users/poudel/opt/miniconda3/envs/tf2/lib/python3.7/site-packages/numpy/lib/function_base.py:2559: RuntimeWarning: invalid value encountered in true_divide
/Users/poudel/opt/miniconda3/envs/tf2/lib/python3.7/site-packages/numpy/lib/function_base.py:2560: RuntimeWarning: invalid value encountered in true_divide
| | Model | Description | MAPE | SMAPE | RMSE | ME | MAE | MPE | CORR | MINMAX | ACF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lstm | lags=2,minmax-scaler | 25 | 24.6649 | 6,524 | 353 | 3,482 | -0 | 0.6507 | 0.2056 | 0.6702 |
| 1 | gru | lags=2 | 40 | 53.1378 | 8,700 | 5,739 | 5,739 | 0 | nan | 0.4031 | 0.6727 |
| 2 | lstm | lags=2 | 99 | 197.2470 | 13,684 | 12,021 | 12,021 | 1 | 0.0502 | 0.9931 | 0.6727 |
CPU times: user 1min 3s, sys: 6.07 s, total: 1min 9s Wall time: 42.6 s
%%time
res = run_model(model_gru,'gru','lags=2,minmax-scaling',params_gru,scaler=scaler,
ts=ts,
df_eval=df_eval,show=True)
df_eval = res['df_eval']
| | Model | Description | MAPE | SMAPE | RMSE | ME | MAE | MPE | CORR | MINMAX | ACF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lstm | lags=2,minmax-scaler | 25 | 24.6649 | 6,524 | 353 | 3,482 | -0 | 0.6507 | 0.2056 | 0.6702 |
| 1 | gru | lags=2 | 40 | 53.1378 | 8,700 | 5,739 | 5,739 | 0 | nan | 0.4031 | 0.6727 |
| 2 | gru | lags=2,minmax-scaling | 58 | 83.9143 | 8,932 | 7,146 | 7,192 | 1 | 0.5818 | 0.5815 | 0.0733 |
| 3 | lstm | lags=2 | 99 | 197.2470 | 13,684 | 12,021 | 12,021 | 1 | 0.0502 | 0.9931 | 0.6727 |
CPU times: user 1min 1s, sys: 7.18 s, total: 1min 8s Wall time: 40 s
# prediction plot
prediction_plot(res['x'],params_gru['lags'],res['ypreds_train'],res['ypreds_test'])
The formula for SMAPE (Symmetric Mean Absolute Percentage Error) is given below:
$$ \mathrm{SMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{\left|F_{t}-A_{t}\right|}{\left(\left|A_{t}\right|+\left|F_{t}\right|\right)/2} $$

where F is the forecast and A is the actual value of the time series at time t.
Python implementation:
def smape(A, F):
    F = F[:len(A)]  # align the forecast length with the actuals
    return (200.0 / len(A) * np.sum(np.abs(F - A) /
            (np.abs(A) + np.abs(F) + np.finfo(float).eps)))
Despite the name Symmetric, SMAPE is not actually symmetric: over- and under-forecasts are not treated equally. Applying the SMAPE formula above to an example adapted from Wikipedia:
Over-forecasting : At = 100 and Ft = 110 gives SMAPE = 9.52%
Under-forecasting: At = 100 and Ft = 90 gives SMAPE = 10.53%
(Wikipedia quotes 4.76% and 5.26% for this example, which correspond to the halved, 0-100 variant of the formula.)
SMAPE lies between 0 and 200.
0 is the best score and 200 is the worst.
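A quick check of these two cases with the get_smape helper defined earlier (values on the 0-200 scale used throughout this notebook):

```python
import numpy as np

print(get_smape(np.array([100.0]), np.array([110.0])))   # ~9.52  (over-forecast)
print(get_smape(np.array([100.0]), np.array([90.0])))    # ~10.53 (under-forecast)
```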