Reference: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data
train_1.csv:
rows = 145,063
columns = 551
first column = Page
date columns = 2015-07-01, 2015-07-02, ..., 2016-12-31 (550 columns)
file size: 284.6 MB
Date columns:
------------------
Jul/2015 - 31 days
Aug/2015 - 31 days
Sep/2015 - 30 days
Oct/2015 - 31 days
Nov/2015 - 30 days
Dec/2015 - 31 days
Total : 184 days
Year 2016 : 366 days (leap year)
Total : 550 days
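A quick sanity check of this count (a standalone sketch using pandas):
# the 550 date columns span 2015-07-01 .. 2016-12-31 inclusive
import pandas as pd
dates = pd.date_range('2015-07-01', '2016-12-31', freq='D')
print(len(dates))                   # 550
print((dates.year == 2016).sum())   # 366 (leap year)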
NOTE:
For this dataset, missing values are stored as empty cells (read as NaN) and are treated as 0 in this notebook.
The training dataset consists of approximately 145k time series. Each of these time series represents the number of daily views of a different Wikipedia article, from July 1st, 2015 up until December 31st, 2016. The leaderboard during the training stage is based on traffic from January 1st, 2017 up until March 1st, 2017.
The second stage will use training data up until September 1st, 2017. The final ranking of the competition will be based on predictions of daily views between September 13th, 2017 and November 13th, 2017 for each article in the dataset. You will submit your forecasts for these dates by September 12th.
For each time series, you are provided the name of the article as well as the type of traffic that the time series represents (all, mobile, desktop, spider). You may use this metadata and any other publicly available data to make predictions. Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.
To reduce the submission file size, each page and date combination has been given a shorter Id. The mapping between page names and the submission Id is given in the key files.
# mkdir ~/.kaggle
# !echo '<PASTE_CONTENTS_OF_KAGGLE_API_JSON>' > ~/.kaggle/kaggle.json
# once ~/.kaggle/kaggle.json exists in Colab, we can install the kaggle module.
!chmod 600 ~/.kaggle/kaggle.json
!head -c 20 ~/.kaggle/kaggle.json
{"username":"bhishan
%%capture
# %%capture suppresses this cell's output in the notebook
import os
import sys
ENV_COLAB = 'google.colab' in sys.modules
if ENV_COLAB:
    ## install modules
    !pip install watermark
    !pip install fsspec
    !pip install dask[dataframe]

    ## create project-like folders
    !mkdir -p ../data ../outputs ../images ../reports ../html ../models

    !pip install kaggle
# !kaggle competitions files -c web-traffic-time-series-forecasting
# !kaggle competitions download -c web-traffic-time-series-forecasting -f train_1.csv.zip -p ../data/
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import os
import time
time_start_notebook = time.time()
# random state
SEED=100
np.random.seed(SEED) # we need this in each cell
# Jupyter notebook settings for pandas
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', '{:,.2g}'.format) # comma as thousands separator, 2 significant digits
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 100)
import re
import dask
import dask.dataframe as dd
import gc
# versions
import watermark
%load_ext watermark
%watermark -a "Bhishan Poudel" -d -v -m
print()
%watermark -iv
Bhishan Poudel 2020-10-14

CPython 3.7.7
IPython 7.18.1

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.6.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

autopep8 1.5.2
json 2.0.9
dask 2.13.0
re 2.2.1
watermark 2.0.2
matplotlib 3.2.1
seaborn 0.11.0
numpy 1.18.4
pandas 1.1.0
def show_method_attributes(method, ncols=7):
    """Show all the attributes of a given method.

    Example:
    ========
    show_method_attributes(list)
    """
    x = [i for i in dir(method) if i[0].islower()]
    x = [i for i in x if i not in 'os np pd sys time psycopg2'.split()]
    return pd.DataFrame(np.array_split(x, ncols)).T.fillna('')
def json_dump_tofile(myjson, ofile, sort_keys=False):
    """Write a json dictionary to a data file.

    Usage:
    myjson = {'num': 5, 'my_list': [1, 2, 'apple']}
    json_dump_tofile(myjson, ofile)
    """
    import io
    import json
    with io.open(ofile, 'w', encoding='utf8') as fo:
        json_str = json.dumps(myjson,
                              indent=4,
                              sort_keys=sort_keys,
                              separators=(',', ': '),
                              ensure_ascii=False)
        fo.write(str(json_str))
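A minimal usage sketch (the output path is only an example):
# write a small dictionary as pretty-printed JSON
myjson = {'num': 5, 'my_list': [1, 2, 'apple']}
json_dump_tofile(myjson, '../outputs/example.json')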
!ls ../data
most_visited_2016.csv train_1.csv.zip train_1_01?raw=true train_1_02?raw=true train_1_03?raw=true
df = pd.read_csv('../data/train_1.csv.zip',compression='zip',encoding='latin-1')
print(df.shape)
display(df.head())
(145063, 551)
Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | 2015-08-06 | 2015-08-07 | 2015-08-08 | 2015-08-09 | 2015-08-10 | 2015-08-11 | 2015-08-12 | 2015-08-13 | 2015-08-14 | 2015-08-15 | 2015-08-16 | 2015-08-17 | 2015-08-18 | ... | 2016-11-12 | 2016-11-13 | 2016-11-14 | 2016-11-15 | 2016-11-16 | 2016-11-17 | 2016-11-18 | 2016-11-19 | 2016-11-20 | 2016-11-21 | 2016-11-22 | 2016-11-23 | 2016-11-24 | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2NE1_zh.wikipedia.org_all-access_spider | 18 | 11 | 5 | 13 | 14 | 9 | 9 | 22 | 26 | 24 | 19 | 10 | 14 | 15 | 8 | 16 | 8 | 8 | 16 | 7 | 11 | 10 | 20 | 18 | 15 | 14 | 49 | 10 | 16 | 18 | 8 | 5 | 9 | 7 | 13 | 9 | 7 | 4 | 11 | 10 | 5 | 9 | 9 | 9 | 9 | 13 | 4 | 15 | 25 | ... | 13 | 8 | 15 | 14 | 12 | 6 | 11 | 10 | 42 | 21 | 24 | 14 | 11 | 2e+02 | 14 | 45 | 33 | 28 | 18 | 14 | 47 | 15 | 14 | 18 | 20 | 14 | 16 | 14 | 20 | 60 | 22 | 15 | 17 | 19 | 18 | 21 | 21 | 47 | 65 | 17 | 32 | 63 | 15 | 26 | 14 | 20 | 22 | 19 | 18 | 20 |
1 | 2PM_zh.wikipedia.org_all-access_spider | 11 | 14 | 15 | 18 | 11 | 13 | 22 | 11 | 10 | 4 | 41 | 65 | 57 | 38 | 20 | 62 | 44 | 15 | 10 | 47 | 24 | 17 | 22 | 9 | 39 | 13 | 11 | 12 | 21 | 19 | 9 | 15 | 33 | 8 | 8 | 7 | 13 | 2 | 23 | 12 | 27 | 27 | 36 | 23 | 58 | 80 | 60 | 69 | 42 | ... | 12 | 11 | 14 | 28 | 23 | 20 | 9 | 12 | 11 | 14 | 14 | 15 | 15 | 11 | 20 | 13 | 19 | 6.2e+02 | 57 | 17 | 23 | 19 | 21 | 47 | 28 | 22 | 22 | 65 | 27 | 17 | 17 | 13 | 9 | 18 | 22 | 17 | 15 | 22 | 23 | 19 | 17 | 42 | 28 | 15 | 9 | 30 | 52 | 45 | 26 | 20 |
2 | 3C_zh.wikipedia.org_all-access_spider | 1 | 0 | 1 | 1 | 0 | 4 | 0 | 3 | 4 | 4 | 1 | 1 | 1 | 6 | 8 | 6 | 4 | 5 | 1 | 2 | 3 | 8 | 8 | 6 | 6 | 2 | 2 | 3 | 2 | 4 | 3 | 3 | 5 | 3 | 5 | 4 | 2 | 5 | 1 | 4 | 5 | 0 | 0 | 7 | 3 | 5 | 1 | 6 | 2 | ... | 6 | 4 | 2 | 4 | 6 | 5 | 4 | 4 | 3 | 3 | 9 | 3 | 5 | 4 | 0 | 1 | 4 | 5 | 8 | 8 | 1 | 1 | 2 | 5 | 3 | 3 | 3 | 7 | 3 | 9 | 8 | 3 | 2.1e+02 | 5 | 4 | 6 | 2 | 2 | 4 | 3 | 3 | 1 | 1 | 7 | 4 | 4 | 6 | 3 | 4 | 17 |
3 | 4minute_zh.wikipedia.org_all-access_spider | 35 | 13 | 10 | 94 | 4 | 26 | 14 | 9 | 11 | 16 | 16 | 11 | 23 | 1.4e+02 | 14 | 17 | 85 | 4 | 30 | 22 | 9 | 10 | 11 | 7 | 7 | 11 | 9 | 11 | 44 | 8 | 14 | 19 | 10 | 17 | 17 | 10 | 7 | 10 | 1 | 8 | 27 | 19 | 16 | 2 | 84 | 22 | 14 | 47 | 25 | ... | 38 | 13 | 14 | 17 | 26 | 14 | 10 | 9 | 23 | 15 | 7 | 10 | 7 | 10 | 14 | 17 | 11 | 9 | 11 | 5 | 10 | 8 | 17 | 13 | 23 | 40 | 16 | 17 | 41 | 17 | 8 | 9 | 18 | 12 | 12 | 18 | 13 | 18 | 23 | 10 | 32 | 10 | 26 | 27 | 16 | 11 | 17 | 19 | 10 | 11 |
4 | 52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | ... | 13 | 11 | 8 | 6 | 10 | 14 | 6 | 9 | 6 | 16 | 14 | 13 | 15 | 14 | 16 | 9 | 1.8e+02 | 64 | 12 | 10 | 11 | 6 | 8 | 7 | 9 | 8 | 5 | 11 | 8 | 4 | 15 | 5 | 8 | 8 | 6 | 7 | 15 | 4 | 11 | 7 | 48 | 9 | 25 | 13 | 3 | 11 | 27 | 13 | 36 | 10 |
5 rows × 551 columns
# the competition description says zeros and NaNs are indistinguishable, so fill NaNs with 0
df = df.fillna(0)
df.memory_usage(deep=True).sum() * 1e-6 # MB
658.502795
df.iloc[:,1:].max().max()
67264258.0
df.dtypes
Page           object
2015-07-01    float64
2015-07-02    float64
2015-07-03    float64
2015-07-04    float64
               ...
2016-12-27    float64
2016-12-28    float64
2016-12-29    float64
2016-12-30    float64
2016-12-31    float64
Length: 551, dtype: object
np.iinfo(np.int32).max
2147483647
np.iinfo(np.int32).max > 67264258.0
True
# int32 can hold the maximum value, so use it as the dtype
%%time
df.iloc[:,1:] = df.iloc[:,1:].astype(np.int32)
CPU times: user 52.4 s, sys: 42.4 s, total: 1min 34s Wall time: 1min 39s
df.memory_usage(deep=True).sum() * 1e-6 # MB
339.364195
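An alternative (a sketch, not what is done above) is to let pandas pick the smallest integer type column by column with pd.to_numeric:
# downcast every date column to the smallest integer dtype that holds its values
# (columns whose maxima are small can end up as int16 or even int8)
df_small = pd.concat(
    [df[['Page']], df.iloc[:, 1:].apply(pd.to_numeric, downcast='integer')],
    axis=1)
df_small.memory_usage(deep=True).sum() * 1e-6  # MB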
df.iloc[:2, [0,1]]
Page | 2015-07-01 | |
---|---|---|
0 | 2NE1_zh.wikipedia.org_all-access_spider | 18 |
1 | 2PM_zh.wikipedia.org_all-access_spider | 11 |
t1 = pd.Timestamp('2015-07-01')
t1
Timestamp('2015-07-01 00:00:00')
t2 = pd.Timestamp('2016-01-01')
t2-t1
Timedelta('184 days 00:00:00')
diff = (t2-t1).days
diff
184
df.iloc[:2, [0, diff+1, diff+366, -1]] # column diff+1 is 2016-01-01 and diff+366 is 2016-12-31, the last column (2016 is a leap year)
Page | 2016-01-01 | 2016-12-31 | 2016-12-31 | |
---|---|---|---|---|
0 | 2NE1_zh.wikipedia.org_all-access_spider | 9 | 20 | 20 |
1 | 2PM_zh.wikipedia.org_all-access_spider | 7 | 20 | 20 |
df = df.iloc[:, np.r_[0, diff+1:diff+1+366]]  # keep Page plus the 366 columns of 2016 (the slice end is exclusive)
df.head()
Page | 2016-01-01 | 2016-01-02 | 2016-01-03 | 2016-01-04 | 2016-01-05 | 2016-01-06 | 2016-01-07 | 2016-01-08 | 2016-01-09 | 2016-01-10 | 2016-01-11 | 2016-01-12 | 2016-01-13 | 2016-01-14 | 2016-01-15 | 2016-01-16 | 2016-01-17 | 2016-01-18 | 2016-01-19 | 2016-01-20 | 2016-01-21 | 2016-01-22 | 2016-01-23 | 2016-01-24 | 2016-01-25 | 2016-01-26 | 2016-01-27 | 2016-01-28 | 2016-01-29 | 2016-01-30 | 2016-01-31 | 2016-02-01 | 2016-02-02 | 2016-02-03 | 2016-02-04 | 2016-02-05 | 2016-02-06 | 2016-02-07 | 2016-02-08 | 2016-02-09 | 2016-02-10 | 2016-02-11 | 2016-02-12 | 2016-02-13 | 2016-02-14 | 2016-02-15 | 2016-02-16 | 2016-02-17 | 2016-02-18 | ... | 2016-11-12 | 2016-11-13 | 2016-11-14 | 2016-11-15 | 2016-11-16 | 2016-11-17 | 2016-11-18 | 2016-11-19 | 2016-11-20 | 2016-11-21 | 2016-11-22 | 2016-11-23 | 2016-11-24 | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2NE1_zh.wikipedia.org_all-access_spider | 9 | 16 | 6 | 19 | 20 | 19 | 22 | 30 | 14 | 16 | 22 | 15 | 15 | 26 | 16 | 13 | 27 | 18 | 13 | 32 | 31 | 16 | 38 | 18 | 9 | 14 | 10 | 24 | 8 | 15 | 18 | 10 | 23 | 17 | 11 | 26 | 14 | 8 | 12 | 9 | 11 | 34 | 17 | 29 | 11 | 9 | 14 | 21 | 12 | ... | 13 | 8 | 15 | 14 | 12 | 6 | 11 | 10 | 42 | 21 | 24 | 14 | 11 | 204 | 14 | 45 | 33 | 28 | 18 | 14 | 47 | 15 | 14 | 18 | 20 | 14 | 16 | 14 | 20 | 60 | 22 | 15 | 17 | 19 | 18 | 21 | 21 | 47 | 65 | 17 | 32 | 63 | 15 | 26 | 14 | 20 | 22 | 19 | 18 | 20 |
1 | 2PM_zh.wikipedia.org_all-access_spider | 7 | 15 | 14 | 14 | 11 | 13 | 12 | 12 | 24 | 15 | 38 | 18 | 26 | 15 | 12 | 14 | 40 | 19 | 13 | 39 | 19 | 16 | 19 | 11 | 76 | 14 | 19 | 26 | 19 | 17 | 30 | 17 | 17 | 17 | 19 | 11 | 175 | 10 | 5 | 12 | 7 | 12 | 14 | 19 | 11 | 19 | 17 | 15 | 19 | ... | 12 | 11 | 14 | 28 | 23 | 20 | 9 | 12 | 11 | 14 | 14 | 15 | 15 | 11 | 20 | 13 | 19 | 621 | 57 | 17 | 23 | 19 | 21 | 47 | 28 | 22 | 22 | 65 | 27 | 17 | 17 | 13 | 9 | 18 | 22 | 17 | 15 | 22 | 23 | 19 | 17 | 42 | 28 | 15 | 9 | 30 | 52 | 45 | 26 | 20 |
2 | 3C_zh.wikipedia.org_all-access_spider | 2 | 0 | 3 | 3 | 3 | 4 | 4 | 8 | 3 | 5 | 8 | 1 | 4 | 0 | 3 | 6 | 3 | 1 | 3 | 3 | 3 | 1 | 3 | 8 | 4 | 3 | 2 | 5 | 6 | 3 | 6 | 5 | 6 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 4 | 3 | 3 | 9 | 4 | 7 | 5 | 10 | ... | 6 | 4 | 2 | 4 | 6 | 5 | 4 | 4 | 3 | 3 | 9 | 3 | 5 | 4 | 0 | 1 | 4 | 5 | 8 | 8 | 1 | 1 | 2 | 5 | 3 | 3 | 3 | 7 | 3 | 9 | 8 | 3 | 210 | 5 | 4 | 6 | 2 | 2 | 4 | 3 | 3 | 1 | 1 | 7 | 4 | 4 | 6 | 3 | 4 | 17 |
3 | 4minute_zh.wikipedia.org_all-access_spider | 7 | 7 | 11 | 7 | 14 | 9 | 21 | 9 | 10 | 13 | 10 | 13 | 16 | 8 | 10 | 7 | 13 | 18 | 8 | 50 | 8 | 33 | 6 | 22 | 9 | 84 | 28 | 11 | 7 | 14 | 16 | 49 | 71 | 29 | 22 | 6 | 34 | 16 | 14 | 9 | 12 | 24 | 18 | 8 | 26 | 8 | 8 | 13 | 21 | ... | 38 | 13 | 14 | 17 | 26 | 14 | 10 | 9 | 23 | 15 | 7 | 10 | 7 | 10 | 14 | 17 | 11 | 9 | 11 | 5 | 10 | 8 | 17 | 13 | 23 | 40 | 16 | 17 | 41 | 17 | 8 | 9 | 18 | 12 | 12 | 18 | 13 | 18 | 23 | 10 | 32 | 10 | 26 | 27 | 16 | 11 | 17 | 19 | 10 | 11 |
4 | 52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 13 | 11 | 8 | 6 | 10 | 14 | 6 | 9 | 6 | 16 | 14 | 13 | 15 | 14 | 16 | 9 | 178 | 64 | 12 | 10 | 11 | 6 | 8 | 7 | 9 | 8 | 5 | 11 | 8 | 4 | 15 | 5 | 8 | 8 | 6 | 7 | 15 | 4 | 11 | 7 | 48 | 9 | 25 | 13 | 3 | 11 | 27 | 13 | 36 | 10 |
5 rows × 367 columns
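The same selection could also be made by column name rather than by position (a sketch; it assumes df still has all 551 columns, i.e. it would replace the np.r_ slice above):
# name-based alternative to the positional slice: Page plus every 2016-* column
df_2016 = df.filter(regex=r'^Page$|^2016')
print(df_2016.shape)  # (145063, 367)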
df = df.sample(frac=0.1, random_state=SEED)  # keep a 10% sample of the pages (~14.5k series)
df = df.melt(id_vars=['Page'],var_name='date',value_name='visits')
print(df.shape)
df.head()
(5309196, 3)
Page | date | visits | |
---|---|---|---|
0 | Sean_Connery_en.wikipedia.org_desktop_all-agents | 2016-01-01 | 4872 |
1 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents | 2016-01-01 | 6 |
2 | The_Undertaker_fr.wikipedia.org_mobile-web_all-agents | 2016-01-01 | 469 |
3 | Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents | 2016-01-01 | 142 |
4 | Камызяк_ru.wikipedia.org_all-access_all-agents | 2016-01-01 | 6692 |
df['date'] = pd.to_datetime(df['date'])
df.dtypes
Page              object
date      datetime64[ns]
visits             int32
dtype: object
show_method_attributes(df['date'].dt)
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | ceil | days_in_month | is_month_end | isocalendar | normalize | timetz | tz_localize |
1 | date | daysinmonth | is_month_start | microsecond | quarter | to_period | week |
2 | day | floor | is_quarter_end | minute | round | to_pydatetime | weekday |
3 | day_name | freq | is_quarter_start | month | second | tz | weekofyear |
4 | dayofweek | hour | is_year_end | month_name | strftime | tz_convert | year |
5 | dayofyear | is_leap_year | is_year_start | nanosecond |
%%time
# these values could be made categorical to reduce memory,
# but the modelling step later needs plain numpy arrays.
df['year'] = df['date'].dt.year # yyyy
df['month'] = df['date'].dt.month # 1 to 12
df['day'] = df['date'].dt.day # 1 to 31
df['quarter'] = df['date'].dt.quarter # 1 to 4
df['dayofweek'] = df['date'].dt.dayofweek # 0 to 6
df['dayofyear'] = df['date'].dt.dayofyear # 1 to 366 (leap year)
df['day_name'] = df['date'].dt.day_name() # Monday
df['month_name'] = df['date'].dt.month_name() # January
df['weekend'] = ((df['date'].dt.dayofweek) // 5 == 1)  # True for Saturday/Sunday
df['weekday'] = ((df['date'].dt.dayofweek) // 5 != 1)  # True for Monday-Friday
CPU times: user 3.66 s, sys: 136 ms, total: 3.8 s Wall time: 3.81 s
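As the comment above notes, the two name columns could be stored as pandas categoricals to cut memory; a sketch of what that would look like:
# low-cardinality strings (7 day names, 12 month names) compress well as categoricals
for col in ['day_name', 'month_name']:
    df[col] = df[col].astype('category')
df.memory_usage(deep=True).sum() * 1e-6  # MB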
df.head(20)
Page | date | visits | year | month | day | quarter | dayofweek | dayofyear | day_name | month_name | weekend | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sean_Connery_en.wikipedia.org_desktop_all-agents | 2016-01-01 | 4872 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
1 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents | 2016-01-01 | 6 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
2 | The_Undertaker_fr.wikipedia.org_mobile-web_all-agents | 2016-01-01 | 469 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
3 | Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents | 2016-01-01 | 142 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
4 | Камызяк_ru.wikipedia.org_all-access_all-agents | 2016-01-01 | 6692 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
5 | File:PioneerSodHouse-WheatRidgeCO.jpg_commons.wikimedia.org_desktop_all-agents | 2016-01-01 | 0 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
6 | Международная_космическая_станция_ru.wikipedia.org_all-access_spider | 2016-01-01 | 33 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
7 | Volleyball_at_the_2016_Summer_Olympics_–_Men's_tournament_en.wikipedia.org_all-access_all-agents | 2016-01-01 | 123 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
8 | Marianne_James_fr.wikipedia.org_all-access_all-agents | 2016-01-01 | 424 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
9 | 頑童MJ116_zh.wikipedia.org_desktop_all-agents | 2016-01-01 | 254 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
10 | 龍八夷_zh.wikipedia.org_desktop_all-agents | 2016-01-01 | 1149 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
11 | The_Lego_Ninjago_Movie_en.wikipedia.org_all-access_all-agents | 2016-01-01 | 0 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
12 | 囲碁_ja.wikipedia.org_desktop_all-agents | 2016-01-01 | 179 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
13 | Napoleón_Bonaparte_es.wikipedia.org_mobile-web_all-agents | 2016-01-01 | 1415 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
14 | The_First_Avenger:_Civil_War_de.wikipedia.org_all-access_all-agents | 2016-01-01 | 1189 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
15 | Диссоциальное_расстройство_личности_ru.wikipedia.org_all-access_all-agents | 2016-01-01 | 1971 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
16 | MediaWiki:Sitenotice-translation_commons.wikimedia.org_desktop_all-agents | 2016-01-01 | 1 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
17 | Daniel_Radcliffe_fr.wikipedia.org_mobile-web_all-agents | 2016-01-01 | 968 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
18 | Gotham_(série_télévisée)_fr.wikipedia.org_all-access_all-agents | 2016-01-01 | 1482 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
19 | Manual:Update.php_www.mediawiki.org_all-access_spider | 2016-01-01 | 5 | 2016 | 1 | 1 | 1 | 4 | 1 | Friday | January | False | True |
%%time
df['mean'] = df.groupby('Page')['visits'].transform('mean')      # per-page mean of daily visits
df['median'] = df.groupby('Page')['visits'].transform('median')  # per-page median of daily visits
CPU times: user 2.06 s, sys: 7.66 ms, total: 2.07 s Wall time: 2.08 s
Example of one Page value:
2NE1_zh.wikipedia.org_all-access_spider
which follows the pattern name_project_access_agent
df['Page'].head(10)
0    Sean_Connery_en.wikipedia.org_desktop_all-agents
1    Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents
2    The_Undertaker_fr.wikipedia.org_mobile-web_all-agents
3    Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents
4    Камызяк_ru.wikipedia.org_all-access_all-agents
5    File:PioneerSodHouse-WheatRidgeCO.jpg_commons.wikimedia.org_desktop_all-agents
6    Международная_космическая_станция_ru.wikipedia.org_all-access_spider
7    Volleyball_at_the_2016_Summer_Olympics_–_Men's_tournament_en.wikipedia.org_all-access_all-agents
8    Marianne_James_fr.wikipedia.org_all-access_all-agents
9    頑童MJ116_zh.wikipedia.org_desktop_all-agents
Name: Page, dtype: object
regex = ( r'(.+)_' # name
r'(.+)_' # project
r'(.+)_' # access
r'(.+)' # agent
)
lang_map ={'en':'English','ja':'Japanese','de':'German',
'www':'Media','fr':'French','zh':'Chinese',
'ru':'Russian','es':'Spanish','commons': 'Media'
}
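Because all four groups are greedy, backtracking puts the split exactly at the last three underscores, so an article name may itself contain underscores as long as project, access and agent never do. A quick check on one of the pages seen above:
# the first group keeps the underscores inside the article name
import re
m = re.match(regex, 'Sean_Connery_en.wikipedia.org_desktop_all-agents')
print(m.groups())
# ('Sean_Connery', 'en.wikipedia.org', 'desktop', 'all-agents')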
# another way (string splitting instead of a regex):
#
# df['agent']   = df['Page'].str.split('_').str[-1]
# df['access']  = df['Page'].str.split('_').str[-2]
# df['project'] = df['Page'].str.split('_').str[-3]
# df['name']    = df['Page'].str.split('_').str[:-3].str.join('_')
def myfunc(df):
    df = df.copy()
    df[['name','project','access','agent']] = df['Page'].str.extract(regex, expand=True)
    df['lang'] = df['project'].str.split('.').str[0]
    df['language'] = df['lang'].map(lang_map)
    return df
import dask.dataframe as dd
import gc
ddf = dd.from_pandas(df, npartitions=40)
def dask_apply():
    return ddf.map_partitions(myfunc).compute()
df = dask_apply()
del ddf
gc.collect()
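The dask round trip is optional at this size; it is equivalent to calling the function directly in pandas. If dask is kept, passing meta= tells map_partitions the output schema instead of letting it infer one (a sketch; df_parsed is just a throwaway name):
# plain-pandas equivalent of the cell above:
# df_parsed = myfunc(df)

# dask variant with an explicit output schema taken from a one-row sample
ddf = dd.from_pandas(df, npartitions=40)
meta = myfunc(df.iloc[:1])  # fixes the output column names and dtypes
df_parsed = ddf.map_partitions(myfunc, meta=meta).compute()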
df.head()
Page | date | visits | name | project | access | agent | lang | language | |
---|---|---|---|---|---|---|---|---|---|
0 | Sean_Connery_en.wikipedia.org_desktop_all-agents | 2016-01-01 | 4872 | Sean_Connery | en.wikipedia.org | desktop | all-agents | en | English |
1 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents | 2016-01-01 | 6 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008 | fr.wikipedia.org | desktop | all-agents | fr | French |
2 | The_Undertaker_fr.wikipedia.org_mobile-web_all-agents | 2016-01-01 | 469 | The_Undertaker | fr.wikipedia.org | mobile-web | all-agents | fr | French |
3 | Category:Outdoor_sex_commons.wikimedia.org_all-access_all-agents | 2016-01-01 | 142 | Category:Outdoor_sex | commons.wikimedia.org | all-access | all-agents | commons | Media |
4 | Камызяк_ru.wikipedia.org_all-access_all-agents | 2016-01-01 | 6692 | Камызяк | ru.wikipedia.org | all-access | all-agents | ru | Russian |
for x in ['project', 'agent', 'access', 'lang']:
    print(x)
    print(df[x].value_counts())
    print()
project
en.wikipedia.org         863028
ja.wikipedia.org         756156
de.wikipedia.org         688080
fr.wikipedia.org         637938
zh.wikipedia.org         637572
ru.wikipedia.org         566568
es.wikipedia.org         506544
commons.wikimedia.org    390888
www.mediawiki.org        262422
Name: project, dtype: int64

agent
all-agents    4050156
spider        1259040
Name: agent, dtype: int64

access
all-access    2705106
mobile-web    1331874
desktop       1272216
Name: access, dtype: int64

lang
en         863028
ja         756156
de         688080
fr         637938
zh         637572
ru         566568
es         506544
commons    390888
www        262422
Name: lang, dtype: int64
df.head(2)
Page | date | visits | name | project | access | agent | lang | language | |
---|---|---|---|---|---|---|---|---|---|
0 | Sean_Connery_en.wikipedia.org_desktop_all-agents | 2016-01-01 | 4872 | Sean_Connery | en.wikipedia.org | desktop | all-agents | en | English |
1 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008_fr.wikipedia.org_desktop_all-agents | 2016-01-01 | 6 | Tableau_des_médailles_des_Jeux_olympiques_d'été_de_2008 | fr.wikipedia.org | desktop | all-agents | fr | French |
# top page (highest single-day visits) per language
df.groupby('language')['visits'].apply(lambda x: df.loc[x.nlargest(1).index])
Page | date | visits | name | project | access | agent | lang | language | ||
---|---|---|---|---|---|---|---|---|---|---|
language | ||||||||||
Chinese | 3526717 | 緋彈的亞莉亞角色列表_zh.wikipedia.org_desktop_all-agents | 2016-08-31 | 243557 | 緋彈的亞莉亞角色列表 | zh.wikipedia.org | desktop | all-agents | zh | Chinese |
English | 2714919 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-07-06 | 16592075 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
French | 2163034 | Wikipédia:Accueil_principal_fr.wikipedia.org_all-access_all-agents | 2016-05-29 | 1845404 | Wikipédia:Accueil_principal | fr.wikipedia.org | all-access | all-agents | fr | French |
German | 4439792 | Gerätestecker_de.wikipedia.org_desktop_all-agents | 2016-11-02 | 558381 | Gerätestecker | de.wikipedia.org | desktop | all-agents | de | German |
Japanese | 2724137 | デイヴィッド・ロックフェラー_ja.wikipedia.org_all-access_all-agents | 2016-07-06 | 1651272 | デイヴィッド・ロックフェラー | ja.wikipedia.org | all-access | all-agents | ja | Japanese |
Media | 1865460 | Parsoid/Developer_Setup_www.mediawiki.org_all-access_all-agents | 2016-05-08 | 927825 | Parsoid/Developer_Setup | www.mediawiki.org | all-access | all-agents | www | Media |
Russian | 4179962 | Служебная:Поиск_ru.wikipedia.org_all-access_all-agents | 2016-10-15 | 1412292 | Служебная:Поиск | ru.wikipedia.org | all-access | all-agents | ru | Russian |
Spanish | 2376682 | Nilo_es.wikipedia.org_desktop_all-agents | 2016-06-12 | 783454 | Nilo | es.wikipedia.org | desktop | all-agents | es | Spanish |
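The same per-language table can be produced without the lambda (a sketch): sort once by visits and keep the first row of each language group.
# highest single-day visit count per language
df.sort_values('visits', ascending=False).groupby('language').head(1)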
idx = df.groupby('Page')['visits'].sum().idxmax()
idx
'Special:Search_en.wikipedia.org_desktop_all-agents'
df.query(""" Page == @idx """).head()
Page | date | visits | name | project | access | agent | lang | language | |
---|---|---|---|---|---|---|---|---|---|
2297 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-01-01 | 1401667 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
16803 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-01-02 | 1395136 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
31309 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-01-03 | 1455522 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
45815 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-01-04 | 1750373 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
60321 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-01-05 | 1787494 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
# Special:Search dominates English traffic; I will pick a page other than Special:Search.
df.query('lang == "en"').nlargest(5,'visits')
Page | date | visits | name | project | access | agent | lang | language | |
---|---|---|---|---|---|---|---|---|---|
2714919 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-07-06 | 16592075 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
3556267 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-09-02 | 7599524 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
3570773 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-09-03 | 6894531 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
3541761 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-09-01 | 6878515 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
3585279 | Special:Search_en.wikipedia.org_desktop_all-agents | 2016-09-04 | 6457072 | Special:Search | en.wikipedia.org | desktop | all-agents | en | English |
df.query('lang == "en" and name != "Special:Search" ').nlargest(5,'visits')
Page | date | visits | name | project | access | agent | lang | language | |
---|---|---|---|---|---|---|---|---|---|
801124 | Web_scraping_en.wikipedia.org_all-access_all-agents | 2016-02-25 | 4656065 | Web_scraping | en.wikipedia.org | all-access | all-agents | en | English |
194725 | Alan_Rickman_en.wikipedia.org_all-access_all-agents | 2016-01-14 | 3402109 | Alan_Rickman | en.wikipedia.org | all-access | all-agents | en | English |
757606 | Web_scraping_en.wikipedia.org_all-access_all-agents | 2016-02-22 | 3337999 | Web_scraping | en.wikipedia.org | all-access | all-agents | en | English |
1624114 | Prince_(musician)_en.wikipedia.org_mobile-web_all-agents | 2016-04-21 | 3320724 | Prince_(musician) | en.wikipedia.org | mobile-web | all-agents | en | English |
1638620 | Prince_(musician)_en.wikipedia.org_mobile-web_all-agents | 2016-04-22 | 3290304 | Prince_(musician) | en.wikipedia.org | mobile-web | all-agents | en | English |
# Now I see some interesting pages such as Web scraping, Alan Rickman, and Prince.
# I will use the Prince page as the time series for modelling.
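From here the notebook goes back to the original wide frame: the head(2) below again shows all 551 date columns, and the following cell calls it train. A sketch of the presumed (not shown) reload:
# presumed step: reload the original wide (145063, 551) frame for the Prince lookups below
train = pd.read_csv('../data/train_1.csv.zip', compression='zip', encoding='latin-1').fillna(0)
df = train  # the next few cells use the wide layout under both names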
df.head(2)
Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | 2015-08-06 | 2015-08-07 | 2015-08-08 | 2015-08-09 | 2015-08-10 | 2015-08-11 | 2015-08-12 | 2015-08-13 | 2015-08-14 | 2015-08-15 | 2015-08-16 | 2015-08-17 | 2015-08-18 | ... | 2016-11-12 | 2016-11-13 | 2016-11-14 | 2016-11-15 | 2016-11-16 | 2016-11-17 | 2016-11-18 | 2016-11-19 | 2016-11-20 | 2016-11-21 | 2016-11-22 | 2016-11-23 | 2016-11-24 | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2NE1_zh.wikipedia.org_all-access_spider | 18 | 11 | 5 | 13 | 14 | 9 | 9 | 22 | 26 | 24 | 19 | 10 | 14 | 15 | 8 | 16 | 8 | 8 | 16 | 7 | 11 | 10 | 20 | 18 | 15 | 14 | 49 | 10 | 16 | 18 | 8 | 5 | 9 | 7 | 13 | 9 | 7 | 4 | 11 | 10 | 5 | 9 | 9 | 9 | 9 | 13 | 4 | 15 | 25 | ... | 13 | 8 | 15 | 14 | 12 | 6 | 11 | 10 | 42 | 21 | 24 | 14 | 11 | 204 | 14 | 45 | 33 | 28 | 18 | 14 | 47 | 15 | 14 | 18 | 20 | 14 | 16 | 14 | 20 | 60 | 22 | 15 | 17 | 19 | 18 | 21 | 21 | 47 | 65 | 17 | 32 | 63 | 15 | 26 | 14 | 20 | 22 | 19 | 18 | 20 |
1 | 2PM_zh.wikipedia.org_all-access_spider | 11 | 14 | 15 | 18 | 11 | 13 | 22 | 11 | 10 | 4 | 41 | 65 | 57 | 38 | 20 | 62 | 44 | 15 | 10 | 47 | 24 | 17 | 22 | 9 | 39 | 13 | 11 | 12 | 21 | 19 | 9 | 15 | 33 | 8 | 8 | 7 | 13 | 2 | 23 | 12 | 27 | 27 | 36 | 23 | 58 | 80 | 60 | 69 | 42 | ... | 12 | 11 | 14 | 28 | 23 | 20 | 9 | 12 | 11 | 14 | 14 | 15 | 15 | 11 | 20 | 13 | 19 | 621 | 57 | 17 | 23 | 19 | 21 | 47 | 28 | 22 | 22 | 65 | 27 | 17 | 17 | 13 | 9 | 18 | 22 | 17 | 15 | 22 | 23 | 19 | 17 | 42 | 28 | 15 | 9 | 30 | 52 | 45 | 26 | 20 |
2 rows × 551 columns
cond = train['Page'].str.lower().str.startswith('prince_(musician)_en')
df_prince = train.loc[cond]
df_prince.head()
Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | 2015-08-06 | 2015-08-07 | 2015-08-08 | 2015-08-09 | 2015-08-10 | 2015-08-11 | 2015-08-12 | 2015-08-13 | 2015-08-14 | 2015-08-15 | 2015-08-16 | 2015-08-17 | 2015-08-18 | ... | 2016-11-12 | 2016-11-13 | 2016-11-14 | 2016-11-15 | 2016-11-16 | 2016-11-17 | 2016-11-18 | 2016-11-19 | 2016-11-20 | 2016-11-21 | 2016-11-22 | 2016-11-23 | 2016-11-24 | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11763 | Prince_(musician)_en.wikipedia.org_desktop_all-agents | 3730 | 6722 | 3627 | 3139 | 2849 | 3587 | 4642 | 3904 | 3362 | 3377 | 2767 | 2696 | 3354 | 4100 | 4648 | 3492 | 3121 | 2865 | 2830 | 3284 | 3351 | 3550 | 3617 | 4122 | 3257 | 3376 | 3993 | 4709 | 3396 | 4071 | 3898 | 2850 | 2803 | 3153 | 3444 | 3923 | 4888 | 4550 | 5569 | 5472 | 7046 | 4872 | 3754 | 3755 | 3654 | 3865 | 4376 | 3604 | 3659 | ... | 3419 | 4066 | 4829 | 4377 | 4505 | 4863 | 4824 | 4046 | 3737 | 6632 | 4919 | 4416 | 3740 | 4304 | 4195 | 4665 | 4986 | 5256 | 4865 | 4726 | 4581 | 3750 | 3888 | 4698 | 4823 | 5187 | 5358 | 5153 | 4560 | 4726 | 4888 | 5038 | 5749 | 5693 | 5105 | 3964 | 4430 | 5577 | 4917 | 4489 | 5388 | 4471 | 4309 | 4512 | 13619 | 12610 | 10483 | 8968 | 7914 | 8271 |
35622 | Prince_(musician)_en.wikipedia.org_all-access_spider | 95 | 98 | 78 | 78 | 86 | 73 | 81 | 109 | 69 | 65 | 89 | 48 | 62 | 74 | 58 | 46 | 48 | 49 | 91 | 57 | 32 | 69 | 74 | 56 | 35 | 37 | 61 | 137 | 82 | 63 | 68 | 69 | 51 | 31 | 40 | 41 | 114 | 75 | 46 | 55 | 82 | 48 | 38 | 40 | 27 | 24 | 33 | 42 | 36 | ... | 143 | 140 | 178 | 142 | 141 | 171 | 152 | 178 | 126 | 163 | 145 | 182 | 189 | 261 | 174 | 150 | 138 | 157 | 148 | 189 | 161 | 184 | 154 | 151 | 182 | 191 | 153 | 213 | 579 | 151 | 166 | 143 | 299 | 182 | 180 | 185 | 149 | 882 | 196 | 185 | 172 | 197 | 157 | 169 | 291 | 247 | 245 | 204 | 171 | 209 |
40563 | Prince_(musician)_en.wikipedia.org_all-access_all-agents | 9529 | 13627 | 9163 | 8222 | 7769 | 7640 | 8411 | 8746 | 6970 | 7072 | 7134 | 7313 | 6969 | 12577 | 16418 | 7487 | 7072 | 7314 | 7196 | 6785 | 6736 | 7661 | 9605 | 10634 | 8393 | 8548 | 8893 | 8960 | 7119 | 7939 | 7877 | 7309 | 7366 | 6505 | 7339 | 7798 | 9699 | 9754 | 14827 | 15815 | 14232 | 9910 | 8185 | 8089 | 8374 | 8933 | 10286 | 8747 | 7268 | ... | 8474 | 10774 | 9190 | 8220 | 8744 | 10619 | 12532 | 10791 | 9323 | 22885 | 10711 | 9349 | 9880 | 10420 | 11213 | 16069 | 11077 | 11055 | 9930 | 10507 | 10964 | 10104 | 10321 | 10005 | 9909 | 10521 | 11002 | 11153 | 13712 | 15153 | 10304 | 10504 | 12701 | 14971 | 12159 | 10778 | 11292 | 10883 | 9788 | 9856 | 13222 | 11297 | 15963 | 17002 | 49774 | 34560 | 31090 | 22827 | 19956 | 31446 |
76038 | Prince_(musician)_en.wikipedia.org_mobile-web_all-agents | 5675 | 6705 | 5348 | 4951 | 4771 | 3937 | 3673 | 4708 | 3501 | 3576 | 4236 | 4505 | 3513 | 8316 | 11610 | 3911 | 3852 | 4342 | 4246 | 3381 | 3300 | 4011 | 5835 | 6342 | 4976 | 4992 | 4753 | 4131 | 3615 | 3742 | 3848 | 4316 | 4421 | 3258 | 3780 | 3742 | 4678 | 5045 | 8957 | 10023 | 6965 | 4883 | 4285 | 4201 | 4585 | 4912 | 5742 | 5035 | 3512 | ... | 4864 | 6540 | 4210 | 3730 | 4115 | 5601 | 7471 | 6495 | 5398 | 15921 | 5617 | 4810 | 5943 | 5901 | 6809 | 11134 | 5910 | 5628 | 4911 | 5617 | 6205 | 6144 | 6220 | 5136 | 4926 | 5147 | 5479 | 5827 | 8897 | 10120 | 5266 | 5307 | 6780 | 9051 | 6849 | 6595 | 6663 | 5141 | 4729 | 5215 | 7605 | 6599 | 11256 | 11939 | 34864 | 21210 | 19957 | 13400 | 11663 | 22533 |
4 rows × 551 columns
cond = df['Page'] == "Prince_(musician)_en.wikipedia.org_all-access_all-agents"
df_prince = df.loc[cond]
df_prince.head()
# this is a single row; it will need to be melted into long format later.
Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | 2015-08-06 | 2015-08-07 | 2015-08-08 | 2015-08-09 | 2015-08-10 | 2015-08-11 | 2015-08-12 | 2015-08-13 | 2015-08-14 | 2015-08-15 | 2015-08-16 | 2015-08-17 | 2015-08-18 | ... | 2016-11-12 | 2016-11-13 | 2016-11-14 | 2016-11-15 | 2016-11-16 | 2016-11-17 | 2016-11-18 | 2016-11-19 | 2016-11-20 | 2016-11-21 | 2016-11-22 | 2016-11-23 | 2016-11-24 | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
40563 | Prince_(musician)_en.wikipedia.org_all-access_all-agents | 9529 | 13627 | 9163 | 8222 | 7769 | 7640 | 8411 | 8746 | 6970 | 7072 | 7134 | 7313 | 6969 | 12577 | 16418 | 7487 | 7072 | 7314 | 7196 | 6785 | 6736 | 7661 | 9605 | 10634 | 8393 | 8548 | 8893 | 8960 | 7119 | 7939 | 7877 | 7309 | 7366 | 6505 | 7339 | 7798 | 9699 | 9754 | 14827 | 15815 | 14232 | 9910 | 8185 | 8089 | 8374 | 8933 | 10286 | 8747 | 7268 | ... | 8474 | 10774 | 9190 | 8220 | 8744 | 10619 | 12532 | 10791 | 9323 | 22885 | 10711 | 9349 | 9880 | 10420 | 11213 | 16069 | 11077 | 11055 | 9930 | 10507 | 10964 | 10104 | 10321 | 10005 | 9909 | 10521 | 11002 | 11153 | 13712 | 15153 | 10304 | 10504 | 12701 | 14971 | 12159 | 10778 | 11292 | 10883 | 9788 | 9856 | 13222 | 11297 | 15963 | 17002 | 49774 | 34560 | 31090 | 22827 | 19956 | 31446 |
1 rows × 551 columns
df_prince.filter(regex="Page|2016").iloc[:5, np.r_[0, 1,2,-2,-1]]
Page | 2016-01-01 | 2016-01-02 | 2016-12-30 | 2016-12-31 | |
---|---|---|---|---|---|
40563 | Prince_(musician)_en.wikipedia.org_all-access_all-agents | 20947 | 19466 | 19956 | 31446 |
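As noted above, the single wide row still has to be reshaped into a (date, visits) series before modelling; a sketch:
# melt the one wide row into a long, date-indexed time series
ts_prince = (df_prince
             .melt(id_vars=['Page'], var_name='date', value_name='visits')
             .assign(date=lambda d: pd.to_datetime(d['date']))
             .set_index('date')['visits']
             .sort_index())
ts_prince.tail()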
# %%time
# df.to_csv('../data/data_cleaned_2016_sample.csv',index=False)
# 1.14 GB; can't be uploaded to GitHub.
# CPU times: user 2min 15s, sys: 2.14 s, total: 2min 17s
# Wall time: 2min 36s
# pandas writes the file faster than dask here.
# %%time
# ddf = dd.from_pandas(df, npartitions=40)
# ddf.to_csv('../../data/wiki/processed/wikipedia_2016_frac01.csv',index=False)
# CPU times: user 2min 34s, sys: 3.74 s, total: 2min 38s
# Wall time: 2min 45s
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
'{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 2 min 14 secs