Problem Description

Goal

Online shops often sell tons of different items, and this can become very messy very quickly!

Data science can be extremely useful for automatically organizing products into categories so that customers can find them easily.

The goal of this challenge is to look at user purchase history and create categories of items that are likely to be bought together and, therefore, should belong to the same section.

Challenge Description

Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience.

However, they now have a lot of data about user purchase history, and they would like to put that data to use!

This is what they asked you to do:

  1. The company founder wants to meet some of the best customers and run a focus group with them. You are asked to send the founder the IDs of the following customers:

a. The customer who bought the most items overall in her lifetime

b. For each item, the customer who bought that product the most

  2. Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.

Imports

In [1]:
import time
time_start_notebook = time.time()

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

pd.options.display.max_columns = 100

SEED = 100
np.random.seed(SEED) # note: re-seed in each cell that uses randomness for reproducibility

plt.style.use('ggplot')
%matplotlib inline
In [2]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Load the data

In [3]:
!ls data/
item_to_id.csv       purchase_history.csv
In [4]:
df_item = pd.read_csv('data/item_to_id.csv')

print(df_item.shape)
df_item.head(2).append(df_item.tail(2))
(48, 2)
Out[4]:
Item_name Item_id
0 coffee 43
1 tea 23
46 baby items 4
47 pet items 3
In [5]:
df_item['Item_id'].nunique(), df_item.shape[0] # we have 48 different item_id
Out[5]:
(48, 48)
In [6]:
df_item['Item_name'].nunique(), df_item.shape[0] # we also have 48 different item names
Out[6]:
(48, 48)
In [7]:
df_item.set_index('Item_id',inplace=True)
In [8]:
df_item.head(2)
Out[8]:
Item_name
Item_id
43 coffee
23 tea
In [9]:
# df_item['Item_name'].to_dict()
In [10]:
df = pd.read_csv('data/purchase_history.csv')

print(df.shape)
df.head(2).append(df.tail(2)) # 39,474 purchase records (one row per order, not per customer)
(39474, 2)
Out[10]:
user_id id
0 222087 27,26
1 1343649 6,47,17
39472 651509 13,39,20,32,37,1,3,18,34
39473 517744 39,40,20,3,10,42,38,43,28,25,13
In [11]:
df['user_id'].nunique(), df.shape[0]
Out[11]:
(24885, 39474)
In [12]:
# user_id is not unique: the same user has shopped more than once.
In [13]:
df['id'] = df['id'].astype(str).str.split(',')
In [14]:
df.head(2)
Out[14]:
user_id id
0 222087 [27, 26]
1 1343649 [6, 47, 17]
In [15]:
df = df.explode('id')
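explode turns each element of the list column into its own row, repeating the user_id (and the original index). A minimal sketch on a one-row toy frame, using the first order from the data above:

toy = pd.DataFrame({'user_id': [222087], 'id': [['27', '26']]})
print(toy.explode('id'))
#    user_id  id
# 0   222087  27
# 0   222087  26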
In [16]:
df = pd.crosstab(df['user_id'], df['id'],margins=True,margins_name='total')
df.head()
Out[16]:
id 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 4 40 41 42 43 44 45 46 47 48 5 6 7 8 9 total
user_id
47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
113 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 11
123 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
223 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 11
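pd.crosstab counts occurrences of each (user, item) pair, producing the user-by-item purchase-count matrix; margins=True appends the 'total' row and column used below. A sketch of the same call on assumed toy data:

toy = pd.DataFrame({'user_id': [1, 1, 2], 'id': ['26', '27', '27']})
print(pd.crosstab(toy['user_id'], toy['id'], margins=True, margins_name='total'))
# id       26  27  total
# user_id
# 1         1   1      2
# 2         0   1      1
# total     1   2      3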
In [17]:
df.columns
Out[17]:
Index(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2',
       '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30',
       '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41',
       '42', '43', '44', '45', '46', '47', '48', '5', '6', '7', '8', '9',
       'total'],
      dtype='object', name='id')
In [18]:
# df_item['Item_name'].to_dict()
In [19]:
mapping = df_item['Item_name'].to_dict()
mapping = {str(k): v for k,v in mapping.items()} # make string
# mapping
In [20]:
df.columns = df.columns.map(mapping).fillna('item_total')
df.head(2)
Out[20]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
In [21]:
df.index = 'user_' + df.index.astype(str)

df.head(2).append(df.tail(2))
Out[21]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
user_1499974 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 13
user_total 9119 8859 2730 3270 8695 5700 3281 5782 8627 6000 3346 14473 8951 5666 5977 8723 3278 9019 6449 6115 6365 3356 7204 2631 2731 9008 3322 8782 3282 6042 8695 8814 8728 3262 8878 6420 9050 8784 8980 8902 8921 8682 5673 2817 8730 3343 5717 8816 319995

Qn1: the customer who bought the most items overall in her lifetime

In [22]:
df.iloc[:-1].nlargest(5,'item_total').iloc[:,-1:]
Out[22]:
id item_total
user_id
user_269335 72
user_367872 70
user_397623 64
user_599172 64
user_377284 63
In [23]:
"""
This is the list of the top 5 users who bought the most items.
""";

Qn2: for each item, the customer who bought that product the most

In [24]:
df.head(2).append(df.tail(2))
Out[24]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
user_1499974 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 13
user_total 9119 8859 2730 3270 8695 5700 3281 5782 8627 6000 3346 14473 8951 5666 5977 8723 3278 9019 6449 6115 6365 3356 7204 2631 2731 9008 3322 8782 3282 6042 8695 8814 8728 3262 8878 6420 9050 8784 8980 8902 8921 8682 5673 2817 8730 3343 5717 8816 319995
In [25]:
df.iloc[:-1].T.idxmax(axis=1).to_frame('user').head()
Out[25]:
user
id
sugar user_31625
carrots user_618914
cereals user_367872
shampoo user_557904
bagels user_653800
In [26]:
df.iloc[:-1].apply(lambda s: pd.Series(
    [s.idxmax(), s.max()],
    index=['max_user','max_count']
    )).T.head()
Out[26]:
max_user max_count
id
sugar user_31625 4
carrots user_618914 4
cereals user_367872 3
shampoo user_557904 3
bagels user_653800 4
In [27]:
# validate the result
df.head(2)
Out[27]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
In [28]:
df.loc['user_31625'].to_frame().T # this user bought sugar 4 times in total
Out[28]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_31625 4 2 0 0 0 0 0 0 0 0 3 5 1 0 2 1 0 2 0 1 1 2 1 1 1 1 0 0 2 0 3 0 0 0 1 0 2 2 4 2 3 1 1 1 1 0 1 2 54
In [29]:
df[['sugar']].T.filter(regex='user_31625')
Out[29]:
user_id user_31625
id
sugar 4
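The same table can be built without apply: column-wise idxmax and max give, for each item, the heaviest buyer and their count in one shot. A sketch (dropping the margin row and the item_total column first):

items = df.iloc[:-1].drop(columns='item_total')
df_top_buyers = pd.DataFrame({'max_user': items.idxmax(),
                              'max_count': items.max()})
print(df_top_buyers.head())  # should match Out[26] above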

Qn3: cluster items based on user co-purchase history

Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.

In [30]:
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
In [31]:
df.head(2).append(df.tail(2)).iloc[:,-5:]
Out[31]:
id poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 13
user_68 1 0 0 0 8
user_1499974 0 1 1 0 13
user_total 8730 3343 5717 8816 319995
In [32]:
df.shape
Out[32]:
(24886, 49)
In [33]:
# remove the 'total' margin row and column
df = df.iloc[:-1,:-1]

print(df.shape)

df.head(2).append(df.tail(2)).iloc[:,-5:]
(24885, 48)
Out[33]:
id waffles poultry sandwich bags butter soda
user_id
user_47 0 0 0 0 0
user_68 0 1 0 0 0
user_1499817 0 1 0 1 1
user_1499974 1 0 1 1 0

Similarity matrix

In [34]:
from sklearn.preprocessing import normalize

item_norm = normalize(df,axis=0) # L2-normalize each item column (not the user rows)
item_sim = item_norm.T.dot(item_norm)

df_item_sim = pd.DataFrame(item_sim,
                       index=df.columns,
                       columns=df.columns)

print(df_item_sim.shape)
df_item_sim.head(2)
(48, 48)
Out[34]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda
id
sugar 1.000000 0.390641 0.263889 0.260103 0.392767 0.336052 0.258082 0.338954 0.386424 0.338680 0.273667 0.506895 0.391199 0.339366 0.344111 0.397598 0.272608 0.398554 0.351104 0.338670 0.357550 0.270909 0.420145 0.259259 0.264944 0.395828 0.275207 0.403212 0.262791 0.345662 0.395159 0.401602 0.388034 0.296986 0.390286 0.358599 0.393056 0.395696 0.396766 0.390253 0.394998 0.392164 0.328221 0.271132 0.388250 0.271743 0.335303 0.403690
carrots 0.390641 1.000000 0.232861 0.246780 0.379568 0.323319 0.257784 0.325523 0.386199 0.325983 0.256384 0.527894 0.382947 0.322191 0.335931 0.377047 0.258435 0.384620 0.332691 0.330508 0.343154 0.257511 0.352597 0.236888 0.239542 0.375208 0.261217 0.379041 0.248531 0.324006 0.378677 0.382540 0.381372 0.258313 0.380613 0.333471 0.488734 0.372790 0.483275 0.483897 0.384714 0.381086 0.327807 0.234254 0.376518 0.250765 0.338472 0.381207
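Because every item column was L2-normalized, item_norm.T.dot(item_norm) is exactly the cosine similarity between item purchase vectors. A quick equivalence check (sketch):

from sklearn.metrics.pairwise import cosine_similarity

# rows of df.T are items, so this is the 48 x 48 item-item cosine similarity
assert np.allclose(item_sim, cosine_similarity(df.T))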

Choose number of clusters K

In [35]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
In [36]:
kmeans = KMeans()
kmeans
Out[36]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [37]:
inertias = []
silhouettes = []

ks = range(2,30)
for k in ks:
    kmeans = KMeans(n_clusters=k,random_state=SEED,
                init='k-means++',n_jobs=-1)
    kmeans.fit(df_item_sim)
    
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(df_item_sim, kmeans.predict(df_item_sim)))
In [38]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(20,8))

ax1.plot(ks, inertias,marker='o')
ax1.set_xticks(range(len(ks)+3));
ax1.set_title('Plot of inertia')
ax1.set_xlabel('Number of cluster')
ax1.set_ylabel('Inertia');

# silhouette plot
ax2.plot(ks,silhouettes,marker='o')
ax2.set_xticks(range(len(ks)+3));
ax2.set_title('Plot of Silhouettes')
ax2.set_xlabel('Number of cluster')
ax2.set_ylabel('Silhouette');

ax2.axvline(15,color='b',ls='--');
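Rather than eyeballing the silhouette curve for the dashed line at k = 15, the k with the highest silhouette score can be read off programmatically; a small sketch:

best_k = list(ks)[int(np.argmax(silhouettes))]
print('k with the highest silhouette score:', best_k)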

Fitting the KMeans model

In [39]:
n_clusters = 15

kmeans = KMeans(n_clusters=n_clusters,random_state=SEED,
                init='k-means++',n_jobs=-1)

kmeans.fit(df_item_sim)
Out[39]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=15, n_init=10, n_jobs=-1, precompute_distances='auto',
       random_state=100, tol=0.0001, verbose=0)
In [40]:
kmeans.labels_
Out[40]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
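A quick sanity check on how the 48 items spread across the 15 clusters (sketch):

print(pd.Series(kmeans.labels_).value_counts().sort_index())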

Reduce dimension using PCA and visualize

In [41]:
from sklearn.decomposition import PCA

pca = PCA()
pca
Out[41]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [42]:
pca = PCA(n_components=2,random_state=SEED)

arr_pca = pca.fit_transform(df_item_sim)
In [43]:
arr_pca.shape
Out[43]:
(48, 2)
In [44]:
df_pca = pd.DataFrame(arr_pca, columns=['pc_0', 'pc_1'],
                     index=df_item_sim.index.to_numpy())

df_pca.head(2)
Out[44]:
pc_0 pc_1
sugar 0.355639 -0.025103
carrots 0.353093 0.000351
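With only 2 of 48 possible components, it is worth checking how much variance the 2-D projection actually retains (sketch):

print(pca.explained_variance_ratio_)        # variance fraction of pc_0 and pc_1
print(pca.explained_variance_ratio_.sum())  # total fraction kept in the plot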
In [45]:
kmeans.labels_
Out[45]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
In [46]:
len(kmeans.labels_)
Out[46]:
48
In [47]:
colors_dict = dict(enumerate(sns.color_palette('Set2',n_clusters)))
# colors_dict
In [48]:
mycolors = [colors_dict[k] for k in kmeans.labels_]

# mycolors
In [49]:
colors_dict = dict(enumerate(sns.color_palette('magma',n_clusters)))
mycolors = [colors_dict[k] for k in kmeans.labels_]

fig, ax = plt.subplots(figsize=(20,20))
df_pca.plot.scatter(x='pc_0',y='pc_1',color=mycolors,ax=ax)

i = 0
for name,(x,y) in df_pca.iloc[:,:2].iterrows():
    ax.annotate(name,[x,y], 
                xytext=(10,-5), 
                textcoords='offset points',
                size=24, 
                color=mycolors[i])
    i+=1

Find the clustered groups

In [51]:
df_item_sim.columns
Out[51]:
Index(['sugar', 'carrots', 'cereals', 'shampoo', 'bagels', 'eggs',
       'aluminum foil', 'milk', 'beef', 'laundry detergent', 'shaving cream',
       'lettuce', 'grapefruit', 'cheeses', 'frozen vegetables', 'tea',
       'paper towels', 'cherries', 'spaghetti sauce', 'dishwashing ',
       'canned vegetables', 'hand soap', 'pet items', 'flour', 'pasta',
       'apples', 'toilet paper', 'tortillas', 'soap', 'ice cream',
       'dinner rolls', 'juice', 'sandwich loaves', 'baby items', 'berries',
       'ketchup', 'cucumbers', 'coffee', 'broccoli', 'cauliflower', 'bananas',
       'pork', 'yogurt', 'waffles', 'poultry', 'sandwich bags', 'butter',
       'soda'],
      dtype='object', name='id')
In [52]:
kmeans.labels_
Out[52]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
In [53]:
df_labels = pd.DataFrame({
    'item': df_item_sim.columns.to_numpy(),
    'label': kmeans.labels_
})


df_labels.head(2)
Out[53]:
item label
0 sugar 7
1 carrots 0
In [64]:
pd.set_option('display.max_colwidth', None)  # None (not the deprecated -1) shows full cell contents
df_labels.groupby('label')['item'].apply(list).to_frame()
Out[64]:
item
label
0 [carrots, lettuce, cucumbers, broccoli, cauliflower]
1 [shampoo, shaving cream, hand soap, soap]
2 [tea, juice, coffee, soda]
3 [aluminum foil, paper towels, toilet paper, sandwich bags]
4 [flour, pasta, waffles]
5 [eggs, milk, cheeses, yogurt, butter]
6 [spaghetti sauce, canned vegetables, ketchup]
7 [sugar, pet items]
8 [grapefruit, cherries, apples, berries, bananas]
9 [laundry detergent, dishwashing ]
10 [baby items]
11 [beef, pork, poultry]
12 [bagels, tortillas, dinner rolls, sandwich loaves]
13 [frozen vegetables, ice cream]
14 [cereals]
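If these clusters are to replace the manually created categories on the site, the item-to-cluster assignment can be exported directly; a sketch (the output file name is a placeholder assumption):

item_to_cluster = dict(zip(df_labels['item'], df_labels['label']))
df_labels.to_csv('item_clusters.csv', index=False)  # hypothetical output file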

Time taken

In [65]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 13 min 39 secs
In [ ]:
import glob, subprocess
# subprocess does not expand shell globs, so expand '*.ipynb' explicitly
for nb in glob.glob('*.ipynb'):
    subprocess.call(['python', '-m', 'nbconvert', '--to', 'html', nb])