Problem Description

Goal

Online shops often sell tons of different items, and this can become very messy very quickly!

Data science can be extremely useful for automatically organizing products into categories so that customers can find them easily.

The goal of this challenge is to look at user purchase history and create categories of items that are likely to be bought together and, therefore, should belong to the same section.

Challenge Description

Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience.

However, they now have a lot of data about user purchase history, and they would like to put that data to use!

This is what they asked you to do:

  1. The company founder wants to meet some of the best customers and run a focus group with them. You are asked to send the founder the IDs of the following customers:

a. The customer who bought the most items overall in her lifetime

b. For each item, the customer who bought that product the most

  2. Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.

Imports

In [1]:
import time
time_start_notebook = time.time()

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

pd.options.display.max_columns = 100

SEED = 100
np.random.seed(SEED) # note: re-seed in each cell that uses randomness for reproducibility

plt.style.use('ggplot')
%matplotlib inline
In [2]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Load the data

In [3]:
!ls data/
item_to_id.csv       purchase_history.csv
In [4]:
df_item = pd.read_csv('data/item_to_id.csv')

print(df_item.shape)
df_item.head(2).append(df_item.tail(2))
(48, 2)
Out[4]:
Item_name Item_id
0 coffee 43
1 tea 23
46 baby items 4
47 pet items 3
In [5]:
df_item['Item_id'].nunique(), df_item.shape[0] # we have 48 different item_id
Out[5]:
(48, 48)
In [6]:
df_item['Item_name'].nunique(), df_item.shape[0] # we also have 48 different item names
Out[6]:
(48, 48)
In [7]:
df_item.set_index('Item_id',inplace=True)
In [8]:
df_item.head(2)
Out[8]:
Item_name
Item_id
43 coffee
23 tea
In [9]:
# df_item['Item_name'].to_dict()
In [10]:
df = pd.read_csv('data/purchase_history.csv')

print(df.shape)
df.head(2).append(df.tail(2)) # 39,474 purchase records (one row per order, not per customer)
(39474, 2)
Out[10]:
user_id id
0 222087 27,26
1 1343649 6,47,17
39472 651509 13,39,20,32,37,1,3,18,34
39473 517744 39,40,20,3,10,42,38,43,28,25,13
In [11]:
df['user_id'].nunique(), df.shape[0]
Out[11]:
(24885, 39474)
In [12]:
# user_id is not unique: the same user has shopped more than once.
In [13]:
df['id'] = df['id'].astype(str).str.split(',')
In [14]:
df.head(2)
Out[14]:
user_id id
0 222087 [27, 26]
1 1343649 [6, 47, 17]
In [15]:
df = df.explode('id')
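explode turns each element of the list column into its own row, repeating the user_id (and the original index). A minimal sketch on a one-row toy frame, using the first order from the data above:

toy = pd.DataFrame({'user_id': [222087], 'id': [['27', '26']]})
print(toy.explode('id'))
#    user_id  id
# 0   222087  27
# 0   222087  26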
In [16]:
df = pd.crosstab(df['user_id'], df['id'],margins=True,margins_name='total')
df.head()
Out[16]:
id 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 4 40 41 42 43 44 45 46 47 48 5 6 7 8 9 total
user_id
47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
113 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 11
123 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
223 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 11
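pd.crosstab counts occurrences of each (user, item) pair, producing the user-by-item purchase-count matrix; margins=True appends the 'total' row and column used below. A sketch of the same call on assumed toy data:

toy = pd.DataFrame({'user_id': [1, 1, 2], 'id': ['26', '27', '27']})
print(pd.crosstab(toy['user_id'], toy['id'], margins=True, margins_name='total'))
# id       26  27  total
# user_id
# 1         1   1      2
# 2         0   1      1
# total     1   2      3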
In [17]:
df.columns
Out[17]:
Index(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2',
       '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30',
       '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41',
       '42', '43', '44', '45', '46', '47', '48', '5', '6', '7', '8', '9',
       'total'],
      dtype='object', name='id')
In [18]:
# df_item['Item_name'].to_dict()
In [19]:
mapping = df_item['Item_name'].to_dict()
mapping = {str(k): v for k,v in mapping.items()} # make string
# mapping
In [20]:
df.columns = df.columns.map(mapping).fillna('item_total')
df.head(2)
Out[20]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
In [21]:
df.index = 'user_' + df.index.astype(str)

df.head(2).append(df.tail(2))
Out[21]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
user_1499974 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 13
user_total 9119 8859 2730 3270 8695 5700 3281 5782 8627 6000 3346 14473 8951 5666 5977 8723 3278 9019 6449 6115 6365 3356 7204 2631 2731 9008 3322 8782 3282 6042 8695 8814 8728 3262 8878 6420 9050 8784 8980 8902 8921 8682 5673 2817 8730 3343 5717 8816 319995

Qn1: the customer who bought the most items overall in her lifetime

In [22]:
df.iloc[:-1].nlargest(5,'item_total').iloc[:,-1:]
Out[22]:
id item_total
user_id
user_269335 72
user_367872 70
user_397623 64
user_599172 64
user_377284 63
In [23]:
"""
This is the list of the top 5 users who bought the most items.
""";

Qn2: for each item, the customer who bought that product the most

In [24]:
df.head(2).append(df.tail(2))
Out[24]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
user_1499974 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 13
user_total 9119 8859 2730 3270 8695 5700 3281 5782 8627 6000 3346 14473 8951 5666 5977 8723 3278 9019 6449 6115 6365 3356 7204 2631 2731 9008 3322 8782 3282 6042 8695 8814 8728 3262 8878 6420 9050 8784 8980 8902 8921 8682 5673 2817 8730 3343 5717 8816 319995
In [25]:
df.iloc[:-1].T.idxmax(axis=1).to_frame('user').head()
Out[25]:
user
id
sugar user_31625
carrots user_618914
cereals user_367872
shampoo user_557904
bagels user_653800
In [26]:
df.iloc[:-1].apply(lambda s: pd.Series(
    [s.idxmax(), s.max()],
    index=['max_user','max_count']
    )).T.head()
Out[26]:
max_user max_count
id
sugar user_31625 4
carrots user_618914 4
cereals user_367872 3
shampoo user_557904 3
bagels user_653800 4
In [27]:
# validate the result
df.head(2)
Out[27]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 13
user_68 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 8
In [28]:
df.loc['user_31625'].to_frame().T # this user bought sugar 4 times in total
Out[28]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda item_total
user_31625 4 2 0 0 0 0 0 0 0 0 3 5 1 0 2 1 0 2 0 1 1 2 1 1 1 1 0 0 2 0 3 0 0 0 1 0 2 2 4 2 3 1 1 1 1 0 1 2 54
In [29]:
df[['sugar']].T.filter(regex='user_31625')
Out[29]:
user_id user_31625
id
sugar 4
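The same table can be built without apply: column-wise idxmax and max give, for each item, the heaviest buyer and their count in one shot. A sketch (dropping the margin row and the item_total column first):

items = df.iloc[:-1].drop(columns='item_total')
df_top_buyers = pd.DataFrame({'max_user': items.idxmax(),
                              'max_count': items.max()})
print(df_top_buyers.head())  # should match Out[26] above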

Qn3: cluster items based on user co-purchase history

Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.

In [30]:
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
In [31]:
df.head(2).append(df.tail(2)).iloc[:,-5:]
Out[31]:
id poultry sandwich bags butter soda item_total
user_id
user_47 0 0 0 0 13
user_68 1 0 0 0 8
user_1499974 0 1 1 0 13
user_total 8730 3343 5717 8816 319995
In [32]:
df.shape
Out[32]:
(24886, 49)
In [33]:
# remove the 'total' margin row and column
df = df.iloc[:-1,:-1]

print(df.shape)

df.head(2).append(df.tail(2)).iloc[:,-5:]
(24885, 48)
Out[33]:
id waffles poultry sandwich bags butter soda
user_id
user_47 0 0 0 0 0
user_68 0 1 0 0 0
user_1499817 0 1 0 1 1
user_1499974 1 0 1 1 0

Similarity matrix

In [34]:
from sklearn.preprocessing import normalize

item_norm = normalize(df,axis=0) # L2-normalize each item column (not the user rows)
item_sim = item_norm.T.dot(item_norm)

df_item_sim = pd.DataFrame(item_sim,
                       index=df.columns,
                       columns=df.columns)

print(df_item_sim.shape)
df_item_sim.head(2)
(48, 48)
Out[34]:
id sugar carrots cereals shampoo bagels eggs aluminum foil milk beef laundry detergent shaving cream lettuce grapefruit cheeses frozen vegetables tea paper towels cherries spaghetti sauce dishwashing canned vegetables hand soap pet items flour pasta apples toilet paper tortillas soap ice cream dinner rolls juice sandwich loaves baby items berries ketchup cucumbers coffee broccoli cauliflower bananas pork yogurt waffles poultry sandwich bags butter soda
id
sugar 1.000000 0.390641 0.263889 0.260103 0.392767 0.336052 0.258082 0.338954 0.386424 0.338680 0.273667 0.506895 0.391199 0.339366 0.344111 0.397598 0.272608 0.398554 0.351104 0.338670 0.357550 0.270909 0.420145 0.259259 0.264944 0.395828 0.275207 0.403212 0.262791 0.345662 0.395159 0.401602 0.388034 0.296986 0.390286 0.358599 0.393056 0.395696 0.396766 0.390253 0.394998 0.392164 0.328221 0.271132 0.388250 0.271743 0.335303 0.403690
carrots 0.390641 1.000000 0.232861 0.246780 0.379568 0.323319 0.257784 0.325523 0.386199 0.325983 0.256384 0.527894 0.382947 0.322191 0.335931 0.377047 0.258435 0.384620 0.332691 0.330508 0.343154 0.257511 0.352597 0.236888 0.239542 0.375208 0.261217 0.379041 0.248531 0.324006 0.378677 0.382540 0.381372 0.258313 0.380613 0.333471 0.488734 0.372790 0.483275 0.483897 0.384714 0.381086 0.327807 0.234254 0.376518 0.250765 0.338472 0.381207
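Because every item column was L2-normalized, item_norm.T.dot(item_norm) is exactly the cosine similarity between item purchase vectors. A quick equivalence check (sketch):

from sklearn.metrics.pairwise import cosine_similarity

# rows of df.T are items, so this is the 48 x 48 item-item cosine similarity
assert np.allclose(item_sim, cosine_similarity(df.T))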

Choose number of clusters K

In [35]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
In [36]:
kmeans = KMeans()
kmeans
Out[36]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [37]:
inertias = []
silhouettes = []

ks = range(2,30)
for k in ks:
    kmeans = KMeans(n_clusters=k,random_state=SEED,
                init='k-means++',n_jobs=-1)
    kmeans.fit(df_item_sim)
    
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(df_item_sim, kmeans.predict(df_item_sim)))
In [38]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(20,8))

ax1.plot(ks, inertias,marker='o')
ax1.set_xticks(range(len(ks)+3));
ax1.set_title('Plot of inertia')
ax1.set_xlabel('Number of cluster')
ax1.set_ylabel('Inertia');

# silhouette plot
ax2.plot(ks,silhouettes,marker='o')
ax2.set_xticks(range(len(ks)+3));
ax2.set_title('Plot of Silhouettes')
ax2.set_xlabel('Number of cluster')
ax2.set_ylabel('Silhouette');

ax2.axvline(15,color='b',ls='--');
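Rather than eyeballing the silhouette curve for the dashed line at k = 15, the k with the highest silhouette score can be read off programmatically; a small sketch:

best_k = list(ks)[int(np.argmax(silhouettes))]
print('k with the highest silhouette score:', best_k)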

Fitting the KMeans model

In [39]:
n_clusters = 15

kmeans = KMeans(n_clusters=n_clusters,random_state=SEED,
                init='k-means++',n_jobs=-1)

kmeans.fit(df_item_sim)
Out[39]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=15, n_init=10, n_jobs=-1, precompute_distances='auto',
       random_state=100, tol=0.0001, verbose=0)
In [40]:
kmeans.labels_
Out[40]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
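A quick sanity check on how the 48 items spread across the 15 clusters (sketch):

print(pd.Series(kmeans.labels_).value_counts().sort_index())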

Reduce dimension using PCA and visualize

In [41]:
from sklearn.decomposition import PCA

pca = PCA()
pca
Out[41]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [42]:
pca = PCA(n_components=2,random_state=SEED)

arr_pca = pca.fit_transform(df_item_sim)
In [43]:
arr_pca.shape
Out[43]:
(48, 2)
In [44]:
df_pca = pd.DataFrame(arr_pca, columns=['pc_0', 'pc_1'],
                     index=df_item_sim.index.to_numpy())

df_pca.head(2)
Out[44]:
pc_0 pc_1
sugar 0.355639 -0.025103
carrots 0.353093 0.000351
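With only 2 of 48 possible components, it is worth checking how much variance the 2-D projection actually retains (sketch):

print(pca.explained_variance_ratio_)        # variance fraction of pc_0 and pc_1
print(pca.explained_variance_ratio_.sum())  # total fraction kept in the plot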
In [45]:
kmeans.labels_
Out[45]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
In [46]:
len(kmeans.labels_)
Out[46]:
48
In [47]:
colors_dict = dict(enumerate(sns.color_palette('Set2',n_clusters)))
# colors_dict
In [48]:
mycolors = [colors_dict[k] for k in kmeans.labels_]

# mycolors
In [49]:
colors_dict = dict(enumerate(sns.color_palette('magma',n_clusters)))
mycolors = [colors_dict[k] for k in kmeans.labels_]

fig, ax = plt.subplots(figsize=(20,20))
df_pca.plot.scatter(x='pc_0',y='pc_1',color=mycolors,ax=ax)

i = 0
for name,(x,y) in df_pca.iloc[:,:2].iterrows():
    ax.annotate(name,[x,y], 
                xytext=(10,-5), 
                textcoords='offset points',
                size=24, 
                color=mycolors[i])
    i+=1

Find the clustered groups

In [51]:
df_item_sim.columns
Out[51]:
Index(['sugar', 'carrots', 'cereals', 'shampoo', 'bagels', 'eggs',
       'aluminum foil', 'milk', 'beef', 'laundry detergent', 'shaving cream',
       'lettuce', 'grapefruit', 'cheeses', 'frozen vegetables', 'tea',
       'paper towels', 'cherries', 'spaghetti sauce', 'dishwashing ',
       'canned vegetables', 'hand soap', 'pet items', 'flour', 'pasta',
       'apples', 'toilet paper', 'tortillas', 'soap', 'ice cream',
       'dinner rolls', 'juice', 'sandwich loaves', 'baby items', 'berries',
       'ketchup', 'cucumbers', 'coffee', 'broccoli', 'cauliflower', 'bananas',
       'pork', 'yogurt', 'waffles', 'poultry', 'sandwich bags', 'butter',
       'soda'],
      dtype='object', name='id')
In [52]:
kmeans.labels_
Out[52]:
array([ 7,  0, 14,  1, 12,  5,  3,  5, 11,  9,  1,  0,  8,  5, 13,  2,  3,
        8,  6,  9,  6,  1,  7,  4,  4,  8,  3, 12,  1, 13, 12,  2, 12, 10,
        8,  6,  0,  2,  0,  0,  8, 11,  5,  4, 11,  3,  5,  2],
      dtype=int32)
In [53]:
df_labels = pd.DataFrame({
    'item': df_item_sim.columns.to_numpy(),
    'label': kmeans.labels_
})


df_labels.head(2)
Out[53]:
item label
0 sugar 7
1 carrots 0
In [64]:
pd.set_option('display.max_colwidth', None)  # None (not the deprecated -1) shows full cell contents
df_labels.groupby('label')['item'].apply(list).to_frame()
Out[64]:
item
label
0 [carrots, lettuce, cucumbers, broccoli, cauliflower]
1 [shampoo, shaving cream, hand soap, soap]
2 [tea, juice, coffee, soda]
3 [aluminum foil, paper towels, toilet paper, sandwich bags]
4 [flour, pasta, waffles]
5 [eggs, milk, cheeses, yogurt, butter]
6 [spaghetti sauce, canned vegetables, ketchup]
7 [sugar, pet items]
8 [grapefruit, cherries, apples, berries, bananas]
9 [laundry detergent, dishwashing ]
10 [baby items]
11 [beef, pork, poultry]
12 [bagels, tortillas, dinner rolls, sandwich loaves]
13 [frozen vegetables, ice cream]
14 [cereals]
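If these clusters are to replace the manually created categories on the site, the item-to-cluster assignment can be exported directly; a sketch (the output file name is a placeholder assumption):

item_to_cluster = dict(zip(df_labels['item'], df_labels['label']))
df_labels.to_csv('item_clusters.csv', index=False)  # hypothetical output file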

Time taken

In [65]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 13 min 39 secs
In [ ]:
import glob, subprocess
# subprocess does not expand shell globs, so expand '*.ipynb' explicitly
for nb in glob.glob('*.ipynb'):
    subprocess.call(['python', '-m', 'nbconvert', '--to', 'html', nb])