01: Connect to Kaggle to download data
- Download `kaggle.json` from your Kaggle account settings (Create New API Token)
# Install Kaggle API
!pip install -q kaggle
from google.colab import files
files.upload()
!ls
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d zygmunt/goodbooks-10k --unzip
!ls
# we use books.csv and ratings.csv
!head -n 4 ratings.csv
!head -n 4 books.csv
!head -n 4 tags.csv
!head -n 4 book_tags.csv
02: Data Loading and Preparation
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import os
# Load data
ratings = pd.read_csv('ratings.csv')
books = pd.read_csv('books.csv')
# Prepare data
n_users = ratings['user_id'].nunique()
n_books = ratings['book_id'].nunique()
# Split data
train, test = train_test_split(ratings, test_size=0.2, random_state=42)
# Create Dataset class
class BookRatingDataset(Dataset):
    def __init__(self, df):
        self.users = torch.LongTensor(df['user_id'].values)
        self.books = torch.LongTensor(df['book_id'].values)
        self.ratings = torch.FloatTensor(df['rating'].values)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.books[idx], self.ratings[idx]
# Create model
class BookRecommender(nn.Module):
    def __init__(self, n_users, n_books, emb_dim=5):
        super().__init__()
        self.user_emb = nn.Embedding(n_users + 1, emb_dim)  # +1 because IDs start at 1 (index 0 unused)
        self.book_emb = nn.Embedding(n_books + 1, emb_dim)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.book_emb.weight.data.uniform_(0, 0.05)

    def forward(self, users, books):
        u = self.user_emb(users)
        b = self.book_emb(books)
        return (u * b).sum(dim=1)  # dot product per (user, book) pair
# Initialize
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BookRecommender(n_users, n_books).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Create data loaders
train_dataset = BookRatingDataset(train)
test_dataset = BookRatingDataset(test)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
# Training
def train_model(epochs=5):
    model.train()
    for epoch in range(epochs):
        train_loss = 0
        for users, books, ratings in train_loader:
            users, books, ratings = users.to(device), books.to(device), ratings.to(device)
            optimizer.zero_grad()
            preds = model(users, books)
            loss = criterion(preds, ratings)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {train_loss/len(train_loader):.4f}')
train_model()
# Evaluation
def evaluate():
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for users, books, ratings in test_loader:
            users, books, ratings = users.to(device), books.to(device), ratings.to(device)
            preds = model(users, books)
            test_loss += criterion(preds, ratings).item()
    print(f'Test Loss: {test_loss/len(test_loader):.4f}')
evaluate()
# Recommendation function
def make_recommendations_simple(user_id, n_recs=5):
    model.eval()
    all_books = torch.LongTensor(ratings['book_id'].unique()).to(device)
    user_tensor = torch.LongTensor([user_id] * len(all_books)).to(device)
    with torch.no_grad():
        preds = model(user_tensor, all_books)
    # Get top recommendations
    _, indices = torch.topk(preds, n_recs)
    recommended_book_ids = all_books[indices].cpu().numpy()
    # Get book details
    recommendations = books[books['id'].isin(recommended_book_ids)]
    return recommendations[['id', 'title', 'authors']]
def make_recommendations(user_id, n_recs=5):
    # Check for a saved model
    checkpoint_path = 'book_recommender_state.pt'
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        rec_model = BookRecommender(
            checkpoint['n_users'],
            checkpoint['n_books'],
            emb_dim=checkpoint['emb_dim']
        ).to(device)
        rec_model.load_state_dict(checkpoint['model_state_dict'])
    else:
        print("No saved model found. Training new model...")
        # train_model() updates the global `model` defined above, so reuse it
        # here instead of saving a fresh (untrained) instance.
        train_model()
        # Save after training
        torch.save({
            'model_state_dict': model.state_dict(),
            'n_users': n_users,
            'n_books': n_books,
            'emb_dim': 5
        }, checkpoint_path)
        rec_model = model
    rec_model.eval()
    # Generate recommendations
    all_books = torch.LongTensor(ratings['book_id'].unique()).to(device)
    user_tensor = torch.LongTensor([user_id] * len(all_books)).to(device)
    with torch.no_grad():
        preds = rec_model(user_tensor, all_books)
    _, indices = torch.topk(preds, n_recs)
    recommended_book_ids = all_books[indices].cpu().numpy()
    recommendations = books[books['id'].isin(recommended_book_ids)]
    return recommendations[['id', 'title', 'authors']]
# Get recommendations for user 314
print("\nTop 5 Recommendations for User 314:")
print(make_recommendations(314))
Key Differences from TensorFlow Version:

PyTorch Components:
- Uses `nn.Embedding` instead of Keras Embedding layers
- Implements a custom `Dataset` and `DataLoader` for batching
- Manual training loop with explicit gradient zeroing and backpropagation

Model Architecture:
- Same dot product approach, but implemented as `(u * b).sum(dim=1)`
- Embedding weights initialized with small random values

Training Process:
- Explicit batch processing
- Manual loss calculation and backpropagation
- Model modes (`train()` and `eval()`) for proper dropout/batch norm handling

Recommendation Function:
- Uses PyTorch's `topk()` for efficient recommendation selection
- Moves tensors to GPU if available
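The dot-product scoring can be checked in isolation. This sketch uses made-up sizes (10 users, 20 books, unrelated to the dataset) to show that `(u * b).sum(dim=1)` produces one score per user/book pair and matches the equivalent `einsum` form:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative sizes only: 10 users, 20 books, 5-dimensional embeddings
user_emb = nn.Embedding(10, 5)
book_emb = nn.Embedding(20, 5)

users = torch.LongTensor([1, 2, 3])  # batch of user indices
books = torch.LongTensor([4, 5, 6])  # matching book indices

u = user_emb(users)            # shape (3, 5)
b = book_emb(books)            # shape (3, 5)
scores = (u * b).sum(dim=1)    # shape (3,): one predicted rating per pair

# Same computation expressed with einsum
assert torch.allclose(scores, torch.einsum('bd,bd->b', u, b))
```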
How It Works:

Data Preparation:
- Uses the user/book IDs directly as embedding indices (hence the `+1` in the embedding sizes, since IDs start at 1)
- Splits data into train/test sets

Model Training:
- Learns embeddings that minimize rating prediction error
- Uses the Adam optimizer and MSE loss (same as the TF version)

Making Recommendations:
- For a given user, predicts ratings for all books
- Selects the top 5 highest predicted ratings
- Returns book details from books.csv
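The "select top 5" step comes down to a single `torch.topk` call. A toy example with hand-picked numbers (unrelated to the real dataset):

```python
import torch

# Predicted ratings for six candidate books and their IDs (made-up values)
preds = torch.tensor([3.2, 4.8, 1.1, 4.5, 2.0, 4.9])
book_ids = torch.tensor([10, 20, 30, 40, 50, 60])

# topk returns the k largest values and their positions
values, indices = torch.topk(preds, 3)
top_books = book_ids[indices]
print(top_books.tolist())  # [60, 20, 40] -- highest predicted ratings first
```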
This implementation maintains the same collaborative filtering approach but gives you more low-level control through PyTorch's imperative programming style. The recommendations will be similar in quality to the TensorFlow version.
In PyTorch you can save the trained model, but instead of the `.h5` format used by Keras/TensorFlow, PyTorch typically uses `.pt` or `.pth` file extensions. Here's how to modify the code to save and load the model:
Key Differences from TensorFlow's .h5:

File Formats:
- PyTorch: `.pt` or `.pth` (pickle-based)
- TensorFlow: `.h5` (HDF5-based)

Saving Options:
- Entire model: `torch.save(model, 'file.pt')` (like TF's `model.save()`)
- State dictionary: `model.state_dict()` (more flexible)

Loading Requirements:
- The model class definition is needed when loading a state_dict
- Call `model.eval()` before inference

Additional Info:
- PyTorch checkpoints often include optimizer state and other metadata
- A model saved on GPU can be loaded on CPU via the `map_location` parameter
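A minimal round trip of the state_dict approach, including `map_location`. `TinyModel` here is a stand-in class for illustration, not part of the recommender:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

model = TinyModel()
torch.save(model.state_dict(), 'tiny_model.pt')  # weights only, no class

# Loading needs the class definition; map_location lets a checkpoint
# saved on GPU be restored on a CPU-only machine.
restored = TinyModel()
state = torch.load('tiny_model.pt', map_location=torch.device('cpu'))
restored.load_state_dict(state)
restored.eval()  # switch to inference mode before predicting

assert torch.equal(model.fc.weight, restored.fc.weight)
```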
Best Practices:

- For production, save the `state_dict` rather than the entire model
- Include all necessary metadata (like `n_users`, `n_books`)
- Handle device mapping (GPU/CPU) when loading
- Use `model.eval()` before inference
This implementation gives you the same functionality as the TensorFlow version but with PyTorch's more flexible serialization approach.