01: Connect to Kaggle to download data

  • Download kaggle.json from your Kaggle account settings (API section, "Create New API Token")
In [1]:
# Install Kaggle API
!pip install -q kaggle
In [15]:
from google.colab import files
files.upload()
Saving kaggle.json to kaggle.json
Out[15]:
{'kaggle.json': b'{"username":"bhishanpdl","key":"<REDACTED>"}'}
In [16]:
!ls
kaggle.json  sample_data
In [17]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
In [19]:
!kaggle datasets download -d zygmunt/goodbooks-10k --unzip
Dataset URL: https://www.kaggle.com/datasets/zygmunt/goodbooks-10k
License(s): CC-BY-SA-4.0
Downloading goodbooks-10k.zip to /content
  0% 0.00/11.6M [00:00<?, ?B/s]
100% 11.6M/11.6M [00:00<00:00, 1.08GB/s]
In [20]:
!ls
books.csv      ratings.csv	sample_data  to_read.csv
book_tags.csv  sample_book.xml	tags.csv
In [22]:
# We use only books.csv and ratings.csv below; the tag files are not needed
In [23]:
!head -n 4 ratings.csv
book_id,user_id,rating
1,314,5
1,439,3
1,588,5
In [24]:
!head -n 4 books.csv
id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
1,2767052,2767052,2792775,272,439023483,9.78043902348e+12,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m/2767052.jpg,https://images.gr-assets.com/books/1447303603s/2767052.jpg
2,3,3,4640799,491,439554934,9.78043955493e+12,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m/3.jpg,https://images.gr-assets.com/books/1474154022s/3.jpg
3,41865,41865,3212258,226,316015849,9.78031601584e+12,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m/41865.jpg,https://images.gr-assets.com/books/1361039443s/41865.jpg
In [25]:
!head -n 4 tags.csv
tag_id,tag_name
0,-
1,--1-
2,--10-
In [26]:
!head -n 4 book_tags.csv
goodreads_book_id,tag_id,count
1,30574,167697
1,11305,37174
1,11557,34173

02: Data Loading, Model Training, and Recommendations
In [28]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import os

# Load data
ratings = pd.read_csv('ratings.csv')
books = pd.read_csv('books.csv')

# Prepare data
# goodbooks-10k user/book IDs are contiguous integers starting at 1, so
# nunique() equals the maximum ID and raw IDs can index the embeddings directly
n_users = ratings['user_id'].nunique()
n_books = ratings['book_id'].nunique()

# Split data
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# Create Dataset class
class BookRatingDataset(Dataset):
    def __init__(self, df):
        self.users = torch.LongTensor(df['user_id'].values)
        self.books = torch.LongTensor(df['book_id'].values)
        self.ratings = torch.FloatTensor(df['rating'].values)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.books[idx], self.ratings[idx]

# Create model
class BookRecommender(nn.Module):
    def __init__(self, n_users, n_books, emb_dim=5):
        super().__init__()
        self.user_emb = nn.Embedding(n_users+1, emb_dim)  # +1 because IDs start at 1 (index 0 is unused)
        self.book_emb = nn.Embedding(n_books+1, emb_dim)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.book_emb.weight.data.uniform_(0, 0.05)

    def forward(self, users, books):
        u = self.user_emb(users)
        b = self.book_emb(books)
        return (u * b).sum(dim=1)  # Dot product

# Initialize
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BookRecommender(n_users, n_books).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Create data loaders
train_dataset = BookRatingDataset(train)
test_dataset = BookRatingDataset(test)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Training
def train_model(epochs=5):
    model.train()
    for epoch in range(epochs):
        train_loss = 0
        for users, books, ratings in train_loader:
            users, books, ratings = users.to(device), books.to(device), ratings.to(device)

            optimizer.zero_grad()
            preds = model(users, books)
            loss = criterion(preds, ratings)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        print(f'Epoch {epoch+1}, Loss: {train_loss/len(train_loader):.4f}')

train_model()

# Evaluation
def evaluate():
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for users, books, ratings in test_loader:
            users, books, ratings = users.to(device), books.to(device), ratings.to(device)
            preds = model(users, books)
            test_loss += criterion(preds, ratings).item()
    print(f'Test Loss: {test_loss/len(test_loader):.4f}')

evaluate()

# Recommendation function
def make_recommendations_simple(user_id, n_recs=5):
    model.eval()
    # Score every candidate book for this user
    # (note: this also scores books the user has already rated)
    all_books = torch.LongTensor(ratings['book_id'].unique()).to(device)
    user_tensor = torch.LongTensor([user_id] * len(all_books)).to(device)

    with torch.no_grad():
        preds = model(user_tensor, all_books)

    # Get top recommendations
    _, indices = torch.topk(preds, n_recs)
    recommended_book_ids = all_books[indices].cpu().numpy()

    # Look up book details, preserving the predicted-rating order
    recommendations = books.set_index('id').loc[recommended_book_ids].reset_index()
    return recommendations[['id', 'title', 'authors']]

def make_recommendations(user_id, n_recs=5):
    global model  # this function may replace or train the global model
    # Check for saved model
    checkpoint_path = 'book_recommender_state.pt'
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location=device)
        model = BookRecommender(
            checkpoint['n_users'],
            checkpoint['n_books'],
            emb_dim=checkpoint['emb_dim']
        ).to(device)
        model.load_state_dict(checkpoint['model_state_dict'])
    else:
        print("No saved model found. Training new model...")
        # train_model() trains the global model initialized above;
        # creating a fresh local model here would leave it untrained
        train_model()
        # Save after training
        torch.save({
            'model_state_dict': model.state_dict(),
            'n_users': n_users,
            'n_books': n_books,
            'emb_dim': 5
        }, checkpoint_path)
    model.eval()

    # Generate recommendations
    all_books = torch.LongTensor(ratings['book_id'].unique()).to(device)
    user_tensor = torch.LongTensor([user_id] * len(all_books)).to(device)

    with torch.no_grad():
        preds = model(user_tensor, all_books)

    _, indices = torch.topk(preds, n_recs)
    recommended_book_ids = all_books[indices].cpu().numpy()

    # Return book details in predicted-rating order
    recommendations = books.set_index('id').loc[recommended_book_ids].reset_index()
    return recommendations[['id', 'title', 'authors']]

# Get recommendations for user 314
print("\nTop 5 Recommendations for User 314:")
print(make_recommendations(314))
Epoch 1, Loss: 10.2498
Epoch 2, Loss: 2.4587
Epoch 3, Loss: 1.3194
(Training was interrupted manually; the resulting KeyboardInterrupt traceback is omitted.)

Key Differences from TensorFlow Version:

  1. PyTorch Components:

    • Uses nn.Embedding instead of Keras Embedding layers (see the hypothetical Keras sketch after this list)
    • Implements a custom Dataset and uses PyTorch's built-in DataLoader for batching
    • Manual training loop with explicit gradient zeroing and backpropagation
  2. Model Architecture:

    • Same dot product approach but implemented as (u * b).sum(dim=1)
    • Embedding weights initialized with small random values
  3. Training Process:

    • Explicit batch processing
    • Manual loss calculation and backpropagation
    • Model modes (train() and eval()); not strictly needed here since the model has no dropout or batch norm, but good practice
  4. Recommendation Function:

    • Uses PyTorch's topk() for efficient recommendation selection
    • Moves tensors to GPU if available
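
The TensorFlow version referenced throughout this section is not included in this notebook. As a rough point of comparison only, a minimal Keras dot-product model of the same shape might look like the following (a hypothetical sketch, not the original code; it reuses n_users and n_books from above):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Two integer inputs: one user ID and one book ID per example
    user_in = layers.Input(shape=(1,), name='user')
    book_in = layers.Input(shape=(1,), name='book')

    # Embedding lookups, mirroring nn.Embedding(n_users+1, 5) above
    u = layers.Embedding(n_users + 1, 5)(user_in)   # (batch, 1, 5)
    b = layers.Embedding(n_books + 1, 5)(book_in)   # (batch, 1, 5)

    # Dot product of the two embeddings is the predicted rating
    out = layers.Flatten()(layers.Dot(axes=2)([u, b]))

    tf_model = Model([user_in, book_in], out)
    tf_model.compile(optimizer='adam', loss='mse')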

How It Works:

  1. Data Preparation:

    • Uses the raw user/book IDs directly as embedding indices (goodbooks-10k IDs are contiguous from 1; see the mapping sketch after this list for the general case)
    • Splits data into train/test sets
  2. Model Training:

    • Learns embeddings that minimize rating prediction error
    • Uses Adam optimizer and MSE loss (same as TF version)
  3. Making Recommendations:

    • For a given user, predicts ratings for all books
    • Selects top 5 highest predicted ratings
    • Returns book details from the books.csv file
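
One caveat on data preparation: this notebook never builds an explicit ID-to-index mapping, because the goodbooks-10k IDs happen to be contiguous integers starting at 1 and can index the embedding tables directly. For datasets with arbitrary IDs, a minimal sketch of the mapping step (the user2idx/book2idx names are illustrative) would be:

    # Map arbitrary IDs to dense indices 0..n-1 for the embedding tables
    user2idx = {u: i for i, u in enumerate(ratings['user_id'].unique())}
    book2idx = {b: i for i, b in enumerate(ratings['book_id'].unique())}
    ratings['user_idx'] = ratings['user_id'].map(user2idx)
    ratings['book_idx'] = ratings['book_id'].map(book2idx)
    # Embedding sizes would then be len(user2idx) and len(book2idx), with no +1 padding needed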

This implementation maintains the same collaborative filtering approach but gives you more low-level control through PyTorch's imperative programming style. The recommendations should be comparable in quality to the TensorFlow version.

In PyTorch you can save the trained model, but instead of the .h5 format used by Keras/TensorFlow, PyTorch typically uses .pt or .pth file extensions. Here's how to modify the code to save and load the model:
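
Distilled from the make_recommendations function above, the save/load pattern is:

    # Save: the state dict plus the metadata needed to rebuild the model
    torch.save({
        'model_state_dict': model.state_dict(),
        'n_users': n_users,
        'n_books': n_books,
        'emb_dim': 5
    }, 'book_recommender_state.pt')

    # Load: rebuild the architecture first, then restore the weights
    checkpoint = torch.load('book_recommender_state.pt', map_location=device)
    model = BookRecommender(checkpoint['n_users'], checkpoint['n_books'],
                            emb_dim=checkpoint['emb_dim']).to(device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()  # switch to inference mode before predicting

Saving the metadata alongside the weights is what lets the loading code reconstruct the model without access to the original training session.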

Key Differences from TensorFlow's .h5:

  1. File Formats:

    • PyTorch: .pt or .pth (pickle-based)
    • TensorFlow: .h5 (HDF5-based)
  2. Saving Options:

    • Entire model: torch.save(model, 'file.pt') (like TF's model.save())
    • State dictionary: model.state_dict() (more flexible)
  3. Loading Requirements:

    • Need the model class definition when loading state_dict
    • Need to call model.eval() for inference
  4. Additional Info:

    • PyTorch often saves optimizer state and other metadata
    • Can save on GPU and load on CPU with map_location parameter

Best Practices:

  1. For production, save state_dict rather than entire model
  2. Include all necessary metadata (like n_users, n_books)
  3. Handle device mapping (GPU/CPU) when loading via map_location (see the sketch below)
  4. Use model.eval() before inference
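
For point 3, a short sketch of loading the checkpoint above on a CPU-only machine, even if it was saved from a GPU session (cpu_model is just an illustrative name):

    # map_location remaps GPU-saved tensors onto the CPU at load time
    checkpoint = torch.load('book_recommender_state.pt',
                            map_location=torch.device('cpu'))
    cpu_model = BookRecommender(checkpoint['n_users'], checkpoint['n_books'],
                                emb_dim=checkpoint['emb_dim'])
    cpu_model.load_state_dict(checkpoint['model_state_dict'])
    cpu_model.eval()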

This implementation gives you the same functionality as the TensorFlow version but with PyTorch's more flexible serialization approach.
