Table of Contents

Data Description

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Task: Try to estimate the price based on given features.

Imports

Important Scripts

Parameters

Load the data

Data Processing

Log transform Target

Train-target split

Train-Validation Split

Scaling

Modelling: catboost

class CatBoostRegressor(
iterations=None,
learning_rate=None,
loss_function='RMSE',
use_best_model=None,
verbose=None,
silent=None,
logging_level=None,
one_hot_max_size=None,
ignored_features=None,
train_dir=None,
custom_metric=None,
eval_metric=None,
subsample=None,
max_depth=None,
n_estimators=None,
num_boost_round=None,
num_trees=None,
colsample_bylevel=None,
random_state=None,
reg_lambda=None,
objective=None,
eta=None,
max_bin=None,
early_stopping_rounds=None,
cat_features=None,
min_child_samples=None,
max_leaves=None,
num_leaves=None,
score_function=None,
)

Baseline model

Catboost with validation set

catboost built-in categorical features

Feature Statistics

Feature Importance

Metric Visualizer (only works in notebook, not jupyterlab or colab)

import catboost
from catboost import CatBoostClassifier

# part 1: fit the model
cat_features = [0,1,2]

train_data = [["a", "b", 1, 4, 5, 6],
              ["a", "b", 4, 5, 6, 7],
              ["c", "d", 30, 40, 50, 60]]

train_labels = [1,1,0]

model = CatBoostClassifier(iterations=20, 
                           loss_function = "CrossEntropy", 
                           train_dir = "crossentropy")

model.fit(train_data, train_labels, cat_features)
predictions = model.predict(train_data)


# part 2: visualize
w = catboost.MetricVisualizer('/crossentropy/')
w.start()

Part 1 works in google colab and gives some files in the directory crossentroy but part2 keeps running for infinite time.

Model Evaluation Using shap

This plot is made of many dots. Each dot has three characteristics:

Time Taken