Table of Contents

Data Description

This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015.

Task: Estimate the sale price of a house from the given features.

Imports

Useful Scripts

Load the data

EDA

Feature Engineering

Temporal Variables

Categorical Features

Boolean Features

Bucketizing Numerical Features
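
As a rough illustration of what bucketizing a numerical feature means, the sketch below maps each value to the index of the half-open interval that contains it. It is written in plain Python rather than Spark so that it runs on its own. The function name `bucketize`, the split points, and the sample years are hypothetical and are not taken from the notebook.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index i of the bucket [splits[i], splits[i + 1]) holding value.

    `splits` must be sorted in increasing order. Values outside the
    covered range raise ValueError instead of being silently clamped.
    """
    if value < splits[0] or value >= splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    return bisect_right(splits, value) - 1


# Hypothetical split points for a "year built" style feature.
year_splits = [1900, 1950, 2000, 2020]
built_years = [1925, 1950, 1999, 2014]

buckets = [bucketize(year, year_splits) for year in built_years]
print(buckets)
```

Each value on a split boundary falls into the bucket that starts at that boundary, which is why 1950 lands in the second bucket rather than the first.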

Dummy Variables for Binned and Categorical Features

Log Transform Large Numerical Features
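
A minimal sketch of a log transform for large, right-skewed values such as prices, again in plain Python rather than Spark. It uses `log1p` (the log of 1 + x) so that zero values stay defined, and `expm1` to map back to the original scale. The function name and the sample prices are hypothetical; the notebook may use a different variant of the transform.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to each value, compressing large magnitudes."""
    return [math.log1p(v) for v in values]


# Hypothetical prices spanning several orders of magnitude.
prices = [0.0, 250_000.0, 7_700_000.0]

log_prices = log_transform(prices)

# expm1 is the inverse of log1p, so predictions made on the log scale
# can be converted back to the original units.
recovered = [math.expm1(v) for v in log_prices]

print(log_prices)
print(recovered)
```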

Combine All Features to One Column

Feature Scaling

Train Test Splitting

Modelling

Linear Regression

Model Evaluation

Model Predictions

Model Evaluation Metrics
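
The sketch below computes three common regression metrics by hand: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). It is a plain Python illustration of the definitions, not the notebook's own evaluation code. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 compares the residual error with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical true and predicted prices.
y_true = [200_000.0, 300_000.0, 400_000.0]
y_pred = [210_000.0, 290_000.0, 400_000.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.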

Model Evaluation using ml.evaluation

Reference (pyspark.ml.evaluation module source): https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/evaluation.html

Elastic Net Regression

Decision Tree Regressor

Random Forest Regressor

Parameter Tuning

Feature Importances