Data Description

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Imports

Useful Scripts

Load the data

Data Processing and Feature Engineering

Date time features
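
As a minimal sketch of this step, the snippet below parses a sale date and derives year, month, and day-of-week features. The column name `date`, its `YYYYMMDDT000000` string format, and the derived column names (`yr_sales`, `month_sales`, `dayofweek_sales`) are assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

# Toy frame; the `date` column name and string format are assumptions.
df = pd.DataFrame({"date": ["20141013T000000", "20150225T000000"]})

# Parse the raw strings into proper datetimes.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")

# Derive calendar features (hypothetical column names).
df["yr_sales"] = df["date"].dt.year
df["month_sales"] = df["date"].dt.month
df["dayofweek_sales"] = df["date"].dt.dayofweek  # Monday = 0
```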

Categorical Features

Ref: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
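
A possible way to apply the categorical dtype described in the reference above is sketched here. The column names `zipcode` and `condition`, and the assumption that `condition` is an ordered 1 to 5 scale, are illustrative only.

```python
import pandas as pd

# Toy frame; column names and the 1-5 scale are assumptions.
df = pd.DataFrame({"zipcode": [98178, 98125, 98178],
                   "condition": [3, 5, 4]})

# Unordered categorical: the values are labels, not quantities.
df["zipcode"] = df["zipcode"].astype("category")

# Ordered categorical: the levels have a meaningful order.
cond_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["condition"] = df["condition"].astype(cond_type)
```

An ordered categorical supports comparisons and `min`/`max`, whereas an unordered one does not.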

Boolean data types

Numerical features binning

Create dummy variables from object and categories
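
One way this step might look is sketched below: select the object and category columns, then one-hot encode them with `pd.get_dummies`. The toy columns (`price`, `waterfront`, `age_cat`) and the choice of `drop_first=True` are assumptions, not necessarily what the original notebook does.

```python
import pandas as pd

# Toy frame with one category column and one object column (assumed names).
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "waterfront": pd.Categorical([0, 1, 0]),
    "age_cat": pd.Series(["new", "old", "new"], dtype=object),
})

# Pick up every object/category column automatically.
cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# drop_first=True drops one level per feature to avoid perfect collinearity.
df_encoded = pd.get_dummies(df, columns=cols, drop_first=True)
```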

Log transformation of large numerical values

Log transformations make right-skewed features more Gaussian-like, which may improve the performance of linear models.

bedrooms, bathrooms, floors, waterfront, view, condition, and grade are categorical columns, but we can also use them as numerical columns to see how the model performs.

lat and long are geographic coordinates; they can also be treated as categorical, although they are sometimes used as numerical data.

basement = living - above, so the basement column is redundant and may be kept or dropped. This is a feature engineering choice.
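
The redundancy can be checked before deciding whether to drop the column. The sketch below assumes the columns are named `sqft_living`, `sqft_above`, and `sqft_basement`; the values are toy data.

```python
import pandas as pd

# Toy rows; column names are assumptions.
df = pd.DataFrame({"sqft_living": [1180, 2570, 770],
                   "sqft_above": [1180, 2170, 770],
                   "sqft_basement": [0, 400, 0]})

# True only if basement == living - above on every row.
is_redundant = bool(
    ((df["sqft_living"] - df["sqft_above"]) == df["sqft_basement"]).all()
)

# Dropping the column is optional; this shows one possible choice.
df_reduced = df.drop(columns=["sqft_basement"]) if is_redundant else df
```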

Year columns can be converted to age columns, and the ages can be binned; however, they can also be treated as numbers and used in the model directly.
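
A rough sketch of the year-to-age conversion and binning follows. The column names (`yr_sales`, `yr_built`, `age`, `age_cat`) and the bin edges and labels are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows; column names are assumptions.
df = pd.DataFrame({"yr_sales": [2014, 2015, 2014],
                   "yr_built": [1955, 2009, 1900]})

# Age of the house at the time of sale.
df["age"] = df["yr_sales"] - df["yr_built"]

# Hypothetical bin edges; intervals are right-closed: (-1, 10], (10, 50], ...
bins = [-1, 10, 50, 100, 200]
labels = ["0-10", "11-50", "51-100", "101+"]
df["age_cat"] = pd.cut(df["age"], bins=bins, labels=labels)
```

Values outside the outermost edges become NaN, so the edges should be chosen to cover the observed range.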

Features with large values can be log1p transformed.
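
A minimal sketch of the log1p transformation is shown here. The column names and the `log1p_` prefix are assumptions. `np.log1p(x)` computes `log(1 + x)`, so zero values (for example, a house with no basement) map to zero instead of negative infinity.

```python
import numpy as np
import pandas as pd

# Toy rows; column names are assumptions.
df = pd.DataFrame({"price": [221900.0, 538000.0],
                   "sqft_lot": [5650, 7242],
                   "sqft_basement": [0, 400]})

cols_log = ["price", "sqft_lot", "sqft_basement"]
for col in cols_log:
    # Keep the original column and add a transformed copy.
    df["log1p_" + col] = np.log1p(df[col])
```

If the target (such as price) is log1p transformed, predictions can be mapped back to the original scale with `np.expm1`.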

Drop unwanted columns

Save clean data

For Modelling

Time Taken