General Introduction
This project uses loan data from the LendingClub website, collected and published on Kaggle by user Wendy Kan.
Wikipedia Introduction
LendingClub is an American peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform. The company claims that \$15.98 billion in loans had been originated through its platform up to December 31, 2015.
LendingClub enables borrowers to create unsecured personal loans between \$1,000 and \$40,000. The standard loan period is three years. Investors can search and browse the loan listings on the LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.
Dataset Introduction
These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.
NOTE: I am using the 2007-2014 loans as training data and the 2015 loans as test data.
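The year-based split described in the note can be sketched as follows. This is only an illustration on a toy frame: the files used in this notebook are already split, and the `Mon-YYYY` format assumed for `issue_d` (for example `Dec-2011`) should be checked against the raw data before reuse.

```python
import pandas as pd

# Toy stand-in for the raw loans; the 'Mon-YYYY' format of issue_d is an assumption.
loans = pd.DataFrame({
    'loan_amnt': [5000, 12000, 8000, 20000],
    'issue_d': ['Dec-2011', 'Jan-2014', 'Mar-2015', 'Nov-2015'],
})

# Parse the issue month and pull out the calendar year.
issue_year = pd.to_datetime(loans['issue_d'], format='%b-%Y').dt.year

# Loans issued up to 2014 go to training, 2015 loans go to testing.
train = loans[issue_year <= 2014]
test = loans[issue_year == 2015]

print(train.shape, test.shape)
```

Splitting on issue date rather than at random keeps the test set strictly later in time than the training set, which is closer to how a model would be used on new loans.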
%load_ext autoreload
%autoreload 2
# my personal library
from bhishan import bp
import numpy as np
import pandas as pd
import seaborn as sns
pd.plotting.register_matplotlib_converters()
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot')
[(x.__name__,x.__version__) for x in [np,pd,sns]]
pd.options.display.max_columns=None
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
import sys # sys.getsizeof(df)
!du -sh ../data/raw/loan_data_2007_2014.csv
# our data is small, we can use pandas instead of pyspark.
!head -1 ../data/raw/loan_data_2007_2014.csv
ifile = '../data/raw/loan_data_2007_2014.csv'
tmp = pd.read_csv(ifile,nrows=5)
print(tmp.shape)
tmp
# For EDA purposes we don't need all columns,
# so read only a few selected features.
usecols = ['loan_amnt','funded_amnt','term',
'int_rate','grade','emp_title',
'emp_length', 'home_ownership', 'annual_inc',
'issue_d', 'loan_status', 'pymnt_plan',
'addr_state','dti','verification_status','purpose'
]
df = pd.read_csv(ifile,usecols=usecols)
df.head(2)
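To see what reading only selected columns does, here is a small sketch on an in-memory CSV. The column names and values in the toy CSV are made up for illustration; only `usecols` and `memory_usage(deep=True)` are the real pandas features being shown.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the raw file (values are made up).
csv_text = (
    "id,loan_amnt,term,url,desc\n"
    "1,5000, 36 months,http://example.com/1,some long text\n"
    "2,12000, 60 months,http://example.com/2,more long text\n"
)

# Read everything, then read only the columns needed for EDA.
full = pd.read_csv(io.StringIO(csv_text))
small = pd.read_csv(io.StringIO(csv_text), usecols=['loan_amnt', 'term'])

# deep=True counts the actual string payloads of object columns.
full_bytes = full.memory_usage(deep=True).sum()
small_bytes = small.memory_usage(deep=True).sum()

print(full.shape, small.shape)
print(full_bytes, small_bytes)
```

On the real file the saving should be larger, since free-text columns tend to dominate memory, but I have not measured it here.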
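plt.figure(figsize=(20,20))
sns.set_context("paper", font_scale=1)
# Encode categorical columns as integer codes, then correlate numeric columns only.
# select_dtypes avoids the error that newer pandas versions raise when corr()
# meets object columns such as emp_title or addr_state.
sns.heatmap(df.assign(
    grade=df.grade.astype('category').cat.codes,
    term=df.term.astype('category').cat.codes,
    emp_l=df.emp_length.astype('category').cat.codes,
    ver=df.verification_status.astype('category').cat.codes,
    home=df.home_ownership.astype('category').cat.codes,
    purp=df.purpose.astype('category').cat.codes
    ).select_dtypes('number').corr(),
    annot=True, cmap='bwr', vmin=-1,
    vmax=1, square=True,
    linewidths=0.5)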
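One caveat about the category codes used in the heatmap: `cat.codes` numbers the categories in sorted order and uses -1 for missing values, so for a column like `emp_length` the codes need not follow the natural ordering. A minimal sketch with made-up values (the explicit `order` list is my own illustration, not something taken from the dataset):

```python
import pandas as pd

# Made-up employment lengths, including a missing value.
emp_length = pd.Series(['2 years', '10+ years', None, '< 1 year', '2 years'])

# Default behaviour: categories are sorted as strings, missing values get code -1.
cat = emp_length.astype('category')
default_categories = list(cat.cat.categories)
default_codes = cat.cat.codes.tolist()

# Explicit ordering, so that a larger code really means a longer employment length.
order = ['< 1 year', '2 years', '10+ years']
ordered = pd.Categorical(emp_length, categories=order, ordered=True)
ordered_codes = ordered.codes.tolist()

print(default_categories, default_codes)
print(ordered_codes)
```

With the default codes, `'10+ years'` sorts before `'2 years'`, so correlations involving such a column should be read with care. For nominal columns such as `purpose` the codes carry no order at all.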
df.head(2)
bp.plot_num(df, 'loan_amnt')
bp.plot_num(df, 'funded_amnt')
bp.plot_num(df, 'annual_inc',xlim=[0,0.2e8])
bp.plot_num(df, 'dti')
df.select_dtypes('object').head(2)
df.select_dtypes('object').nunique().sort_values()
bp.plot_cat(df,'term')
bp.plot_cat(df, 'grade')
df['loan_status'].unique()
bad = ['Late (31-120 days)',
'Charged Off',
'Default',
'Does not meet the credit policy. Status:Charged Off']
df['good_bad'] = np.where(df['loan_status'].isin(bad),0,1)
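As a quick sanity check of the flag, the same rule can be run on a handful of made-up statuses. The `toy` frame below is purely illustrative; the `bad` list is the one defined above.

```python
import numpy as np
import pandas as pd

# Same list of "bad" statuses as in the cell above.
bad = ['Late (31-120 days)',
       'Charged Off',
       'Default',
       'Does not meet the credit policy. Status:Charged Off']

# Made-up statuses for illustration only.
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                                    'Default', 'Late (16-30 days)']})

# 0 = bad borrower, 1 = good borrower (anything not in the bad list).
toy['good_bad'] = np.where(toy['loan_status'].isin(bad), 0, 1)

# Share of bad loans in the toy data.
bad_rate = 1 - toy['good_bad'].mean()
print(toy)
print(bad_rate)
```

Note that any status not in the `bad` list, including a shorter late period such as the one in the toy data, ends up labelled as good under this rule.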
bp.plot_num_cat(df,'annual_inc','good_bad')
bp.plot_cat_num(df,'good_bad','annual_inc')
df.head(1)
df1 = df.groupby('addr_state')['good_bad'].sum().sort_values().reset_index()
df1.head(2)
df1.plot.barh(x='addr_state',y='good_bad',figsize=(12,18))
[i for i in dir(bp) if 'map' in i]
# help(bp.plotly_usa_map)
bp.plotly_usa_map(df1,'addr_state','good_bad',colorscale='RdBu',
title='Number of good borrowers per State')
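The map above shows counts, so states with more loans will naturally show more good borrowers. A possible complement, sketched here on made-up data, is to look at the share of good borrowers per state next to the raw count. The column names `n_good`, `n_loans` and `good_rate` are my own choices for this sketch.

```python
import pandas as pd

# Made-up example: a large state with mixed outcomes and a small state with good ones.
toy = pd.DataFrame({
    'addr_state': ['CA', 'CA', 'CA', 'CA', 'WY', 'WY'],
    'good_bad':   [1, 1, 0, 0, 1, 1],
})

# Count of good borrowers, total loans, and share of good borrowers per state.
by_state = (toy.groupby('addr_state')['good_bad']
               .agg(n_good='sum', n_loans='count', good_rate='mean')
               .reset_index())

print(by_state)
```

In this toy example both states have the same number of good borrowers, but very different rates, which is the kind of difference a count-only map would hide.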