Description¶

General Introduction

This project uses the data from LendingClub website data which was collected by Kaggle user Wendy Kan.

Wikipedia Introduction

https://www.wikiwand.com/en/LendingClub

LendingClub is an American peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform. The company claims that \$15.98 billion in loans had been originated through its platform up to December 31, 2015.

LendingClub enables borrowers to create unsecured personal loans between 1,000 and 40,000. The standard loan period is three years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

Dataset Introduction
These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

NOTE I am using 2007-2014 for training data and 2015 as test data.

Imports¶

%load_ext autoreload

%autoreload 2

# my personal library
from bhishan import bp

/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

import numpy as np
import pandas as pd
import seaborn as sns
pd.plotting.register_matplotlib_converters()

from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 


[(x.__name__,x.__version__) for x in [np,pd,sns]]

[('numpy', '1.18.1'), ('pandas', '1.0.1'), ('seaborn', '0.9.0')]

pd.options.display.max_columns=None

%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

import sys # sys.getsizeof(df)

Load the data¶

!du -sh ../data/raw/loan_data_2007_2014.csv

229M	../data/raw/loan_data_2007_2014.csv

# our data is small, we can use pandas instead of pyspark.

!head -1 ../data/raw/loan_data_2007_2014.csv

ifile = '../data/raw/loan_data_2007_2014.csv'
tmp = pd.read_csv(ifile,nrows=5)

print(tmp.shape)
tmp

(5, 75)

# for eda purpose, we dont need all columns.
# select only few features.

usecols = ['loan_amnt','funded_amnt','term',
          'int_rate','grade','emp_title',
           'emp_length', 'home_ownership', 'annual_inc',
           'issue_d', 'loan_status', 'pymnt_plan',
           'addr_state','dti','verification_status','purpose'
          ]

df = pd.read_csv(ifile,usecols=usecols)

df.head(2)

Data Visualization¶

Correlation heatmap¶

plt.figure(figsize=(20,20))
sns.set_context("paper", font_scale=1)


sns.heatmap(df.assign(
    grade=df.grade.astype('category').cat.codes,
    term=df.term.astype('category').cat.codes,
    emp_l=df.emp_length.astype('category').cat.codes,
    ver =df.verification_status.astype('category').cat.codes,           home=df.home_ownership.astype('category').cat.codes,                 purp=df.purpose.astype('category').cat.codes
).corr(), 
annot=True, cmap='bwr',vmin=-1,
            vmax=1, square=True,
            linewidths=0.5)

<matplotlib.axes._subplots.AxesSubplot at 0x122e3e6d8>

Continuous Variables¶

df.head(2)

bp.plot_num(df, 'loan_amnt')

bp.plot_num(df, 'funded_amnt')

bp.plot_num(df, 'annual_inc',xlim=[0,0.2e8])

bp.plot_num(df, 'dti')

Categorical Variables¶

df.select_dtypes('object').head(2)

df.select_dtypes('object').nunique().sort_values()

term                        2
pymnt_plan                  2
verification_status         3
home_ownership              6
grade                       7
loan_status                 9
emp_length                 11
purpose                    14
addr_state                 50
issue_d                    91
emp_title              205475
dtype: int64

bp.plot_cat(df,'term')

==================================================
Feature: **term**
Overall Count: 
     36 months: 72.48%
     60 months: 27.52%

bp.plot_cat(df, 'grade')

==================================================
Feature: **grade**
Overall Count: 
    B: 29.37%
    C: 26.87%
    D: 16.49%
    A: 16.06%
    E: 7.67%
    F: 2.84%
    G: 0.71%

Bi-variate Analysis¶

df['loan_status'].unique()

array(['Fully Paid', 'Charged Off', 'Current', 'Default',
       'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)

bad = ['Late (31-120 days)',
       'Charged Off',
       'Default',
       'Does not meet the credit policy. Status:Charged Off']
df['good_bad'] = np.where(df['loan_status'].isin(bad),0,1)

bp.plot_num_cat(df,'annual_inc','good_bad')

bp.plot_cat_num(df,'good_bad','annual_inc')

Map Visualization¶

df.head(1)

df1 = df.groupby('addr_state')['good_bad'].sum().sort_values().reset_index()
df1.head(2)

df1.plot.barh(x='addr_state',y='good_bad',figsize=(12,18))

<matplotlib.axes._subplots.AxesSubplot at 0x126baf0b8>

[i for i in dir(bp) if 'map' in i]

['get_mapbox_access_token',
 'map_plot_bokeh',
 'plotly_corr_heatmap',
 'plotly_mapbox',
 'plotly_usa_bubble_map',
 'plotly_usa_map',
 'plotly_usa_map2']

# help(bp.plotly_usa_map)

Help on function plotly_usa_map in module bhishan.plot_map:

plotly_usa_map(df, col_state, col_value, col_text=None, colorscale='Viridis', reversescale=False, title=None, width=800, height=800, ofile=None, show=True, auto_open=False)
    Plotly map plot for different states of USA.
    
    Parameters
    -----------
    df: pandas.DataFrame
        Input data.
    col_state: str
        Name of State column. eg. OH
    col_value: str
        Name of Value column. e.g. mean_salary
    col_text: str
        Name of Text column.
    colorscale: str
        Valid plotly colorscale name.
        The default colorscale is ‘RdBu’.
        https://plot.ly/python/builtin-colorscales/
        One of the following colorscales:
        ['Reds','Greens','Blues','Greys',
        'Viridis','Earth','Rainbow',
        'Blackbody','Bluered','Electric',
        'Hot','Jet',
        'Picnic','Portland',
        'RdBu','YlGnBu','YlOrRd']
    reversescale: bool
        Whether or not to reverse the colorscale.
    title: str
        Title of the plot.
    width: int
        Width of the map.
    height: int
        Height of the map.
    ofile: str
        Name of the output file.
    show: bool
        Whether or not to show the rendered html in notebook.
    auto_open: bool
        Whether or not to automatically open the ouput html file.
    
    Example 1
    -----------
    df = pd.DataFrame({'state': ['NY', 'OH','MI','CA','TX'],
                    'value': [100,200,300,400,500]})
    plotly_usa_map(df,'state','value')

bp.plotly_usa_map(df1,'addr_state','good_bad',colorscale='RdBu',
                 title='Number of good borrowers per State')

	Unnamed: 0	id	member_id	loan_amnt	funded_amnt	funded_amnt_inv	term	int_rate	installment	grade	sub_grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status	pymnt_plan	url	desc	purpose	title	zip_code	addr_state	dti	earliest_cr_line	inq_last_6mths	mths_since_last_delinq	mths_since_last_record	open_acc	revol_bal	revol_util	total_acc	initial_list_status	out_prncp	out_prncp_inv	total_pymnt	total_pymnt_inv	total_rec_prncp	total_rec_int	total_rec_late_fee	recoveries	collection_recovery_fee	last_pymnt_d	last_pymnt_amnt	next_pymnt_d	last_credit_pull_d	mths_since_last_major_derog	policy_code	application_type	annual_inc_joint	dti_joint	verification_status_joint	tot_coll_amt	tot_cur_bal	open_acc_6m	open_il_6m	open_il_12m	open_il_24m	mths_since_rcnt_il	total_bal_il	il_util	open_rv_12m	open_rv_24m	max_bal_bc	all_util	total_rev_hi_lim	inq_fi	total_cu_tl	inq_last_12m
0	0	1077501	1296599	5000	5000	4975.0	36 months	10.65	162.87	B	B2	NaN	10+ years	RENT	24000.0	Verified	Dec-11	Fully Paid	n	https://www.lendingclub.com/browse/loanDetail....	Borrower added on 12/22/11 > I need to upgra...	credit_card	Computer	860xx	AZ	27.65	Jan-85	1.0	NaN	NaN	3.0	13648	83.7	9.0	f	0.0	0.0	5861.071414	5831.78	5000.00	861.07	0.00	0.00	0.00	Jan-15	171.62	NaN	Jan-16	NaN	1	INDIVIDUAL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1	1077430	1314167	2500	2500	2500.0	60 months	15.27	59.83	C	C4	Ryder	< 1 year	RENT	30000.0	Source Verified	Dec-11	Charged Off	n	https://www.lendingclub.com/browse/loanDetail....	Borrower added on 12/22/11 > I plan to use t...	car	bike	309xx	GA	1.00	Apr-99	5.0	NaN	NaN	3.0	1687	9.4	4.0	f	0.0	0.0	1008.710000	1008.71	456.46	435.17	0.00	117.08	1.11	Apr-13	119.66	NaN	Sep-13	NaN	1	INDIVIDUAL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2	1077175	1313524	2400	2400	2400.0	36 months	15.96	84.33	C	C5	NaN	10+ years	RENT	12252.0	Not Verified	Dec-11	Fully Paid	n	https://www.lendingclub.com/browse/loanDetail....	NaN	small_business	real estate business	606xx	IL	8.72	Nov-01	2.0	NaN	NaN	2.0	2956	98.5	10.0	f	0.0	0.0	3003.653644	3003.65	2400.00	603.65	0.00	0.00	0.00	Jun-14	649.91	NaN	Jan-16	NaN	1	INDIVIDUAL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	3	1076863	1277178	10000	10000	10000.0	36 months	13.49	339.31	C	C1	AIR RESOURCES BOARD	10+ years	RENT	49200.0	Source Verified	Dec-11	Fully Paid	n	https://www.lendingclub.com/browse/loanDetail....	Borrower added on 12/21/11 > to pay for prop...	other	personel	917xx	CA	20.00	Feb-96	1.0	35.0	NaN	10.0	5598	21.0	37.0	f	0.0	0.0	12226.302210	12226.30	10000.00	2209.33	16.97	0.00	0.00	Jan-15	357.48	NaN	Jan-15	NaN	1	INDIVIDUAL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	4	1075358	1311748	3000	3000	3000.0	60 months	12.69	67.79	B	B5	University Medical Group	1 year	RENT	80000.0	Source Verified	Dec-11	Current	n	https://www.lendingclub.com/browse/loanDetail....	Borrower added on 12/21/11 > I plan on combi...	other	Personal	972xx	OR	17.94	Jan-96	0.0	38.0	NaN	15.0	27783	53.9	38.0	f	766.9	766.9	3242.170000	3242.17	2233.10	1009.07	0.00	0.00	0.00	Jan-16	67.79	Feb-16	Jan-16	NaN	1	INDIVIDUAL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	loan_amnt	funded_amnt	term	int_rate	grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status	pymnt_plan	purpose	addr_state	dti
0	5000	5000	36 months	10.65	B	NaN	10+ years	RENT	24000.0	Verified	Dec-11	Fully Paid	n	credit_card	AZ	27.65
1	2500	2500	60 months	15.27	C	Ryder	< 1 year	RENT	30000.0	Source Verified	Dec-11	Charged Off	n	car	GA	1.00

	loan_amnt	funded_amnt	term	int_rate	grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status	pymnt_plan	purpose	addr_state	dti
0	5000	5000	36 months	10.65	B	NaN	10+ years	RENT	24000.0	Verified	Dec-11	Fully Paid	n	credit_card	AZ	27.65
1	2500	2500	60 months	15.27	C	Ryder	< 1 year	RENT	30000.0	Source Verified	Dec-11	Charged Off	n	car	GA	1.00

	count	mean	std	min	25%	50%	75%	max
good_bad
good_bad_0	50968.0	65453.94	41188.09	2000.0	40693.86	56300.0	79000.0	1250000.0
good_bad_1	415313.0	74237.49	56348.16	1896.0	45000.00	64000.0	90000.0	7500000.0
annual_inc	466281.0	73277.38	54963.57	1896.0	45000.00	63000.0	88960.0	7500000.0

Table of Contents