Description

General Introduction

This project uses the data from LendingClub website data which was collected by Kaggle user Wendy Kan.

Wikipedia Introduction

LendingClub is an American peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform. The company claims that \$15.98 billion in loans had been originated through its platform up to December 31, 2015.

LendingClub enables borrowers to create unsecured personal loans between 1,000 and 40,000. The standard loan period is three years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

Dataset Introduction
These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

NOTE I am using 2007-2014 for training data and 2015 as test data.

Imports

In [5]:
%load_ext autoreload
In [6]:
%autoreload 2
In [7]:
# my personal library
from bhishan import bp
/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
In [ ]:
 
In [8]:
import numpy as np
import pandas as pd
import seaborn as sns
pd.plotting.register_matplotlib_converters()

from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 


[(x.__name__,x.__version__) for x in [np,pd,sns]]
Out[8]:
[('numpy', '1.18.1'), ('pandas', '1.0.1'), ('seaborn', '0.9.0')]
In [9]:
pd.options.display.max_columns=None
In [10]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
In [11]:
import sys # sys.getsizeof(df)

Load the data

In [12]:
!du -sh ../data/raw/loan_data_2007_2014.csv
229M	../data/raw/loan_data_2007_2014.csv
In [13]:
# our data is small, we can use pandas instead of pyspark.
In [14]:
!head -1 ../data/raw/loan_data_2007_2014.csv

In [18]:
ifile = '../data/raw/loan_data_2007_2014.csv'
tmp = pd.read_csv(ifile,nrows=5)

print(tmp.shape)
tmp
(5, 75)
Out[18]:
Unnamed: 0 id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_il_6m open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 0 1077501 1296599 5000 5000 4975.0 36 months 10.65 162.87 B B2 NaN 10+ years RENT 24000.0 Verified Dec-11 Fully Paid n https://www.lendingclub.com/browse/loanDetail.... Borrower added on 12/22/11 > I need to upgra... credit_card Computer 860xx AZ 27.65 0.0 Jan-85 1.0 NaN NaN 3.0 0.0 13648 83.7 9.0 f 0.0 0.0 5861.071414 5831.78 5000.00 861.07 0.00 0.00 0.00 Jan-15 171.62 NaN Jan-16 0.0 NaN 1 INDIVIDUAL NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1077430 1314167 2500 2500 2500.0 60 months 15.27 59.83 C C4 Ryder < 1 year RENT 30000.0 Source Verified Dec-11 Charged Off n https://www.lendingclub.com/browse/loanDetail.... Borrower added on 12/22/11 > I plan to use t... car bike 309xx GA 1.00 0.0 Apr-99 5.0 NaN NaN 3.0 0.0 1687 9.4 4.0 f 0.0 0.0 1008.710000 1008.71 456.46 435.17 0.00 117.08 1.11 Apr-13 119.66 NaN Sep-13 0.0 NaN 1 INDIVIDUAL NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 1077175 1313524 2400 2400 2400.0 36 months 15.96 84.33 C C5 NaN 10+ years RENT 12252.0 Not Verified Dec-11 Fully Paid n https://www.lendingclub.com/browse/loanDetail.... NaN small_business real estate business 606xx IL 8.72 0.0 Nov-01 2.0 NaN NaN 2.0 0.0 2956 98.5 10.0 f 0.0 0.0 3003.653644 3003.65 2400.00 603.65 0.00 0.00 0.00 Jun-14 649.91 NaN Jan-16 0.0 NaN 1 INDIVIDUAL NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 1076863 1277178 10000 10000 10000.0 36 months 13.49 339.31 C C1 AIR RESOURCES BOARD 10+ years RENT 49200.0 Source Verified Dec-11 Fully Paid n https://www.lendingclub.com/browse/loanDetail.... Borrower added on 12/21/11 > to pay for prop... other personel 917xx CA 20.00 0.0 Feb-96 1.0 35.0 NaN 10.0 0.0 5598 21.0 37.0 f 0.0 0.0 12226.302210 12226.30 10000.00 2209.33 16.97 0.00 0.00 Jan-15 357.48 NaN Jan-15 0.0 NaN 1 INDIVIDUAL NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 1075358 1311748 3000 3000 3000.0 60 months 12.69 67.79 B B5 University Medical Group 1 year RENT 80000.0 Source Verified Dec-11 Current n https://www.lendingclub.com/browse/loanDetail.... Borrower added on 12/21/11 > I plan on combi... other Personal 972xx OR 17.94 0.0 Jan-96 0.0 38.0 NaN 15.0 0.0 27783 53.9 38.0 f 766.9 766.9 3242.170000 3242.17 2233.10 1009.07 0.00 0.00 0.00 Jan-16 67.79 Feb-16 Jan-16 0.0 NaN 1 INDIVIDUAL NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [20]:
# for eda purpose, we dont need all columns.
# select only few features.

usecols = ['loan_amnt','funded_amnt','term',
          'int_rate','grade','emp_title',
           'emp_length', 'home_ownership', 'annual_inc',
           'issue_d', 'loan_status', 'pymnt_plan',
           'addr_state','dti','verification_status','purpose'
          ]

df = pd.read_csv(ifile,usecols=usecols)

df.head(2)
Out[20]:
loan_amnt funded_amnt term int_rate grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose addr_state dti
0 5000 5000 36 months 10.65 B NaN 10+ years RENT 24000.0 Verified Dec-11 Fully Paid n credit_card AZ 27.65
1 2500 2500 60 months 15.27 C Ryder < 1 year RENT 30000.0 Source Verified Dec-11 Charged Off n car GA 1.00

Data Visualization

Correlation heatmap

In [21]:
plt.figure(figsize=(20,20))
sns.set_context("paper", font_scale=1)


sns.heatmap(df.assign(
    grade=df.grade.astype('category').cat.codes,
    term=df.term.astype('category').cat.codes,
    emp_l=df.emp_length.astype('category').cat.codes,
    ver =df.verification_status.astype('category').cat.codes,           home=df.home_ownership.astype('category').cat.codes,                 purp=df.purpose.astype('category').cat.codes
).corr(), 
annot=True, cmap='bwr',vmin=-1,
            vmax=1, square=True,
            linewidths=0.5)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x122e3e6d8>

Continuous Variables

In [22]:
df.head(2)
Out[22]:
loan_amnt funded_amnt term int_rate grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose addr_state dti
0 5000 5000 36 months 10.65 B NaN 10+ years RENT 24000.0 Verified Dec-11 Fully Paid n credit_card AZ 27.65
1 2500 2500 60 months 15.27 C Ryder < 1 year RENT 30000.0 Source Verified Dec-11 Charged Off n car GA 1.00
In [23]:
bp.plot_num(df, 'loan_amnt')
count mean std min 25% 50% 75% max
loan_amnt 466285.0 14317.277577 8286.509164 500.0 8000.0 12000.0 20000.0 35000.0
In [24]:
bp.plot_num(df, 'funded_amnt')
count mean std min 25% 50% 75% max
funded_amnt 466285.0 14291.801044 8274.3713 500.0 8000.0 12000.0 20000.0 35000.0
In [25]:
bp.plot_num(df, 'annual_inc',xlim=[0,0.2e8])
count mean std min 25% 50% 75% max
annual_inc 466281.0 73277.38147 54963.568654 1896.0 45000.0 63000.0 88960.0 7500000.0
In [26]:
bp.plot_num(df, 'dti')
count mean std min 25% 50% 75% max
dti 466285.0 17.218758 7.851121 0.0 11.36 16.87 22.78 39.99

Categorical Variables

In [27]:
df.select_dtypes('object').head(2)
Out[27]:
term grade emp_title emp_length home_ownership verification_status issue_d loan_status pymnt_plan purpose addr_state
0 36 months B NaN 10+ years RENT Verified Dec-11 Fully Paid n credit_card AZ
1 60 months C Ryder < 1 year RENT Source Verified Dec-11 Charged Off n car GA
In [28]:
df.select_dtypes('object').nunique().sort_values()
Out[28]:
term                        2
pymnt_plan                  2
verification_status         3
home_ownership              6
grade                       7
loan_status                 9
emp_length                 11
purpose                    14
addr_state                 50
issue_d                    91
emp_title              205475
dtype: int64
In [29]:
bp.plot_cat(df,'term')
==================================================
Feature: **term**
Overall Count: 
     36 months: 72.48%
     60 months: 27.52%
In [30]:
bp.plot_cat(df, 'grade')
==================================================
Feature: **grade**
Overall Count: 
    B: 29.37%
    C: 26.87%
    D: 16.49%
    A: 16.06%
    E: 7.67%
    F: 2.84%
    G: 0.71%

Bi-variate Analysis

In [31]:
df['loan_status'].unique()
Out[31]:
array(['Fully Paid', 'Charged Off', 'Current', 'Default',
       'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)
In [32]:
bad = ['Late (31-120 days)',
       'Charged Off',
       'Default',
       'Does not meet the credit policy. Status:Charged Off']
df['good_bad'] = np.where(df['loan_status'].isin(bad),0,1)
In [33]:
bp.plot_num_cat(df,'annual_inc','good_bad')
count mean std min 25% 50% 75% max
good_bad
good_bad_0 50968.0 65453.94 41188.09 2000.0 40693.86 56300.0 79000.0 1250000.0
good_bad_1 415313.0 74237.49 56348.16 1896.0 45000.00 64000.0 90000.0 7500000.0
annual_inc 466281.0 73277.38 54963.57 1896.0 45000.00 63000.0 88960.0 7500000.0
In [34]:
bp.plot_cat_num(df,'good_bad','annual_inc')

Map Visualization

In [35]:
df.head(1)
Out[35]:
loan_amnt funded_amnt term int_rate grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose addr_state dti good_bad
0 5000 5000 36 months 10.65 B NaN 10+ years RENT 24000.0 Verified Dec-11 Fully Paid n credit_card AZ 27.65 1
In [39]:
df1 = df.groupby('addr_state')['good_bad'].sum().sort_values().reset_index()
df1.head(2)
Out[39]:
addr_state good_bad
0 ME 4
1 NE 6
In [42]:
df1.plot.barh(x='addr_state',y='good_bad',figsize=(12,18))
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x126baf0b8>
In [43]:
[i for i in dir(bp) if 'map' in i]
Out[43]:
['get_mapbox_access_token',
 'map_plot_bokeh',
 'plotly_corr_heatmap',
 'plotly_mapbox',
 'plotly_usa_bubble_map',
 'plotly_usa_map',
 'plotly_usa_map2']
In [44]:
# help(bp.plotly_usa_map)
Help on function plotly_usa_map in module bhishan.plot_map:

plotly_usa_map(df, col_state, col_value, col_text=None, colorscale='Viridis', reversescale=False, title=None, width=800, height=800, ofile=None, show=True, auto_open=False)
    Plotly map plot for different states of USA.
    
    Parameters
    -----------
    df: pandas.DataFrame
        Input data.
    col_state: str
        Name of State column. eg. OH
    col_value: str
        Name of Value column. e.g. mean_salary
    col_text: str
        Name of Text column.
    colorscale: str
        Valid plotly colorscale name.
        The default colorscale is ‘RdBu’.
        https://plot.ly/python/builtin-colorscales/
        One of the following colorscales:
        ['Reds','Greens','Blues','Greys',
        'Viridis','Earth','Rainbow',
        'Blackbody','Bluered','Electric',
        'Hot','Jet',
        'Picnic','Portland',
        'RdBu','YlGnBu','YlOrRd']
    reversescale: bool
        Whether or not to reverse the colorscale.
    title: str
        Title of the plot.
    width: int
        Width of the map.
    height: int
        Height of the map.
    ofile: str
        Name of the output file.
    show: bool
        Whether or not to show the rendered html in notebook.
    auto_open: bool
        Whether or not to automatically open the ouput html file.
    
    Example 1
    -----------
    df = pd.DataFrame({'state': ['NY', 'OH','MI','CA','TX'],
                    'value': [100,200,300,400,500]})
    plotly_usa_map(df,'state','value')

In [49]:
bp.plotly_usa_map(df1,'addr_state','good_bad',colorscale='RdBu',
                 title='Number of good borrowers per State')
In [ ]: