This project uses the consumer complaint database.
The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database. The database generally updates daily.
To load the serialized object, make sure you are using the same conda environment that was used to create it. </div>
Term Frequency : This gives how often a given word appears within a document.
$\mathrm{TF}=\frac{\text { Number of times the term appears in the doc }}{\text { Total number of words in the doc }}$
Inverse Document Frequency: This measures how rare the word is across the documents. If a term is very common among documents (e.g., “the”, “a”, “is”), then it has a low IDF score.
$\mathrm{IDF}=\ln \left(\frac{\text { Number of docs }}{\text { Number docs the term appears in }}\right)$
Term Frequency – Inverse Document Frequency (TF-IDF): TF-IDF is the product of the TF and IDF scores of the term.
$\text{TF-IDF}=\mathrm{TF} \times \mathrm{IDF}$
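The three textbook formulas above can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three documents and the helper names `tf`, `idf` and `tfidf` are made up for this example and are not part of the project code. Note that `idf` divides by the number of documents containing the term, so it fails for a term that appears in no document, which is the divide-by-zero problem that library implementations work around.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenised by whitespace
docs = [
    "credit report error",
    "credit card fee",
    "mortgage payment late fee",
]
tokenised = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the words in one document that equal `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: ln(number of docs / docs containing `term`)."""
    n_containing = sum(term in tokens for tokens in corpus)
    return math.log(len(corpus) / n_containing)

def tfidf(term, tokens, corpus):
    """TF-IDF: product of the two scores above."""
    return tf(term, tokens) * idf(term, corpus)

# Score two terms in the first and the last document
for term in ["credit", "mortgage"]:
    print(term,
          round(tfidf(term, tokenised[0], tokenised), 4),
          round(tfidf(term, tokenised[2], tokenised), 4))
```

A term scores zero in any document that does not contain it, and a term found in fewer documents ("mortgage") gets a larger IDF than one found in more documents ("credit").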
In scikit-learn, TF-IDF features are computed with the class TfidfVectorizer.
It has the following parameters:
min_df: ignore words that occur in fewer than min_df documents.
max_df: ignore words that occur in more than max_df documents (a float is interpreted as a fraction of the total number of documents in the corpus).
sublinear_tf: set to True to scale the term frequency logarithmically, replacing tf with 1 + ln(tf).
stop_words: set to 'english' to remove the predefined English stop words.
use_idf: set to True to weight terms by their inverse document frequency.
ngram_range: (1, 2) indicates that unigrams and bigrams will be considered.
NOTE: sklearn uses the raw term count as TF (the resulting vectors are L2-normalized by default), and its IDF differs from the textbook form to address the divide-by-zero problem. Ref: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
With the default smooth_idf=True, sklearn computes
$\mathrm{IDF}(t)=\ln \left(\frac{1+n}{1+\mathrm{df}(t)}\right)+1$
Here, n is the total number of documents and df(t) is the number of documents in the document set that contain the term t.
import time
time_start_notebook = time.time()
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import tqdm
import matplotlib.pyplot as plt
# local scripts
import util
import config
ifile = config.clean_data_path
SEED = config.SEED
model_linsvc_tfidf_path = config.model_linsvc_tfidf_path
tfidf_fitted_vec_path = config.tfidf_fitted_vec_path
N_SAMPLES = config.N_SAMPLES
compression= config.compression
# settings
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot')
pd.options.display.max_colwidth = 200
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import joblib
#Visualizers
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
# versions
import watermark
%load_ext watermark
%watermark -a "Bhishan Poudel" -d -v -m
print()
%watermark -iv
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-88-f85df68dbbcf> in <module>
     18 model_linsvc_tfidf_path = config.model_linsvc_tfidf_path
     19 tfidf_fitted_vec_path = config.tfidf_fitted_vec_path
---> 20 N_SAMPLES = config.N_SAMPLES
     21
     22 compression= config.compression

AttributeError: module 'config' has no attribute 'N_SAMPLES'
def show_methods(obj, ncols=4):
    lst = [i for i in dir(obj) if i[0]!='_' ]
    df = pd.DataFrame(np.array_split(lst,ncols)).T.fillna('')
    return df
!ls ../data
complaints_2019.csv.zip complaints_2019_clean.csv.zip orig_data_head_tail.csv
df = pd.read_csv('../data/complaints_2019_clean.csv.zip',compression='zip')
# make data small
df = df.sample(n=N_SAMPLES, random_state=SEED)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df.head(2), df.tail(2)])
| | product | complaint | complaint_lst_clean | complaint_clean | total_length | num_words | num_sent | num_unique_words | avg_word_len | avg_unique |
---|---|---|---|---|---|---|---|---|---|---|
82392 | Student loan | On XX/XX/2019 I sent a dispute letter to Fed Loan Servicing about the student loans they claim I owe. I asked them to send me verifiable information for the accounts and the information that they ... | ['sent', 'dispute', 'letter', 'fed', 'loan', 'servicing', 'student', 'loan', 'claim', 'owe', 'asked', 'send', 'verifiable', 'information', 'account', 'information', 'sent', 'constitute', 'sent', '... | sent dispute letter fed loan servicing student loan claim owe asked send verifiable information account information sent constitute sent promissory note school lot information redacted supposed do... | 970 | 172 | 1 | 97 | 4.645349 | 0.563953 |
1435 | Credit reporting, credit repair services, or other personal consumer reports | Someone applied for a vehicle in my name and now it is reflecting on my credit report and this is not my account | ['someone', 'applied', 'vehicle', 'name', 'reflecting', 'credit', 'report', 'account'] | someone applied vehicle name reflecting credit report account | 112 | 23 | 1 | 19 | 3.913043 | 0.826087 |
13448 | Credit reporting, credit repair services, or other personal consumer reports | My exwife opened a XXXX Credit card in 2009 ( 3 years before we ever met ). Shortly after we met, she added me as an authorized user and I never even had a card. The three credit reporting agencie... | ['exwife', 'opened', 'credit', 'card', 'year', 'ever', 'met', 'shortly', 'met', 'added', 'authorized', 'user', 'never', 'even', 'card', 'three', 'credit', 'reporting', 'agency', 'claiming', 'joint... | exwife opened credit card year ever met shortly met added authorized user never even card three credit reporting agency claiming jointly owned account filed bankruptcy im responsible debt card nev... | 601 | 117 | 1 | 79 | 4.145299 | 0.675214 |
61809 | Credit reporting, credit repair services, or other personal consumer reports | AFTER RECEIVING A CURRENT COPY OF MY CREDIT REPORT, I DISCOVERED SOME ENTRIES THAT WERE IDENITIFIED AS INQUIRIES WHICH QUALIFIED FOR DELETION FROM MY REPORT. | ['receiving', 'current', 'copy', 'credit', 'report', 'discovered', 'entry', 'idenitified', 'inquiry', 'qualified', 'deletion', 'report'] | receiving current copy credit report discovered entry idenitified inquiry qualified deletion report | 157 | 25 | 1 | 24 | 5.320000 | 0.960000 |
maincol = 'complaint'
mc = maincol + '_clean'
target = 'product'
df['product_orig'] = df['product']
df['product'] = df['product'].astype('category').cat.codes
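The integer codes produced by `.astype('category').cat.codes` follow the sorted order of the category labels, not the order in which the labels first appear. The sketch below illustrates this on a small made-up Series (the values and variable names are assumptions for illustration). It shows one way to build an id-to-label mapping that agrees with the codes, and how a mapping built from `unique()` can disagree.

```python
import pandas as pd

# Hypothetical labels, in order of first appearance
products = pd.Series(["Student loan", "Mortgage", "Debt collection", "Mortgage"])

cat = products.astype("category")
codes = cat.cat.codes  # integer ids, assigned by sorted category label

# Mapping that matches the codes: enumerate the sorted categories
id_to_product = dict(enumerate(cat.cat.categories))

# Mapping by first appearance: generally NOT aligned with the codes
id_by_appearance = dict(enumerate(products.unique()))

print(list(codes))
print(id_to_product)
print(id_by_appearance)
```

Decoding the codes with `id_to_product` recovers the original labels, while `id_by_appearance` would label code 0 as "Student loan" even though code 0 stands for "Debt collection".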
%%time
from sklearn.model_selection import train_test_split
X = df['complaint_clean'] # documents
y = df['product'].astype('category').cat.codes # target
X_train, X_test, y_train, y_test = train_test_split(X, y,
train_size=config.train_size,
random_state=config.SEED)
CPU times: user 2.59 ms, sys: 1.02 ms, total: 3.61 ms Wall time: 3.13 ms
sorted(y_train.unique()), sorted(y_test.unique())
([0, 1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 2, 3, 4, 5, 6, 7, 8])
from sklearn.svm import LinearSVC
RE_TRAIN = True
if RE_TRAIN:
    tfidf = TfidfVectorizer(**config.params_tfidf)
    fitted_vectorizer = tfidf.fit(X_train)
    tfidf_vectorizer_vectors = fitted_vectorizer.transform(X_train)
    model = svm.LinearSVC(**config.params_linsvc)
    model.fit(tfidf_vectorizer_vectors, y_train)
    joblib.dump(model, model_linsvc_tfidf_path )
    joblib.dump(fitted_vectorizer, tfidf_fitted_vec_path)
model = joblib.load(model_linsvc_tfidf_path)
fitted_vectorizer = joblib.load(tfidf_fitted_vec_path)
X_train_text = fitted_vectorizer.transform(X_train)
X_test_text = fitted_vectorizer.transform(X_test)
ypreds = model.predict(X_test_text)
print('Accuracy : {:.4f} '.format(metrics.accuracy_score(y_test,ypreds)))
Accuracy : 0.8125
show_methods(model)
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | C | fit | max_iter | random_state |
1 | class_weight | fit_intercept | multi_class | score |
2 | classes_ | get_params | n_features_in_ | set_params |
3 | coef_ | intercept_ | n_iter_ | sparsify |
4 | decision_function | intercept_scaling | penalty | tol |
5 | densify | loss | predict | verbose |
6 | dual | | | |
# linear svm does not have probs, we need to use calibrated classifier
from sklearn.calibration import CalibratedClassifierCV
clf = CalibratedClassifierCV(model)
clf.fit(X_train_text, y_train)
yprobs = clf.predict_proba(X_test_text)
yprobs[:5]
array([[1.03393898e-03, 1.11447689e-02, 9.46803176e-01, 2.29536627e-02, 2.25909015e-03, 1.79027402e-03, 2.61764108e-03, 3.52172169e-03, 7.87572657e-03],
       [2.27591259e-01, 6.79365421e-01, 4.45497356e-02, 1.20384988e-02, 9.34615172e-03, 9.83929076e-03, 4.75855963e-03, 3.37268296e-03, 9.13840023e-03],
       [1.45642463e-02, 1.28215433e-02, 8.28846294e-01, 1.16644359e-01, 1.83326199e-03, 7.39376921e-03, 3.35941513e-03, 8.27152674e-03, 6.26558447e-03],
       [4.20867212e-03, 4.96180288e-01, 4.43500619e-01, 3.69750322e-02, 1.94862440e-03, 1.84292341e-03, 4.79874448e-03, 7.81116049e-04, 9.76398018e-03],
       [5.56716064e-03, 3.22849516e-03, 9.78089421e-02, 7.99976820e-01, 4.24207748e-03, 1.04630991e-02, 3.56641015e-02, 9.17440176e-03, 3.38749025e-02]])
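LinearSVC exposes `decision_function` but no `predict_proba`, which is why the cell above wraps it in CalibratedClassifierCV. The sketch below shows the same idea on synthetic data. The dataset, `cv=3` and the variable names are assumptions for illustration, not the settings used in this notebook. With its default method the wrapper fits a sigmoid (Platt scaling) on cross-validated decision scores, refitting clones of the estimator on each fold.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class problem (hypothetical stand-in for the TF-IDF features)
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

svc = LinearSVC(random_state=0)
print(hasattr(svc, "predict_proba"))  # LinearSVC has no probability output

# Wrap the SVM; clones of it are fitted on each CV fold and then calibrated
calibrated = CalibratedClassifierCV(svc, cv=3)
calibrated.fit(X_toy, y_toy)

probs = calibrated.predict_proba(X_toy[:5])
print(probs.round(3))
print(probs.sum(axis=1))  # each row is a distribution over the 3 classes
```

Each row of `probs` holds one probability per class, in the order given by `calibrated.classes_`.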
np.savetxt('../outputs/ytest.csv',y_test,fmt='%d')
np.savetxt('../outputs/ypreds_linsvc.csv',ypreds,fmt='%d')
np.savetxt('../outputs/yprobs_linsvc.csv',yprobs)
df_preds = pd.DataFrame({'ytest': y_test, 'ypreds': ypreds})
df_preds.head()
| | ytest | ypreds |
---|---|---|
58324 | 2 | 2 |
96981 | 1 | 1 |
29952 | 2 | 2 |
70705 | 2 | 2 |
109585 | 3 | 3 |
df_preds.query("ytest != ypreds").head()
| | ytest | ypreds |
---|---|---|
72777 | 2 | 3 |
112179 | 2 | 1 |
50082 | 2 | 1 |
52608 | 2 | 3 |
16212 | 3 | 2 |
# cat.codes numbers the classes by sorted category label, so build the
# id -> product mapping from the same categorical ordering
dic_id_to_product = dict(enumerate(df['product_orig'].astype('category').cat.categories))
dic_product_to_id = {v:k for k,v in dic_id_to_product.items()}
ser_id_to_product = pd.Series(dic_id_to_product)
ser_product_to_id = pd.Series(dic_product_to_id)
# confusion matrix: rows are actual classes, columns are predicted classes
conf_mat = confusion_matrix(y_test, ypreds)
for predicted in ser_id_to_product.index:
    for actual in ser_id_to_product.index:
        if predicted != actual and conf_mat[actual, predicted] >= 20:
            print("'{}' predicted as '{}' : {} examples.".format(dic_id_to_product[actual],
                                                                 dic_id_to_product[predicted],
                                                                 conf_mat[actual, predicted]))
            # y_test keeps the row labels of df from the train-test split
            mask = (y_test == actual) & (ypreds == predicted)
            display(df.loc[y_test[mask].index][['product',
                                                'complaint']])
            print('')
'Debt collection' predicted as 'Credit reporting, credit repair services, or other personal consumer reports' : 21 examples.
| | product | complaint |
---|---|---|
23968 | Debt collection | XXXX XXXX XXXX believes I owe them {$7800.00} for terminating a lease in 2015. I gave them more than enough notice that I would have to leave the apartment due t the lack of affordability because ... |
23699 | Debt collection | In response to a denial of an extension of credit this consumer checked with the consumer reporting agencies and found the following : 1. Your company has furnished negative information about this... |
11400 | Debt collection | I was Evicted from my home in XX/XX/2017 I paid all my debt from the landlord and then a year later the name Hunter Warfield showed on my credit report. I was never given a notice about this charg... |
39758 | Debt collection | I've lived in P.R all my life, never in the U.S. Since XX/XX/2017 I have received collection notifications from different creditors of the U.S. I already reported to the P.R. Police Department, at... |
89226 | Debt collection | This letter is to inform you that Lending Club has failed to respond to my credit dispute letter and failed to verify that this account belongs to me that I sent certified mail on XX/XX/2019. This... |
51947 | Debt collection | On XX/XX/2018 I have contacted a agency called Credit Collection Services and XXXX XXXX XXXX advising them that I discovered a account that has been opened as a result of fraud. This agency failed... |
30341 | Debt collection | I am a single mother. I recently tried to purchase a home for my family and was denied. I than reviewed my own credit report and seen a lot of unauthorized credit inquires on my credit report that... |
14577 | Debt collection | I have called this company and told them this is not my account, they continue to refuse to accept it. I asked for proof to be provided, they sent me a letter with an address that I do not recogni... |
13315 | Debt collection | FCO has reported a collection on my credit report this year. I had no idea that I had a collection because I pay all my bills on time. I reached out to FCO who collected all my personal info, conf... |
25393 | Debt collection | The following Hard inquiries were made on my credit : XXXX XXXX XXXX XX/XX/XXXX XXXX XXXX XXXX XXXX XX/XX/2018 XXXX XXXX XXXX XXXX XX/XX/XXXX XXXX XXXX XXXX XX/XX/XXXX In XX/XX/2018, I Applied cre... |
90015 | Debt collection | I recently reviewed my XXXX credit report and I was totally shocked to find Capital One Bank is still reporting these fraudulent accounts on my credit report. I am requesting for Capital One Bank ... |
58634 | Debt collection | I received several emails from Bank of America about settling an outstanding debt for {$42.00}. I reached out to the company on XX/XX/2019. I spoke to a collection agent regarding the email and sh... |
57907 | Debt collection | KINGS CREDIT SERVICE XXXX XXXX XXXX XXXX XXXX, CA XXXX ( XXXX ) XXXX Kings Credit Service Opened XX/XX/2018 {$46.00} Original creditor : XXXX XXXX XXXX XXXX |
80607 | Debt collection | I applied to rent an apartment at XXXX in XXXX XXXX while it was still under construction in XXXX of 2019. My application was denied and I never moved in. A few months later I noticed that I had a... |
85527 | Debt collection | NOTICE OF PENDING LITIGATION SEEKING RELIEF AND MONETARY DAMAGES UNDER FCRA SECTION 616 & SECTION 617/// TRIDENT ASST MANAG IS REPORTING AN ACCOUNT ON MY CREDIT THAT IS NOT MINE/INACCURATE.FRAUD. ... |
42457 | Debt collection | SOUTHERN MANAGEMENT SYSTEMS IS REPORTING FALSE INFORMATION ON MY CREDIT REPORT! REMOVE ALL NEGATIVE ITEMS ON CREDIT REPORT. |
8244 | Debt collection | I was an XXXX customer. I had a complaint filed with the XXXX for unfair sales practices and fraud. ( they added services specifically XXXX XXXX/XXXX XXXX that I did not request. They also chang... |
27520 | Debt collection | Again this XXXX XXXX XXXX has sent nothing other than a generic letter. They responded to your company saying it will be removed soon but nothing explaining when and this is past the statue of lim... |
5846 | Debt collection | The Mini Van was reported stolen to the Police but resolved that it was retrieve by XXXX XXXX as repossess without notice. |
99224 | Debt collection | Transworld apparently purchased an account from XXXX or took over an account from XXXX for which I had an open dispute, they have reported it to credit reporting agencies negatively impacting my ... |
13818 | Debt collection | XXXX XXXX XXXX XXXX XXXX XXXX XXXX, GA XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX, XXXX XXXX XXXX # XXXX To Whom It May Concern : This letter is being sent to you in response to notic... |
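# refit TF-IDF on the sampled dataset so that tfidf and model share one vocabulary
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')
# create sparse vectors; LinearSVC accepts sparse input, so .toarray() is not needed
features = tfidf.fit_transform(df['complaint_clean'])
# integer category codes created above (df has no 'category_id' column)
labels = df['product']
model = LinearSVC()
model.fit(features, labels)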
LinearSVC()
def get_top_N_correlated(N=4,ser_id_to_product=ser_id_to_product):
    products,top_uni,top_bi = [],[],[]
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for category_id, product in ser_id_to_product.items():
        # sort the features by their coefficient for this class (ascending)
        indices = np.argsort(model.coef_[category_id])
        # get_feature_names() was removed in scikit-learn 1.2
        feature_names = np.array(tfidf.get_feature_names_out())[indices]
        # walk down from the largest coefficient and keep the first N of each kind
        unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
        bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
        products.append(product)
        top_uni.append(', '.join(unigrams))
        top_bi.append(', '.join(bigrams))
    # dataframe
    df_top_corr = pd.DataFrame({'product': products,
                                'unigram': top_uni,
                                'bigram': top_bi})
    return df_top_corr
df_top_corr = get_top_N_correlated(N=4)
df_top_corr.style.set_caption('Top Correlated Terms per Category')
| | product | unigram | bigram |
---|---|---|---|
0 | Checking or savings account | branch, bank, deposit, overdraft | saving account, called bank, checking account, card payment |
1 | Credit card or prepaid card | card, capital, express, statement | credit card, american express, card account, fraudulent charge |
2 | Credit reporting, credit repair services, or other personal consumer reports | experian, report, equifax, reporting | credit bureau, xxxx reporting, fraud alert, victim identity |
3 | Debt collection | debt, collection, calling, phone | account credit, certified mail, time day, funding llc |
4 | Money transfer, virtual currency, or money service | paypal, transfer, ticket, transaction | money order, money account, transfer fund, account said |
5 | Mortgage | mortgage, escrow, home, foreclosure | loan modification, escrow account, short sale, loan officer |
6 | Payday loan, title loan, or personal loan | loan, lending, title, lied | received loan, loan told, called asked, loan agreement |
7 | Student loan | navient, university, loan, owned | loan forgiveness, fed loan, student loan, thank time |
8 | Vehicle loan or lease | car, vehicle, leased, gm | gm financial, auto loan, fee payment, auto finance |
X = df['complaint_clean'] # documents
y = df['product'].astype('category').cat.codes # target
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25,
random_state = SEED)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
ngram_range=(1, 2),
stop_words='english')
fitted_vectorizer = tfidf.fit(X_train)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(X_train)
model = LinearSVC().fit(tfidf_vectorizer_vectors, y_train)
# Save the fitted model (model persistence)
joblib.dump(model, '../models/tfidf.pkl')
['../models/tfidf.pkl']
new_complaint = """Hello : ditech.com is my mortgagecompany.
They placed an automatic forbearance on my account
and removed my auto payment after
Hurricane Irma.
I called about a week after the storm
to ask that they remove the forbearance
and return the auto payment.
This was confirm by the agent
and recorded by them.
I received a letter just a few
weeks ago stating that my auto payment
was never returned and the agent who
I spoke with after I received the
letter actually read back the notes
confirming that I called and asked
to have forbearance removed and auto
payment reinstated.
So I asked again the agent
to remove the forbearance and install auto payment.
\n\nI called this past week to check
if this was done yet, and the agent
at that time said I still have
a forbearance and no auto payment.
\n\nAs I right this complaint,
I spoke with an agent today that
informs me that I dont have auto
payment and forbearance is still active.
She placed me on hold, which has lasted an hour.
\n\nDitech is not responsive,
and it is purposely choosing
to keep my in forbearance when
I have asked countless times to remove me.
I also have asked countless times
to reinstate auto payment and yet
they choose not to listen.
\n\nPlease help XXXX XXXX, XXXX"""
model_loaded = joblib.load('../models/tfidf.pkl')
new_comp_vec = fitted_vectorizer.transform([new_complaint])
pred = model_loaded.predict(new_comp_vec)
print(pred)
[2]
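The prediction above reloads only the classifier from disk and reuses the in-memory `fitted_vectorizer`. One possible way to keep the two together, sketched below under stated assumptions, is to wrap the vectorizer and the classifier in a single scikit-learn Pipeline and persist that one object. The toy texts, labels and file name are hypothetical and this is not how the notebook stores its models.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set: 0 = mortgage, 1 = credit card
texts = [
    "mortgage escrow payment was applied late",
    "loan modification for my mortgage was denied",
    "credit card was charged a fraudulent fee",
    "the credit card company raised my interest rate",
]
labels = [0, 0, 1, 1]

# The vectorizer and the classifier are fitted and stored as one object
pipe = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
pipe.fit(texts, labels)

# Persist to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "tfidf_linsvc_pipeline.pkl")
joblib.dump(pipe, path)

# Later: load once and predict directly on raw text
loaded = joblib.load(path)
new_text = ["my mortgage company removed my auto payment"]
print(loaded.predict(new_text))
```

The loaded pipeline applies the same fitted vocabulary before classifying, so raw text can be passed straight to `predict`. As with any pickled scikit-learn object, it should be loaded in an environment with the same library versions that were used to create it.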
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
'{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 7 min 10 secs