
Identifying Risky Bank Loans Using C5.0 Decision Trees by Brett Lantz


There are numerous implementations of decision trees, but one of the most well-known is the C5.0 algorithm. In this example, we develop a simple credit approval model using C5.0, as implemented in the C50 package, training a tree to predict whether a loan applicant will default; the model achieves 73 percent accuracy on a held-out test set. We use a dataset donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg.


The global financial crisis of 2007–2008 highlighted the importance of transparency and rigor in banking practices. With the availability of credit limited, banks tightened their lending standards and turned to machine learning to more accurately identify risky loans.

This article is an excerpt from the book, Machine Learning with R, Third Edition written by Brett Lantz. This book provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, this book teaches you everything you need to uncover key insights, make new predictions, and visualize your findings.

Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language. In this article section, we will develop a simple credit approval model using C5.0 decision trees.

Step 1 — collecting data

The motivation for our credit model is to identify factors that are linked to a higher risk of loan default. For this article, we’ll use a dataset donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany.

The credit dataset includes 1,000 examples of loans, plus a set of numeric and nominal features indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see if we can identify any patterns that predict this outcome.

Step 2 — exploring and preparing the data

We will import the data using the read.csv() function. This creates a credit data frame with a number of factor variables (note that in R 4.0 and later, read.csv() no longer converts strings to factors by default, so add stringsAsFactors = TRUE to the call to reproduce the factor columns shown here):

> credit <- read.csv("credit.csv")

We can check the resulting object by examining the first few lines of output from the str() function:

> str(credit)

'data.frame': 1000 obs. of 17 variables:

$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..

$ months_loan_duration: int 6 48 12 ...

$ credit_history : Factor w/ 5 levels "critical","good",..

$ purpose : Factor w/ 6 levels "business","car",..

$ amount : int 1169 5951 2096 ...

We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types.

Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The applicant’s checking and savings account balance are recorded as categorical variables:

> table(credit$checking_balance)

< 0 DM > 200 DM 1 - 200 DM unknown

274 63 269 394

> table(credit$savings_balance)

< 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown

603 48 103 63 183

The checking and savings account balance may prove to be important predictors of loan default status. Some of the loan’s features are numeric, such as its duration and the amount of credit requested:

> summary(credit$months_loan_duration)

Min. 1st Qu. Median Mean 3rd Qu. Max.

4.0 12.0 18.0 20.9 24.0 72.0

> summary(credit$amount)

Min. 1st Qu. Median Mean 3rd Qu. Max.

250 1366 2320 3271 3972 18420

The loan amounts ranged from 250 DM to 18,420 DM across terms of four to 72 months. They had a median amount of 2,320 DM and median duration of 18 months.

The default vector indicates whether the loan applicant was able to meet the agreed payment terms or if they went into default. A total of 30 percent of the loans in this dataset went into default:

> table(credit$default)

no yes

700 300
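
As a quick check on the 30 percent figure, prop.table() converts these counts into proportions:

> prop.table(table(credit$default))

no yes

0.7 0.3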

A high rate of default is undesirable for a bank because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants who are at high risk of default, allowing the bank to refuse the credit request before the money is given.

Step 3 — training a model on the data

We will use the C5.0 algorithm in the C50 package for training our decision tree model. If you have not done so already, install the package with install.packages("C50") and load it into your R session using library(C50).
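
This excerpt assumes the credit data has already been partitioned into a 900-case training set (credit_train) and a 100-case test set (credit_test). A minimal sketch of that step, using a simple random split (the seed value here is illustrative, not necessarily the one used to generate the outputs below):

> set.seed(123) # illustrative seed; choose any value for reproducibility

> train_sample <- sample(1000, 900) # draw 900 of the 1,000 row indices at random

> credit_train <- credit[train_sample, ]

> credit_test <- credit[-train_sample, ]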

For the first iteration of the credit approval model, we’ll use the default C5.0 settings, as shown in the following code. Column 17 in credit_train is the class variable, default, so we need to exclude it from the training data frame and supply it as the target factor vector for classification:

> credit_model <- C5.0(credit_train[-17], credit_train$default)
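
Equivalently, the C50 package provides a formula interface, which avoids the column-index bookkeeping:

> credit_model <- C5.0(default ~ ., data = credit_train)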

The credit_model object now contains a C5.0 decision tree. We can see some basic data about the tree by typing its name:

> credit_model

Call:

C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree

Number of samples: 900

Number of predictors: 16

Tree size: 57

Non-standard options: attempt to group attributes

The output shows some simple facts about the tree, including the function call that generated it, the number of features (labeled predictors), and examples (labeled samples) used to grow the tree. Also listed is the tree size of 57, which means the tree contains 57 decisions (its leaf nodes), not that it is 57 levels deep; either way, it is quite a bit larger than the example trees we've considered so far!
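
For a tree this large, the text listing can be difficult to navigate. The C50 package also supplies a plot() method for C5.0 objects (it draws the tree via the partykit package); its subtree parameter, shown here with an illustrative node number, displays just one branch at a time:

> plot(credit_model) # diagram the full tree; may be cluttered with 57 leaves

> plot(credit_model, subtree = 5) # zoom in on the subtree rooted at node 5 (illustrative)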

To see the tree’s decisions, we can call the summary() function on the model:

> summary(credit_model)

This results in the following output:

[Figure: the first several branches of the decision tree, as listed by summary(credit_model)]

The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:

  1. If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.”
  2. Otherwise, if the checking account balance is less than zero DM or between one and 200 DM…
  3. … and the credit history is perfect or very good, then classify as “likely to default.”

The numbers in parentheses indicate the number of examples meeting the criteria for that decision and the number incorrectly classified by it. For instance, on the first line, 412/50 indicates that of the 412 examples reaching that decision, 50 were incorrectly classified as not likely to default. In other words, 50 applicants actually defaulted in spite of the model's prediction to the contrary.

After the tree, the summary(credit_model) output displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data:

Evaluation on training data (900 cases):

[Figure: the confusion matrix from summary(credit_model), cross-tabulating actual and predicted default status for the 900 training cases]

The Errors heading shows that the model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives), while 98 yes values were misclassified as no (false negatives).

Given the tendency of decision trees to overfit to the training data, the error rate reported here, which is based on training data performance, may be overly optimistic. Therefore, it is especially important to continue our evaluation by applying our decision tree to a test dataset.

Step 4 — evaluating model performance

To apply our decision tree to the test dataset, we use the predict() function as shown in the following line of code:

> credit_pred <- predict(credit_model, credit_test)

This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records:

> library(gmodels)

> CrossTable(credit_test$default, credit_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

This results in the following table:

[Figure: the CrossTable() output comparing actual and predicted default status for the 100 test records]

Out of the 100 loan applications in the test set, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent.
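
The same accuracy figure can be computed directly, without the cross-tabulation, by comparing the prediction vector to the actual values:

> mean(credit_pred == credit_test$default)

[1] 0.73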

Conclusion

In this article, we used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. The decision trees built by C5.0 generally perform well and are much easier to understand and deploy than many alternative models. In this example, we built a credit approval model with 73 percent accuracy. To investigate further, you can refer to Brett Lantz's latest book, Machine Learning with R, Third Edition.

About the Author

Brett Lantz is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. A sociologist by training, Brett was first captivated by machine learning during research on a large database of social network profiles.

