
Decision Trees Explained

source link: https://towardsdatascience.com/decision-trees-explained-3ec41632ceb6?gi=59f4970bd318

Learn everything about Decision Trees for Machine Learning

Mar 8 · 8 min read


Source: Unsplash

In this post, I will explain Decision Trees in simple terms. It could be considered a "Decision Trees for dummies" post; however, I've never really liked that expression.

Introduction and Intuition

In the Machine Learning world, Decision Trees are a kind of non-parametric model that can be used for both classification and regression.

This means that Decision Trees are flexible models that don't increase their number of parameters as we add more features (if we build them correctly), and they can output either a categorical prediction (like whether a plant is of a certain kind or not) or a numerical prediction (like the price of a house).

They are constructed using two kinds of elements: nodes and branches. At each node, one of the features of our data is evaluated, either to split the observations during training or to make a specific data point follow a certain path when making a prediction.


At each node a variable is evaluated to decide which path to follow.

Decision trees are built by recursively evaluating different features and using at each node the feature that best splits the data. This will be explained in detail later.

Probably the best way to start the explanation is by seeing what a decision tree looks like, to build a quick intuition of how they can be used. The following figure shows the general structure of one of these trees.


Figure of a decision tree.

In this figure we can observe three kinds of nodes:

  • The root node: the node that starts the graph. In a normal decision tree, it evaluates the variable that best splits the data.
  • Intermediate nodes: nodes where variables are evaluated, but which are not the final nodes where predictions are made.
  • Leaf nodes: the final nodes of the tree, where the predictions of a category or a numerical value are made.

Alright, now that we have a general idea of what Decision trees are, let's see how they are built.

Training process of a Decision Tree

As we mentioned previously, decision trees are built by recursively splitting our training samples using the features from the data that work best for the specific task. This is done by evaluating certain metrics, like the Gini index or the Entropy for categorical decision trees, or the Residual Sum of Squares or Mean Squared Error for regression trees.

The process is also different depending on whether the feature evaluated at the node is discrete or continuous. For discrete features, all possible values are evaluated, resulting in N calculated metrics for each variable, where N is the number of possible values of that categorical variable. For continuous features, the mean of each pair of consecutive values (ordered from lowest to highest) in the training data is used as a possible threshold.
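The continuous-feature case can be sketched in a few lines of plain Python (a minimal illustration of the idea, not any library's actual implementation):

```python
# Candidate split thresholds for a continuous feature: the midpoints of
# consecutive distinct values, ordered from lowest to highest.
def candidate_thresholds(values):
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

# Four training samples with three distinct values -> two candidate thresholds
print(candidate_thresholds([1.0, 2.0, 4.0, 2.0]))  # → [1.5, 3.0]
```

Note that duplicated values are collapsed first, since the midpoint of two equal values would not split anything.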

The result of this process is, for a certain node, a list of variables, each with different thresholds, and a calculated metric (Gini or MSE) for each variable/threshold pair. Then, we pick the variable/threshold combination that gives the best value of the metric for the resulting child nodes (the greatest reduction in the metric).
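This greedy search can be sketched for a classification split using Gini impurity (again, an illustrative toy version; real implementations are heavily optimized):

```python
# Exhaustive greedy split search: for every feature and every candidate
# threshold, compute the weighted Gini impurity of the two children and
# keep the combination with the lowest value.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    best = None  # (weighted_impurity, feature_index, threshold)
    n = len(labels)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# A perfectly separable toy set: the best threshold sits between the classes
print(best_split([[1.0], [2.0], [8.0], [9.0]], ["a", "a", "b", "b"]))
# → (0.0, 0, 5.0)
```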

We won’t go into how these metrics are calculated, as it is beyond the scope of this introductory post; however, I will leave some resources at the end for you to dive deeper if you are interested. For the moment, just think of these metrics (Gini for categorical trees and Mean Squared Error for regression trees) as some sort of error which we want to reduce.

Let's see an example of two decision trees, a categorical one and a regression one, to get a clearer picture of this process. The following figure shows a categorical tree built for the famous Iris Dataset, where we try to predict a category out of three different flowers, using features like the petal width, length, sepal length, …


Decision tree built for the Iris Dataset

We can see that the root node starts with 50 samples of each of the three classes, and a Gini Index (as it is a categorical tree, the lower the Gini Index, the better) of 0.667.
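That 0.667 can be checked by hand: the Gini index is one minus the sum of squared class proportions, and here each class holds a third of the samples.

```python
# Gini at the Iris root node: three classes, 50 samples each out of 150
p = 50 / 150
print(round(1 - 3 * p ** 2, 3))  # → 0.667
```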

In this node, the feature that best splits the different classes of the data is the petal width in cm, using a threshold value of 0.8. This results in two nodes, one with a Gini of 0 (a perfectly pure node that only contains one of the types of flowers) and one with a Gini of 0.5, where the two other kinds of flowers are grouped.

In this intermediate node (the False path from the root node), the same feature is evaluated (yes, this can happen, and it actually happens often if the feature is important), using a threshold of 1.75. This results in two more child nodes that are not pure, but that have a pretty low Gini Index.

In all of these nodes, all the other features of the data (sepal length, sepal width, and petal length) were evaluated and had their resulting Gini Index calculated; however, the feature that gave us the best results (lowest Gini Index) was the petal width.

The reason the tree didn’t continue growing is that Decision Trees always have a growth-stop condition configured; otherwise, they would grow until each training sample was separated into its own leaf node. Typical stop conditions are the maximum depth of the tree, the minimum number of samples in leaf nodes, or a minimum reduction in the error metric.
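In scikit-learn (assuming that library here, since the article doesn't name one), these stop conditions map directly to hyperparameters such as max_depth and min_samples_leaf:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cap the tree at depth 2 and require at least 5 samples per leaf,
# so growth stops long before every sample gets its own leaf
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
clf.fit(X, y)
print(clf.get_depth())  # → 2
```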

Let's check out a regression tree now. For this, we will use the Boston House Price Dataset, resulting in the following graph:


Decision tree built for the Boston Housing Dataset

As we can see in the previous figure, now we don’t have the Gini Index, but the MSE (Mean Squared Error). As in the previous example with the Gini, our tree is built using the feature/threshold combinations that most reduced this error.

The root node uses the variable LSTAT (% lower status of the population in the area) with a threshold of 9.725 to initially divide the samples. We can see that at the root node we have 506 samples, which are divided into 212 (left child node) and 294 (right child node).

The left child node uses the variable RM (number of rooms per dwelling) with a threshold of 6.631, and the right node uses the same LSTAT variable with a threshold of 16.085, resulting in four beautiful leaf nodes. As before, all the other variables were evaluated at each node, but these two were the ones that best split the data.
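Fitting such a regression tree takes a couple of lines; the sketch below uses scikit-learn's DecisionTreeRegressor on synthetic data (the Boston dataset has been removed from recent scikit-learn versions, so a noise-free step function stands in for the housing features):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: the target jumps from 0 to 10 at x = 5
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] > 5).astype(float) * 10.0

# A depth-1 tree (a single split) finds the step, minimizing MSE,
# and predicts the mean target on each side
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
print(reg.predict([[2.0]])[0], reg.predict([[8.0]])[0])  # → 0.0 10.0
```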

Awesome! Now we know how Decision Trees are built. Let's learn how they are used to make predictions.

Making predictions with a Decision Tree

Predicting the category or numerical target value of a new sample is very easy using Decision Trees; that is one of the main advantages of these kinds of algorithms. All we have to do is start at the root node, look at the value of the feature that it evaluates, and, depending on that value, go to the left or right child node.

This process is repeated until we reach a leaf node. When this happens, depending on whether we are facing a classification or a regression problem, two things can happen:

a) If we are facing a classification problem, the predicted category is the mode of the categories on that leaf node. Remember how in the classification tree we had value = [0, 49, 5] on the middle leaf node? This means that a test sample that reaches this node has the highest probability of belonging to the class with 49 training samples on that node, so we classify it as such.

b) For a regression tree, the prediction we make at the end is the mean of the values of the target variable at that leaf node. In our housing example, if a leaf node had 4 samples with prices 20, 18, 22, and 24, then the predicted value at that node would be 21, the mean of the 4 training examples that ended there.
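Both leaf rules fit in a couple of lines of plain Python, using the numbers from the two examples above:

```python
from statistics import mode, mean

# a) Classification leaf: predict the most common class among its samples
#    (the value = [0, 49, 5] leaf from the Iris tree)
leaf_classes = ["versicolor"] * 49 + ["virginica"] * 5
print(mode(leaf_classes))  # → versicolor

# b) Regression leaf: predict the mean target of its samples
leaf_prices = [20, 18, 22, 24]
print(mean(leaf_prices))  # → 21
```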

In the following figure, we can see how a prediction would be made for a new test sample (a house) with the previous regression tree.

Note: Only the features of the house that are used in the tree are shown.


The path a specific sample follows and the value of the given prediction. Icon from Flaticon.
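The traversal itself can be sketched with a tiny nested-dict tree. The root threshold below comes from the article's figure; the two leaf values are made up for illustration (the real tree has four leaves):

```python
# A hypothetical two-leaf version of the housing tree: internal nodes
# store a feature index and threshold, leaves store the predicted value.
tree = {
    "feature": 0, "threshold": 9.725,  # LSTAT, from the figure
    "left": {"value": 27.4},           # illustrative leaf means
    "right": {"value": 17.3},
}

def predict(node, sample):
    # Walk down until we reach a node with no children (a leaf)
    while "value" not in node:
        side = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["value"]

print(predict(tree, [5.0]))  # LSTAT = 5.0 → left leaf → 27.4
```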

Alright! Now we know how to make predictions using decision trees. Let's finish by learning their advantages and disadvantages.

Pros vs Cons of Decision Trees

Advantages:

  • The main advantage of decision trees is how easy they are to interpret. While other Machine Learning models are close to black boxes, decision trees provide a graphical and intuitive way to understand what our algorithm does.
  • Compared to other Machine Learning algorithms, Decision Trees require less data to train.
  • They can be used for both Classification and Regression.
  • They are simple.
  • They are tolerant to missing values.

Disadvantages

  • They are quite prone to overfitting to the training data and can be sensitive to outliers.
  • They are weak learners: a single decision tree normally does not make great predictions, so multiple trees are often combined to make ‘forests’ and give birth to stronger ensemble models. This will be discussed in a further post.

Conclusion and additional resources

Decision trees are simple and intuitive algorithms, and because of this they are used a lot when trying to explain the results of a Machine Learning model. Despite being weak learners on their own, they can be combined to give birth to bagging or boosting models, which are very powerful. In the next posts, we will explore some of these models.

If you want to know the full process for building a tree, check out the following video:

That is all, I hope you liked the post. Feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my posts on Data Science and Machine Learning here. Have a good read!

For more posts like this one, follow me on Medium, and stay tuned!

Also, to go further into Decision Trees and Machine Learning in general, take a look at the book described in the following article:

