The random forest is a machine learning classification algorithm that consists of numerous decision trees.
Each decision tree in the random forest contains a random sampling of features from the data set. Moreover, when building each tree, the algorithm uses a random sampling of data points to train the model.
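To make those two sources of randomness concrete, here is a minimal, purely illustrative sketch of the idea using scikit-learn decision trees as the building blocks. The function names (build_simple_forest, forest_predict) are made up for this example, it assumes X and y are NumPy arrays, and note that real implementations such as scikit-learn's RandomForestClassifier re-sample candidate features at every split rather than once per tree:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def build_simple_forest(X, y, n_trees = 10, random_state = None):
    #Illustrative only: each tree gets a bootstrap sample of rows and a random subset of columns
    rng = np.random.default_rng(random_state)
    n_rows, n_cols = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size = n_rows)  #random sampling of data points (with replacement)
        cols = rng.choice(n_cols, size = max(1, int(np.sqrt(n_cols))), replace = False)  #random sampling of features
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, X):
    #Each tree votes using its own feature subset; the most common label wins
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([Counter(column).most_common(1)[0][0] for column in votes.T])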
In this tutorial, you will learn how to build your first random forest in Python. This article includes a real-world data set, a full codebase, and further instructions if you'd like to learn more about machine learning once you're finished.
You can skip to a specific section of this Python random forests tutorial using the table of contents below:
- The Data Set We Will Need For This Tutorial
- The Imports We Will Need For This Tutorial
- Importing The Data Set Into Our Python Script
- Building and Training our Decision Tree Model
- Making Predictions Using Our Decision Tree Model
- Measuring the Performance of Our Decision Tree Model
- Building and Training Our Random Forests Model
- Making Predictions Using Our Random Forest Model
- Measuring the Performance of Our Random Forest Model
- The Full Code For This Tutorial
- Final Thoughts
In this tutorial, we will be using a data set of kyphosis patients and building a random forest algorithm to predict whether or not patients have the disease.
You'll need to download the data set before proceeding. I have uploaded the data set to my website to make this easy for you. Simply click here to download the file. Once it's downloaded, move the file to the appropriate directory and open a Jupyter Notebook.
We will be relying on a number of open-source software libraries to build our random forests model, including NumPy, pandas, and matplotlib. Let's start by importing those libraries with the following code:
#Numerical computing libraries
import pandas as pd
import numpy as np

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Now that our imports have been executed, we are ready to import our data set into our Python script.
You can import the kyphosis data set into your Python script using pandas'
read_csv method, like this:
raw_data = pd.read_csv('kyphosis-data.csv')
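Optionally, if you'd like a quick look at the first few rows of the data before moving on, the DataFrame's head method works well here:

raw_data.head()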
Let's take a look at the features included in this data set. You can print the column names by accessing the DataFrame's columns attribute:

raw_data.columns

This generates the following output:

Index(['Kyphosis', 'Age', 'Number', 'Start'], dtype='object')
This data set represents a group of patients that previously had kyphosis, and then were tested again after having back surgery.
The Kyphosis column contains a value of either present or absent depending on whether or not the patient had kyphosis after the operation, while the Age column contains the patient's age in months. The Number column indicates the number of vertebrae involved in the operation. The Start column describes the topmost vertebra that was operated on.
Now that we have imported our data set, let's move on to performing some exploratory data analysis.
Exploratory data analysis is the process of learning more about a data set before building machine learning models with it. It often involves calculating aggregate statistics or building visualizations.
Let's dig in to some brief exploratory data analysis before building and training our machine learning model.
One characteristic that machine learning engineers should always understand before building their models is the size of their data set.
pandas makes this very easy to determine. Simply invoke the info method on your pandas DataFrame, like this:

raw_data.info()

This generates the following output:

RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
Kyphosis    81 non-null object
Age         81 non-null int64
Number      81 non-null int64
Start       81 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.7+ KB
As you can see, there are 81 observations in this data set. This is a relatively small data set to be performing machine learning predictions on, but since this is simply an educational tutorial we are fine to proceed nonetheless.
Since the data set is fairly small, we can use the
seaborn library to easily visualize what is happening with each feature.
Here is the command to do this:
sns.pairplot(raw_data, hue = 'Kyphosis')
Here is the plot that this
seaborn command generates:
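Alongside the pairplot, a quick class balance check is another simple, optional way to explore a small data set like this one. This assumes the Kyphosis column holds the 'absent'/'present' labels described earlier:

raw_data['Kyphosis'].value_counts()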
Now that we have a sense of how our data set is structured, let's divide the data set into training data and test data.
Splitting The Data Set Into Training Data and Test Data
We will be using scikit-learn's train_test_split function combined with list unpacking to create our training data and test data. Specifically, we will be using a test size of 30%.
First, let's import the train_test_split function from scikit-learn:
from sklearn.model_selection import train_test_split
Next, we need to specify the x data and the y data from the data set. The
x data will be all of the data except for the
Kyphosis column, while the
y data will be the
Kyphosis column by itself.
Here are the Python statements to create this division in the data set:
x = raw_data.drop('Kyphosis', axis = 1)
y = raw_data['Kyphosis']
Lastly, here is the command to create our training-test splits:
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
We have successfully divided our data set into training data and test data.
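One optional tweak: train_test_split shuffles the data randomly on every run, so if you want the exact same split each time you re-run your notebook, you can pass a random_state value. The number 42 below is arbitrary:

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3, random_state = 42)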
Next up, we will continue this tutorial by building and training a decision tree algorithm on this data.
Later, we will also build a random forests model on the same training data and test data and see how its results compare with a more basic decision tree model.
The first thing we need to do is import the
DecisionTreeClassifier class from the
tree module of
scikit-learn. Run the following command to do so:
from sklearn.tree import DecisionTreeClassifier
Now we need to create an instance of this class and assign it to a variable named model:
model = DecisionTreeClassifier()
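As an aside, DecisionTreeClassifier accepts optional hyperparameters you can experiment with later. For example, max_depth limits how deep the tree can grow, which can help a small data set like this one avoid overfitting. The value 3 and the variable name below are purely illustrative; we will stick with the defaults for the rest of this tutorial:

illustrative_tree_model = DecisionTreeClassifier(max_depth = 3)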
Our model has been created. Now we need to train it using our training data. To do this, invoke the fit method on your model object and pass in x_training_data and y_training_data, as follows:

model.fit(x_training_data, y_training_data)
Our kyphosis model has been trained. Let's make some predictions using this model.
To make predictions using our model object, simply call the predict method on it and pass in the x_test_data variable. You can assign these predictions to a variable named predictions. More specifically, here is the code to do this:
predictions = model.predict(x_test_data)
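If you would also like the predicted class probabilities rather than just the class labels, scikit-learn classifiers expose a predict_proba method. This is optional and not used elsewhere in this tutorial:

predicted_probabilities = model.predict_proba(x_test_data)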
Now that our predictions have been made, let's assess the accuracy of our model using some of
scikit-learn's built-in functionality.
We will be using scikit-learn's built-in classification_report and confusion_matrix functions to assess the performance of our decision tree machine learning model.
First, let's import these functions:
from sklearn.metrics import classification_report from sklearn.metrics import confusion_matrix
Next, let's generate a classification_report for our model's predictions:

print(classification_report(y_test_data, predictions))

Here is the output of this report:

              precision    recall  f1-score   support

      absent       0.85      0.89      0.87        19
     present       0.60      0.50      0.55         6

    accuracy                           0.80        25
   macro avg       0.72      0.70      0.71        25
weighted avg       0.79      0.80      0.79        25
We can generate a confusion_matrix in a similar manner:

print(confusion_matrix(y_test_data, predictions))

Here is its output:

[[17  2]
 [ 3  3]]
Overall, our model seems to be doing a fairly good job of making predictions on our test data. It is only making incorrect predictions on 5 data points (2 false positives and 3 false negatives), as evidenced by the confusion matrix.
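If you find the raw confusion matrix hard to read, one optional trick is to wrap it in a labelled DataFrame so the rows (actual classes) and columns (predicted classes) are named explicitly. The label order below assumes 'absent' comes before 'present', which matches scikit-learn's default alphabetical ordering:

labels = ['absent', 'present']
labelled_matrix = pd.DataFrame(confusion_matrix(y_test_data, predictions, labels = labels),
                               index = ['actual absent', 'actual present'],
                               columns = ['predicted absent', 'predicted present'])
print(labelled_matrix)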
In the next section, we will begin building a random forests model whose performance we will compare to our
model object later in this tutorial.
To build our random forests model, we will first need to import the model from
scikit-learn. Here is the command to do this:
from sklearn.ensemble import RandomForestClassifier
Next, we need to create the random forests model.
Since we do not want to overwrite the model variable that we created earlier, we will not name this new model model. Instead, let's name it random_forest_model:
random_forest_model = RandomForestClassifier()
Note that the
RandomForestClassifier class has a parameter named
n_estimators that specifies the number of trees in the forest. Its default value is
100, but you can change this value if you'd like. We will be using the default value of
100 in this tutorial.
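If you ever want to experiment with a larger forest, you could pass n_estimators explicitly when creating the model. The value 200 and the variable name below are only illustrative:

illustrative_forest_model = RandomForestClassifier(n_estimators = 200)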
Now it's time to train the random forests model. To do this, we use the fit method, as before:

random_forest_model.fit(x_training_data, y_training_data)
Our random forest model has been trained. Let's move on to making some predictions with this new ensemble model.
Let's use the predict method to calculate some predictions using our random_forest_model object and assign them to a variable called random_forest_predictions:
random_forest_predictions = random_forest_model.predict(x_test_data)
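One optional side benefit of random forests is the feature_importances_ attribute, which is available once the model has been fitted and gives a rough sense of how much each column contributed to the model's decisions. Here is one way to view it, pairing the scores with the column names:

importances = pd.Series(random_forest_model.feature_importances_, index = x_training_data.columns)
print(importances.sort_values(ascending = False))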
We will assess the accuracy of these predictions next.
As we did with our basic decision tree model, let's generate a classification_report and a confusion_matrix for our random forest model.

Let's start with the classification_report:

print(classification_report(y_test_data, random_forest_predictions))

Here is the output of this report:

              precision    recall  f1-score   support

      absent       0.82      0.95      0.88        19
     present       0.67      0.33      0.44         6

    accuracy                           0.80        25
   macro avg       0.74      0.64      0.66        25
weighted avg       0.78      0.80      0.77        25
Now let's generate a confusion matrix:

print(confusion_matrix(y_test_data, random_forest_predictions))

Here is the output of this confusion matrix:

[[18  1]
 [ 4  2]]
In this case, our random forest has not performed significantly better than our decision tree model.
This is primarily because our data set is so small. In most cases, random forests will perform better than basic decision trees, especially as the data set that you're using to make predictions gets larger.
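If you'd like a more stable read on performance with such a small data set, one optional approach (not used elsewhere in this tutorial) is cross-validation using scikit-learn's cross_val_score function. Here is a minimal sketch that re-uses the x and y variables defined earlier; the choice of 5 folds is arbitrary:

from sklearn.model_selection import cross_val_score

#Fit and score a fresh random forest on 5 different train/test partitions
cross_validation_scores = cross_val_score(RandomForestClassifier(), x, y, cv = 5)
print(cross_validation_scores.mean())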
You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:
#Numerical computing libraries
import pandas as pd
import numpy as np

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

raw_data = pd.read_csv('kyphosis-data.csv')
raw_data.columns

#Exploratory data analysis
raw_data.info()
sns.pairplot(raw_data, hue = 'Kyphosis')

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = raw_data.drop('Kyphosis', axis = 1)
y = raw_data['Kyphosis']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Measure the performance of the decision tree model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

#Train the random forests model
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier()
random_forest_model.fit(x_training_data, y_training_data)
random_forest_predictions = random_forest_model.predict(x_test_data)

#Measure the performance of the random forest model
print(classification_report(y_test_data, random_forest_predictions))
print(confusion_matrix(y_test_data, random_forest_predictions))
In this tutorial, you learned how to build decision trees and random forests in Python.
Here is a brief summary of what you learned in this article:
- How to build a decision tree model using scikit-learn
- How to build a random forest model using scikit-learn
- That random forests are typically better predictors than decision trees, especially with large data sets