# Decision Trees and Random Forests in Python

Hey - Nick here! This page is a free excerpt from my new eBook Pragmatic Machine Learning, which teaches you real-world machine learning techniques by guiding you through 9 projects.

The random forest is a machine learning classification algorithm that consists of numerous decision trees.

Each decision tree in the random forest is built using a random subset of the features in the data set. Moreover, each tree is trained on a random sample of the data points.
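To make the two kinds of randomness concrete, here is a minimal sketch using only NumPy. The toy sizes and variable names are assumptions made for illustration. Note also that `scikit-learn` draws the random feature subset at each split inside a tree rather than once per tree, so treat this as a picture of the idea and not as the library's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

n_samples, n_features = 10, 3  # toy sizes, chosen only for illustration

# Bootstrap sample: draw row indices WITH replacement, so some rows repeat
# and others are left out of this tree's training data.
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)

# Random feature subset: draw column indices WITHOUT replacement.
# The square root of the feature count is a common choice for classification.
n_selected = max(1, int(np.sqrt(n_features)))
feature_subset = rng.choice(n_features, size=n_selected, replace=False)

print("rows used by this tree:", bootstrap_rows)
print("features considered:", feature_subset)
```

Repeating this draw once per tree is what makes the trees in a forest differ from one another.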

In this tutorial, you will learn how to build your first random forest in Python. This article includes a real-world data set, a full codebase, and further instructions if you’d like to learn more about machine learning once you’re finished.

## The Data Set We Will Need For This Tutorial

In this tutorial, we will be using a data set of kyphosis patients and building a random forest algorithm to predict whether or not patients have the disease.

## The Imports We Will Need For This Tutorial

We will be relying on a number of open-source software libraries to build our random forests model, including NumPy, pandas, and matplotlib. Let’s start by importing those libraries with the following code:

``````
# Numerical computing libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic: render plots inline (only valid inside a notebook)
%matplotlib inline

``````

Now that our imports have been executed, we are ready to import our data set into our Python script.

## Importing The Data Set Into Our Python Script

You can import the kyphosis data set into your Python script using pandas’ `read_csv` method, like this:

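``````
# NOTE: 'kyphosis.csv' is a placeholder file name; use the path to your copy of the data
raw_data = pd.read_csv('kyphosis.csv')
``````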

Let’s take a look at the features included in this data set:

``````
raw_data.columns

``````

This returns:

``````
Index(['Kyphosis', 'Age', 'Number', 'Start'], dtype='object')

``````

This data set represents a group of patients who previously had kyphosis and were tested again after having back surgery.

The `Kyphosis` column contains a value of `present` or `absent` depending on whether or not the patient had kyphosis, while the `Age` column contains the patient’s age in months. The `Number` column indicates the number of vertebrae involved in the operation. The `Start` column describes the topmost vertebra that was operated on.

Now that we have imported our data set, let’s move on to performing some exploratory data analysis.

## Exploratory Data Analysis

Exploratory data analysis is the process of learning more about a data set before building machine learning models with it. It often involves calculating aggregate statistics or building visualizations.

Let’s dig into some brief exploratory data analysis before building and training our machine learning model.

### Determining The Size Of The Data Set

One characteristic that machine learning engineers should always understand before building their models is the size of their data set.

`pandas` makes this very easy to determine. Simply invoke the `info` method on your pandas DataFrame like this:

``````
raw_data.info()

``````

This generates:

``````
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
Kyphosis    81 non-null object
Age         81 non-null int64
Number      81 non-null int64
Start       81 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.7+ KB

``````

As you can see, there are 81 observations in this data set. This is a relatively small data set for machine learning, but since this is an educational tutorial, we can proceed nonetheless.
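Alongside `info`, it can help to check the shape of the DataFrame and how many observations fall into each class. Here is a small sketch of both calls. The rows below are made up purely to mimic the columns of the kyphosis data set, and `sample_data` is a hypothetical name, so the printed counts say nothing about the real data.

```python
import pandas as pd

# Made-up rows that only mimic the columns of the kyphosis data set
sample_data = pd.DataFrame({
    "Kyphosis": ["absent", "absent", "present", "absent", "present", "absent"],
    "Age": [12, 85, 130, 2, 61, 37],
    "Number": [3, 4, 5, 3, 6, 3],
    "Start": [14, 15, 5, 16, 3, 12],
})

n_rows, n_columns = sample_data.shape  # (rows, columns)
class_counts = sample_data["Kyphosis"].value_counts()  # observations per class

print(n_rows, n_columns)
print(class_counts)
```

With the real data, you would call the same attribute and method on `raw_data`.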

### Visualizing the Data

Since the data set is fairly small, we can use the `seaborn` library to easily visualize what is happening with each feature.

Here is the command to do this:

``````
sns.pairplot(raw_data, hue = 'Kyphosis')

``````

This `seaborn` command generates a grid of pairwise plots of the features, with each point colored by its `Kyphosis` value.

Now that we have a sense of how our data set is structured, let’s divide the data set into training data and test data.

## Splitting The Data Set Into Training Data and Test Data

We will be using `scikit-learn`’s `train_test_split` function combined with list unpacking to create our training data and test data. Specifically, we will be using a test size of 30%.

First, let’s import the `train_test_split` function from `scikit-learn`:

``````
from sklearn.model_selection import train_test_split

``````

Next, we need to specify the `x` and `y` data from the data set. The `x` data will be all of the data except for the `Kyphosis` column, while the `y` data will be the `Kyphosis` column by itself.

Here are the Python statements to create this division in the data set:

``````
x = raw_data.drop('Kyphosis', axis = 1)
y = raw_data['Kyphosis']

``````

Lastly, here is the command to create our training-test splits:

``````
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

``````

We have successfully divided our data set into training data and test data.
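Because `train_test_split` shuffles the rows randomly, each run of the command above produces a different split, so your results will differ slightly from the ones shown in this tutorial. If you want a repeatable split, one option is to pass the `random_state` parameter, and optionally `stratify` to keep the class proportions similar in both parts. The sketch below uses synthetic stand-in data and hypothetical `demo_` names, and the 17/64 class mix is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the kyphosis data: 81 rows with an arbitrary class mix
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({
    "Age": rng.integers(1, 200, size=81),
    "Number": rng.integers(2, 10, size=81),
    "Start": rng.integers(1, 18, size=81),
})
demo_y = pd.Series(["present"] * 17 + ["absent"] * 64, name="Kyphosis")

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

print(len(demo_x_train), len(demo_x_test))
print(demo_y_test.value_counts())
```

With 81 rows and a test size of 30%, the test set contains 25 observations, which matches the `support` totals in the reports later in this tutorial.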

Next up, we will continue this tutorial by building and training a decision tree algorithm on this data.

Later, we will also build a random forests model on the same training data and test data and see how its results compare with a more basic decision tree model.

## Building and Training our Decision Tree Model

The first thing we need to do is import the `DecisionTreeClassifier` class from the `tree` module of `scikit-learn`. Run the following command to do so:

``````
from sklearn.tree import DecisionTreeClassifier

``````

Now we need to create an instance of this class and assign it to the variable `model`:

``````
model = DecisionTreeClassifier()

``````

Our model has been created. Now we need to train it using our training data.

This is done in the same way as with our linear regression, logistic regression, and K-nearest neighbors models earlier in this course: by using the `fit` method.

Invoke the `fit` method on your `model` object and pass in `x_training_data` and `y_training_data`, as follows:

``````
model.fit(x_training_data, y_training_data)

``````

Our kyphosis model has been trained. Let’s make some predictions using this model.
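If you are curious about what a trained decision tree has actually learned, `scikit-learn` provides an `export_text` function that prints the splits as nested rules. The sketch below is illustrative only: it trains on synthetic data from `make_classification`, borrows the kyphosis column names as labels, and limits the depth to 2 so the output stays short. The tutorial's own `model` uses the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-feature data standing in for Age, Number, and Start
demo_x, demo_y = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# max_depth=2 keeps the printed tree short; the tutorial's model uses the defaults
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(demo_x, demo_y)

# export_text renders the learned if/else splits as plain text
tree_rules = export_text(demo_tree, feature_names=["Age", "Number", "Start"])
print(tree_rules)
```

Each indented line is one threshold test on a feature, and each leaf shows the class that the tree predicts for observations that reach it.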

## Making Predictions Using Our Decision Tree Model

To make predictions using our `model` object, simply call the `predict` method on it and pass in the `x_test_data` variable. You can assign these predictions to a variable named `predictions`.

More specifically, here is the code to do this:

``````
predictions = model.predict(x_test_data)

``````

Now that our predictions have been made, let’s assess the accuracy of our model using some of `scikit-learn`’s built-in functionality.

## Measuring the Performance of Our Decision Tree Model

We will be using `scikit-learn`’s built-in functions `classification_report` and `confusion_matrix` to assess the performance of our decision tree machine learning model.

First, let’s import these functions:

``````
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

``````

Next, let’s generate a `classification_report`:

``````
print(classification_report(y_test_data, predictions))

``````

This generates:

``````
              precision    recall  f1-score   support

      absent       0.85      0.89      0.87        19
     present       0.60      0.50      0.55         6

    accuracy                           0.80        25
   macro avg       0.72      0.70      0.71        25
weighted avg       0.79      0.80      0.79        25

``````

We can generate a `confusion_matrix` in a similar manner:

``````
print(confusion_matrix(y_test_data, predictions))

``````

This generates:

``````
[[17  2]
 [ 3  3]]

``````

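Overall, our model is doing a reasonable job of making predictions on our test data. It classifies 20 of the 25 test observations correctly and makes incorrect predictions on only 5 data points (2 false positives and 3 false negatives, as evidenced by the `confusion_matrix`). Note, however, that it identifies only 3 of the 6 `present` cases, which is the 0.50 recall shown in the `classification_report`.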
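To see how the numbers in the `classification_report` follow from the `confusion_matrix`, here is a short worked calculation in plain Python. It assumes the usual `scikit-learn` layout, in which rows are true labels and columns are predicted labels in sorted order (`absent`, then `present`), and it treats `present` as the positive class. The variable names are chosen for illustration.

```python
# Confusion matrix from the tutorial: rows are true labels, columns are predictions,
# in the order [absent, present]
matrix = [[17, 2],
          [3, 3]]

true_negatives, false_positives = matrix[0]
false_negatives, true_positives = matrix[1]

total = true_negatives + false_positives + false_negatives + true_positives

# Share of all predictions that were correct
accuracy = (true_positives + true_negatives) / total
# Of the observations predicted "present", the share that really were "present"
precision_present = true_positives / (true_positives + false_positives)
# Of the observations that really were "present", the share the model found
recall_present = true_positives / (true_positives + false_negatives)

print(round(accuracy, 2), round(precision_present, 2), round(recall_present, 2))
```

These values match the `accuracy` row and the `present` row of the report above.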

In the next section, we will begin building a random forests model whose performance we will compare to our `model` object later in this tutorial.

## Building and Training Our Random Forests Model

To build our random forests model, we will first need to import the model from `scikit-learn`. Here is the command to do this:

``````
from sklearn.ensemble import RandomForestClassifier

``````

Next, we need to create the random forests model.

Since we do not want to overwrite the `model` variable that we created earlier, we will not name it `model`. Instead, let’s name it `random_forest_model`:

``````
random_forest_model = RandomForestClassifier()

``````

Note that the `RandomForestClassifier` class has a parameter named `n_estimators` that specifies the number of trees in the forest. Its default value is `100`, but you can change this value if you’d like. We will be using the default value of `100` in this tutorial.
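As a quick illustration of the `n_estimators` parameter, the sketch below builds a smaller forest of 25 trees on synthetic data and confirms how many trees were fitted. The data, the `small_forest` name, and the choice of 25 are assumptions made for this example and are not part of the tutorial's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data used only to make the sketch runnable
demo_x, demo_y = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# 25 trees instead of the default 100; random_state makes the forest repeatable
small_forest = RandomForestClassifier(n_estimators=25, random_state=0)
small_forest.fit(demo_x, demo_y)

# estimators_ holds the individual fitted decision trees
print(len(small_forest.estimators_))
```

More trees generally make the predictions more stable at the cost of longer training time.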

Now it’s time to train the random forests model. To do this, we use the `fit` method, as before:

``````
random_forest_model.fit(x_training_data, y_training_data)

``````

Our random forest model has been trained. Let’s move on to making some predictions with this new ensemble model.

## Making Predictions Using Our Random Forest Model

Let’s use the `predict` method to calculate some predictions using our `random_forest_model` object and assign them to a variable called `random_forest_predictions`:

``````
random_forest_predictions = random_forest_model.predict(x_test_data)

``````

We will assess the accuracy of these predictions next.
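It can be useful to know how a forest combines its trees into a single prediction. In `scikit-learn`, the class probabilities reported by a `RandomForestClassifier` are the mean of the probabilities predicted by its individual trees, and `predict` returns the class with the highest mean probability. The sketch below checks this on synthetic data. The `demo_` names and the choice of 10 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data used only to make the sketch runnable
demo_x, demo_y = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

demo_forest = RandomForestClassifier(n_estimators=10, random_state=0)
demo_forest.fit(demo_x, demo_y)

# Class probabilities reported by the forest for the first five rows
forest_probabilities = demo_forest.predict_proba(demo_x[:5])

# The same quantity computed by hand: the mean of each tree's probabilities
tree_probabilities = [tree.predict_proba(demo_x[:5]) for tree in demo_forest.estimators_]
averaged_probabilities = np.mean(tree_probabilities, axis=0)

print(np.allclose(forest_probabilities, averaged_probabilities))
```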

## Measuring the Performance of Our Random Forest Model

As we did with our basic decision tree model, let’s generate a `classification_report` and `confusion_matrix`.

Let’s start with the `classification_report`:

``````
print(classification_report(y_test_data, random_forest_predictions))

``````

Here is the output of this report:

``````
              precision    recall  f1-score   support

      absent       0.82      0.95      0.88        19
     present       0.67      0.33      0.44         6

    accuracy                           0.80        25
   macro avg       0.74      0.64      0.66        25
weighted avg       0.78      0.80      0.77        25

``````

Now let’s generate a confusion matrix:

``````
print(confusion_matrix(y_test_data, random_forest_predictions))

``````

Here is the output of this confusion matrix:

``````
[[18  1]
 [ 4  2]]

``````

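In this case, our random forest has not outperformed our decision tree model. Both reach 80% accuracy on the test data, and the random forest correctly identifies only 2 of the 6 `present` cases, compared with 3 of 6 for the decision tree.

This is most likely because our data set is small: with only 25 test observations, a single misclassification changes the accuracy by 4 percentage points. Random forests typically outperform individual decision trees, especially as the data set that you’re using to train them gets larger.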
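With a data set this small, a comparison based on a single train/test split is noisy. One way to get a steadier comparison is k-fold cross-validation, which `scikit-learn` exposes through `cross_val_score`. The sketch below is a hedged example on synthetic data, so its scores say nothing about the kyphosis data. With the real data, you would pass the tutorial's `x` and `y` instead of the hypothetical `demo_x` and `demo_y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; with the real data you would pass x and y from the tutorial instead
demo_x, demo_y = make_classification(
    n_samples=81, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_accuracy = {}
for name, estimator in candidates.items():
    # cv=5 splits the data into five folds and scores the model on each one in turn
    fold_scores = cross_val_score(estimator, demo_x, demo_y, cv=5)
    mean_accuracy[name] = fold_scores.mean()
    print(f"{name}: {mean_accuracy[name]:.2f}")
```

Averaging over five folds means every observation is used for testing exactly once, which makes the comparison less dependent on one lucky or unlucky split.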

## The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

``````
# Numerical computing libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import the data set
# NOTE: 'kyphosis.csv' is a placeholder file name; use the path to your copy of the data
raw_data = pd.read_csv('kyphosis.csv')
raw_data.columns

# Exploratory data analysis
raw_data.info()
sns.pairplot(raw_data, hue = 'Kyphosis')

# Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = raw_data.drop('Kyphosis', axis = 1)
y = raw_data['Kyphosis']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

# Train the decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

# Measure the performance of the decision tree model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

# Train the random forests model
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier()
random_forest_model.fit(x_training_data, y_training_data)
random_forest_predictions = random_forest_model.predict(x_test_data)

# Measure the performance of the random forest model
print(classification_report(y_test_data, random_forest_predictions))
print(confusion_matrix(y_test_data, random_forest_predictions))

``````

## Final Thoughts

In this tutorial, you learned how to build decision trees and random forests in Python.

Here is a brief summary of what you learned in this article:

• How to build a decision tree model using `scikit-learn`
• How to build a random forest model using `scikit-learn`
• That random forests are typically better predictors than decision trees - especially with large data sets