In this lesson, you'll learn how to create boxplots in Python using matplotlib.
The Imports We'll Need For This Lesson
As before, the code cells in the lesson will assume that you have already performed the following imports:
import matplotlib.pyplot as plt
import pandas as pd
The Dataset We Will Be Using In This Lesson
In our first lesson on using pyplot, we used fake datasets generated using NumPy's random number generator. While this can be useful for educational purposes, it is time for us to begin working with a real-world dataset.
The Iris dataset is so commonly used for machine learning and deep learning practice that it is actually included in many data visualization and statistical libraries for Python. However, we are not using any of those libraries. Because of this, we will import the Iris dataset manually.
To make things easy for you, I have uploaded a JSON file containing the Iris dataset to the GitHub repository for this course. You can find it in the folder iris with the filename iris.json.
You can import this dataset into your Python script using the following command:
import pandas as pd
iris_data = pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/python-visualization/master/iris/iris.json')
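Once the data is loaded, it is worth checking its shape and column types before plotting. Here is a minimal sketch of that check. It uses a tiny in-memory stand-in with made-up values (the name sample and its two records are hypothetical, not the real file), so it runs without a network connection. The same two calls should work on iris_data.

```python
import io

import pandas as pd

# Hypothetical two-record sample with made-up values; the real file has many more rows.
sample_json = io.StringIO(
    '[{"sepalWidth": 3.5, "petalWidth": 0.2, "species": "setosa"},'
    ' {"sepalWidth": 3.0, "petalWidth": 1.4, "species": "versicolor"}]'
)

sample = pd.read_json(sample_json)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # shows which columns are numeric and which are text
```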
The Iris dataset is a collection of measurements for iris flowers. Its data fields include numerical measurements such as sepalWidth and petalWidth, along with a species label.
It is an ideal candidate for creating boxplots using matplotlib.
How To Create Boxplots in Python Using Matplotlib
We will now learn how to create a boxplot using Python. Note that boxplots are sometimes called 'box and whisker' plots, but I will be referring to them as boxplots throughout this course.
First, what is a boxplot?
A boxplot is a chart that has the following image for each variable (like sepalWidth or petalWidth) in a dataset:
Each specific component of this boxplot has a very well-defined meaning. They are labeled in the following image.
For those unfamiliar with the terminology of this diagram, the terms are described below:
Q1: The first quartile of the dataset. 25% of values lie below this level.
Q2: The second quartile of the dataset. 50% of values lie below this level and 50% lie above it.
Q3: The third quartile of the dataset. 25% of values lie above this level.
Minimum: The boxplot 'minimum', defined as Q1 minus 1.5 times the interquartile range.
Maximum: The boxplot 'maximum', defined as Q3 plus 1.5 times the interquartile range.
Median: The midpoint of the dataset. This is the same value as Q2.
Interquartile range: the distance between Q1 and Q3.
Outliers: data points that lie below the boxplot minimum or above the boxplot maximum.
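To make these definitions concrete, here is a small worked sketch that computes each quantity by hand. It uses Python's built-in statistics module and a made-up list of numbers (not the Iris data). The choice of method="inclusive" is an assumption for this example, since different quartile conventions can give slightly different values.

```python
import statistics

# Made-up values for illustration only; 30 is deliberately extreme.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Split the data into four parts; the three cut points are Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                      # interquartile range
boxplot_min = q1 - 1.5 * iqr       # the boxplot 'minimum'
boxplot_max = q3 + 1.5 * iqr       # the boxplot 'maximum'

# Outliers fall outside the range from boxplot_min to boxplot_max.
outliers = [v for v in values if v < boxplot_min or v > boxplot_max]

print(q1, q2, q3)                  # 4.0 6.0 8.0
print(iqr)                         # 4.0
print(boxplot_min, boxplot_max)    # -2.0 14.0
print(outliers)                    # [30]
```

Notice that Q2 matches the median of the list, and that only the value 30 is far enough from the box to count as an outlier.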
So how can we actually create a boxplot using matplotlib?
First, we will have to drop any non-numerical columns from the iris_data DataFrame.
The only column that is non-numerical is species.
We can drop species from iris_data using the drop method, like this:
iris_data = iris_data.drop('species', axis=1)
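If you would like to see what drop does without touching the real data, here is a minimal sketch on a small stand-in DataFrame. The name df and its values are made up for illustration. It also shows select_dtypes as one possible alternative when you do not want to name the text columns yourself.

```python
import pandas as pd

# Hypothetical stand-in for iris_data, with made-up values.
df = pd.DataFrame({
    "sepalWidth": [3.5, 3.0, 3.2],
    "petalWidth": [0.2, 1.4, 2.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# drop returns a new DataFrame; df itself is left unchanged.
dropped = df.drop("species", axis=1)

# Alternative: keep every numeric column, whatever the others are called.
numeric_only = df.select_dtypes(include="number")

print(list(dropped.columns))
print(list(numeric_only.columns))
```

Both approaches leave only the numerical columns here. The drop version is more explicit, while select_dtypes may be handier when a dataset has several non-numerical columns.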
Now that the dataset contains only numerical values, we are ready to create our first boxplot!
You can create a boxplot using matplotlib's boxplot function, like this:
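As a minimal sketch of such a call, the snippet below passes a small stand-in DataFrame straight to plt.boxplot. The name numeric_df and its values are made up for illustration, and the Agg backend line is an assumption so that the snippet runs in a script without a display (you can remove it in a notebook). With the real data you would pass the numeric iris_data instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; remove this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the numeric iris_data, with made-up values.
numeric_df = pd.DataFrame({
    "sepalWidth": [3.5, 3.0, 3.2, 3.1],
    "petalWidth": [0.2, 1.4, 2.3, 1.8],
})

# boxplot returns a dictionary of the pieces it drew (boxes, whiskers, and so on).
result = plt.boxplot(numeric_df)

# Compare the number of boxes with the number of rows and columns in the data.
print(len(result["boxes"]), numeric_df.shape)
```

Counting the boxes in the returned dictionary is a quick way to check whether matplotlib drew one box per row or one box per column.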
The resulting chart looks like this:
As you've probably guessed, this is not what we wanted our boxplot to look like! What is the solution?
If you look closely at this chart, it becomes clear that matplotlib has drawn one box for each row of the DataFrame, not one box for each column. The solution for this is to transpose the DataFrame using the transpose method.
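If transposing is new to you, this small sketch shows what it does to a DataFrame's layout. The name df and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in with made-up values: 3 rows (flowers), 2 columns (measurements).
df = pd.DataFrame({
    "sepalWidth": [3.5, 3.0, 3.2],
    "petalWidth": [0.2, 1.4, 2.3],
})

transposed = df.transpose()  # df.T is an equivalent shorthand

print(df.shape)                 # (3, 2)
print(transposed.shape)         # (2, 3)
print(list(transposed.index))   # ['sepalWidth', 'petalWidth']
```

After transposing, each measurement becomes a row, and each flower becomes a column.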
You can either do this in separate lines, like this: