# How To Create Histograms in Python Using Matplotlib

Hey - Nick here! This page is a free excerpt from my \$199 course Python for Finance, which is 50% off for the next 50 students.

In this lesson, you will learn how to create histograms in Python using matplotlib.

## The Imports You Will Need For This Lesson

As before, you will require the following imports to be able to complete this lesson:

``````import matplotlib.pyplot as plt

%matplotlib inline

import pandas as pd

from IPython.display import set_matplotlib_formats

set_matplotlib_formats('retina')``````

You will also need the iris data set. You can import the Iris data set with the following code:

``iris_data = pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/python-visualization/master/iris/iris.json')``

## What is a Histogram?

A histogram is a visualization that shows the prevalence of numerical data over a distribution. The horizontal axis is a particular value, and the vertical axis displays how many times that value was observed in the data set.

An example of a histogram is below. As you'll see, colors are used to differentiate between two subsets within the data.

One key concept that you must understand when working with histograms is the idea of `bins` - how many parts the total range of the data set is divided into. Changing the number of `bins` in a histogram does not change the data set. It only changes the appearance of the data in the histogram.

An example is helpful. Below, you can see two histograms. The histogram on the left has 50 bins and the histogram on the right has 10 bins.

In the next section, you'll learn how to create histograms in Python using matplotlib.

## How To Create Histograms in Python Using Matplotlib

We can create histograms in Python using matplotlib with the `hist` method.

The `hist` method can accept a few different arguments, but the most important two are:

• `x`: the data set to be displayed within the histogram.
• `bins`: the number of bins that the histogram should be divided into.

Let's create our first histogram using our `iris_data` variable.

We will first need to remove all non-numeric columns from the data set. Since the only non-numeric column is `species`, we can drop `species` from the DataFrame with the following command:

``iris_data.drop('species',axis=1)``

We can either assign this to a new variable (using a command like `new_iris_data = iris_data.drop('species',axis=1)`) or we can pass it directly into the `plt.hist()` method. I prefer the second option:

``plt.hist(iris_data.drop('species',axis=1))``

Here is what the resulting chart looks like:

This does not look right!

The reason why is because this histogram is plotting along the rows instead of along the columns. We can fix this by applying the `transpose` method to the DataFrame, like this:

``plt.hist(iris_data.drop('species',axis=1).transpose())``

This is a good start, but we can significantly improve on the appearance of this graph by adding a title to the graph, titles to its axes, and a legend. I used the following code to all these elements:

``````plt.legend(iris_data.drop('species', axis=1).columns)

plt.title('The frequency of different sepal and petal lengths and widths from the Iris data set.', fontsize=20)

plt.ylabel('Frequency', fontsize=20)

plt.xlabel('Length', fontsize=20)``````

You now have an understanding of the basics of how to create histograms in Python using matplotlib. In the next section, we will learn how to use histograms to assess subcategories of a data set (similar to our discussion of subsets in the scatterplot lesson).

## How To Assess Categorical Data Using Histograms in Python With Matplotlib

First, let's create three new data sets. The data sets will be the `sepalWidth` observation split across the three species in the data set: `setosa`, `versicolor`, and `virginica`.

You can do this with the following code:

``````iris_data = pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/python-visualization/master/iris/iris.json')

setosa_data = iris_data[iris_data['species'] == 'setosa']['sepalWidth']

versicolor_data = iris_data[iris_data['species'] == 'versicolor']['sepalWidth']

virginica_data = iris_data[iris_data['species'] == 'virginica']['sepalWidth']``````

Next, let's plot a histogram of this data using the `hist` method. Instead of passing in one value for `x`, pass in a list whose elements are `setosa_data`, `versicolor_data`, and `virginica_data`, like this:

``plt.hist([setosa_data,versicolor_data,virginica_data])``

As you can see, this chart makes it relatively easy to see trends for `sepalWidth` among each species. This effect becomes even more pronounced if you increase the histogram's bin count, like this:

``plt.hist([setosa_data,versicolor_data,virginica_data], bins = 30)``

As before, the chart becomes much easier to interpret if we add a chart title, axis titles, and a legend.

We can do this with the following code:

``````plt.legend(['Setosa','Versicolor','Virginica'], fontsize=20)

plt.title('Differences in Sepal Width for the 3 flower species in the Iris data set.', fontsize=20)

plt.ylabel('Frequency', fontsize=20)

plt.xlabel('Length', fontsize=20)``````

## Moving On

In this lesson, you learned how to create histograms in Python using matplotlib.