Pandas loc vs. iloc for Accessing Data in Python

Python is widely regarded as one of the leading programming languages for data science.

This is largely because of its rich ecosystem. Python has countless open-source libraries that make it quick and easy to integrate common functionality into your applications.

Pandas is one of these libraries. Known for its tabular data structure, the Pandas DataFrame (which is similar to a spreadsheet that you can manipulate in Python), Pandas is the de facto tool for working with tabular data.

In this tutorial, we will get familiar with two widely used indexers for selecting and filtering data in Pandas DataFrames: loc and iloc.

Installing Pandas and Accessing a Data Set

Pandas is an open-source Python library. This means that anyone can use (and contribute to) it.

In case you do not have Pandas, you can install it by running the following statement from your command line:

pip install pandas

We'll need a dataset to get familiar with the tools we will be going over today. In this tutorial, we'll be using the World Happiness Report 2015 dataset from Kaggle. Feel free to use your own dataset if you'd like.

Let’s download the dataset (or one of your choice) and load it into our script:

import pandas as pd
df = pd.read_csv('dataset.csv')
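If you do not have the CSV file at hand, a small stand-in can help you follow the rest of the tutorial. The sketch below reads a few rows from an in-memory string with pd.read_csv. The name sample_df and all of the values are placeholders invented for illustration; they are not taken from the World Happiness Report.

```python
import io

import pandas as pd

# Hypothetical sample rows; the numbers are placeholders for illustration only.
csv_text = """Country,Region,Happiness Rank,Happiness Score
Switzerland,Western Europe,1,7.5
Denmark,Western Europe,3,7.4
Panama,Latin America and Caribbean,25,6.7
"""

# read_csv accepts any file-like object, so a StringIO behaves like a file.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)  # (number of rows, number of columns)
print(sample_df.head())
```

The same calls (shape, head, and the indexers covered later) behave identically on the full dataset.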

If we look at the shape of the dataset, we can see that it has 158 rows and 12 columns.

df.shape

This generates:

(158, 12)

Let’s take a look at the first 5 rows of the dataset using pandas' head method:

df.head()

The first 5 rows of our data set

With the data loaded, we are ready to look at the different ways of selecting and filtering data from the dataset.

Selecting Data from Dataframes: A Rudimentary Approach

Selecting Columns by Column Name(s)

Let’s retrieve the first five rows of the columns Country and Happiness Score. To do this, we'll first select the two columns using square bracket indexing and then call the head method on the result:

df[['Country','Happiness Score']].head()

This code generates:

The first 5 rows of the country and happiness score columns

In this code, we specify a list of the columns we want to retrieve. We can add more columns to the output DataFrame by appending the required column names to the list.
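One detail worth seeing in isolation is the difference between single and double square brackets. The sketch below uses a tiny hand-made DataFrame (the values are placeholders, not data from the report) to show that a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical sample data; values are placeholders, not from the real report.
df = pd.DataFrame({
    "Country": ["Switzerland", "Denmark", "Panama"],
    "Happiness Score": [7.5, 7.4, 6.7],
})

one_column = df["Country"]    # a single label returns a Series
as_frame = df[["Country"]]    # a list of labels returns a DataFrame

print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame
```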

Selecting Columns by Data Type(s)

Let's now consider how we can retrieve columns by their datatype. To do this, we pass a list of the datatypes of the columns we require as the include parameter to the select_dtypes() function.

Let’s retrieve all string columns and print their first 5 rows using the head method:

df.select_dtypes(include = ['object']).head()

This code generates:

All string columns of our data set

Note that we pass the datatype ‘object’ rather than ‘str’, because Pandas has traditionally stored string columns with the object dtype. We will now retrieve both string and float columns.

df.select_dtypes(include = ['object', 'float']).head()

This code generates:

All string and float columns of our data set

This returns every column whose dtype is object or float. Any column read as another type, such as an integer column, is left out of the result.
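To see how select_dtypes treats different numeric types, here is a minimal sketch on a hand-made DataFrame (placeholder values). It assumes one integer column and one float column, and compares the 'float' selector with the broader 'number' selector.

```python
import pandas as pd

# Hypothetical sample data; values are placeholders.
df = pd.DataFrame({
    "Country": ["Switzerland", "Denmark"],
    "Happiness Rank": [1, 3],        # stored as integers
    "Happiness Score": [7.5, 7.4],   # stored as floats
})

print(df.dtypes)

# 'float' matches floating-point columns only.
float_cols = df.select_dtypes(include=["float"]).columns.tolist()

# 'number' matches every numeric column, integers included.
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()

print(float_cols)    # ['Happiness Score']
print(numeric_cols)  # ['Happiness Rank', 'Happiness Score']
```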

Selecting Columns by Filtering

We can also use the filter() method to retrieve the columns whose names contain a substring of our choice.

df.filter(like='Score').head()

This generates:

Columns with a specific substring

If the substring appears in more than one column name, filter() returns all of the matching columns.
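The following sketch illustrates this on a small hand-made DataFrame (placeholder values). It also shows the regex parameter of filter(), which can narrow the match when a plain substring is too broad.

```python
import pandas as pd

# Hypothetical sample data; values are placeholders.
df = pd.DataFrame({
    "Country": ["Switzerland", "Denmark"],
    "Happiness Rank": [1, 3],
    "Happiness Score": [7.5, 7.4],
    "Freedom": [0.6, 0.6],
})

# 'like' keeps every column whose name contains the substring.
like_cols = df.filter(like="Happiness").columns.tolist()

# 'regex' keeps every column whose name matches the pattern.
regex_cols = df.filter(regex="Score$").columns.tolist()

print(like_cols)   # ['Happiness Rank', 'Happiness Score']
print(regex_cols)  # ['Happiness Score']
```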

Now that we have a fair idea of how to retrieve data from a DataFrame, we will look at two of the most versatile indexers built into Pandas: iloc and loc.

Selecting Data from Dataframes: iloc

iloc in Pandas is used to make selections based on integer positions (the i in iloc stands for integer). That means we can retrieve data by the position of its rows and columns in the DataFrame.

Both the rows and the columns of a DataFrame are numbered starting from 0, and iloc uses these positions to retrieve data.

Let us go through an example first.

We will pick the first row of the dataset.

df.iloc[[0]]

This generates:

The first row of our data set

We can also pick multiple rows using the same method.

df.iloc[[0,25,30]]

This generates:

3 different rows from our data set

We can also pick the last row without knowing how many rows there are. To do this, pass the position -1 to the pandas iloc indexer:

df.iloc[[-1]]

This generates:

The last row of our pandas DataFrame

We can select columns from our DataFrame using similar logic.

If we look at the structure of iloc, we specify the rows we want in the first list and the columns we want in the second list. Although this might look tricky, selecting a column is as simple as retrieving a row: just specify : in place of your list of rows!

Here's an example:

df.iloc[ :, [0] ].head()

This generates:

The first column of our pandas DataFrame

We can also retrieve the last column by passing -1 to the list as we did before.

df.iloc[ :, [-1] ].head()

This generates:

The last column of our pandas DataFrame

We can combine both the row and columns lists we learned to retrieve data elements from various positions of the dataset. Let’s say we want to retrieve the first cell, which happens to be the data element in the first column and the first row.

We can retrieve the first cell using the following statement:

df.iloc[ [0], [0] ]

This generates:

The first cell of our pandas DataFrame

We can play around with this to retrieve various data elements. Let’s retrieve the Happiness Rank and Happiness Score for the rows at positions 0 and 25 and for the last row of the dataset. Counting from 0, the Happiness Rank and Happiness Score columns sit at positions 2 and 3 (they are the third and fourth columns of our DataFrame).

We can retrieve these data with the following code:

df.iloc[[0, 25, -1], [2, 3]]

This generates:

Random data from our data set

If we want to select a range of rows or columns, we can slice the indexer the same way we slice Python lists.

Let’s select Happiness Rank and Happiness Score data from the rows 30 to 34.

df.iloc[30:35, [2, 3]]

This generates:

Data from certain rows of our pandas DataFrame

One important thing to keep in mind is that a range always runs from the starting number to the number right before the ending number. Said differently, the first number is inclusive while the second number is exclusive. For instance, the slice 0:5 covers positions 0 through 4. Note that this is different from the list [0, 5], which selects only positions 0 and 5.
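The difference between a slice and a list of positions is easy to check on a toy DataFrame. The sketch below uses a hypothetical ten-row frame with a single column named value.

```python
import pandas as pd

# Hypothetical ten-row DataFrame with the default 0..9 index.
df = pd.DataFrame({"value": range(10)})

sliced = df.iloc[0:5]     # slice: positions 0, 1, 2, 3, 4 (5 is excluded)
picked = df.iloc[[0, 5]]  # list: exactly positions 0 and 5

print(sliced.index.tolist())  # [0, 1, 2, 3, 4]
print(picked.index.tolist())  # [0, 5]
```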

Now that we have familiarized ourselves with iloc, let us look at the next indexer in line: loc.

Selecting Data from Dataframes: loc

loc in Pandas can be used in two main ways.

We can use loc to retrieve data elements based on their indices or labels, and we can use loc to retrieve data using conditional statements.

We'll look at each of these separately next.

Index-Based Data Retrieval

Recall from the last section that iloc retrieves data based on its position in the DataFrame we are working with.

For instance, if we execute df.iloc[[0]], it will output the first row from the dataframe. Even if the index of that row was a different number than 0, it would have returned the first row from the dataframe as we would perceive it.

The loc indexer works a little differently. It always selects data elements by their index labels rather than by their positions. Let’s consider an example.

df.loc[[0]]

This generates:

The first row of our DataFrame

This is the same result that iloc produced in the previous section, because the DataFrame's default index labels are equal to the integer position of each row.

Let's change the index of the dataframe to a different field. More specifically, let's set it to Happiness Rank.

df.set_index('Happiness Rank', inplace=True)
df

This generates:

Changing the index of our DataFrame

We can see that the index has been changed to Happiness Rank and that it now starts from 1, as opposed to the original index, which started from 0.

Let us now run iloc and loc again.

  • iloc
df.iloc[[0]]

This generates:

A pandas DataFrame

As we can see, the result is the same as before except for the index, which is 1 now.

Let us run the same for loc.

  • loc
df.loc[[0]]

This generates:

KeyError: "None of [Int64Index([0], dtype='int64', name='Happiness Rank')] are in the [index]"

As we can see, running the same statement with the loc indexer gives us a KeyError exception.

This happens because no row has the index label 0 in the updated DataFrame: we changed the index to Happiness Rank, which starts from 1 instead of 0.

We can use a label from the updated index instead to retrieve the same row as before.

df.loc[[1]]

This generates:

Parsing a pandas DataFrame
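When a label may or may not exist, it can help to check the index before calling loc, or to catch the KeyError. The sketch below is one way to do this, assuming a small hand-made DataFrame (placeholder values) indexed by Happiness Rank.

```python
import pandas as pd

# Hypothetical sample data indexed by rank, starting from 1.
df = pd.DataFrame(
    {"Country": ["Switzerland", "Iceland", "Denmark"]},
    index=pd.Index([1, 2, 3], name="Happiness Rank"),
)

# Membership tests on the index look at labels, not positions.
print(0 in df.index)  # False
print(1 in df.index)  # True

try:
    df.loc[[0]]  # no row carries the label 0
except KeyError as err:
    print("Label not found:", err)

first = df.loc[[1]]  # label 1 exists, so this succeeds
print(first)
```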

Let's change the index to the Country field now. First, however, we have to reset the index so that Happiness Rank becomes a normal column again.

df.reset_index(level=0, inplace=True)
df

Now we can see that the Happiness Rank is back as a field:

A pandas DataFrame after resetting the index

We could move the column back to its original position, but that is not necessary when we are using loc, so we will not spend time on it. Let’s set the index to the Country field now.

df.set_index('Country', inplace=True)
df

This generates:

A pandas DataFrame after setting the index to Country

Let us run iloc to retrieve the first row.

df.iloc[[0]]

Output:

a pandas DataFrame

To retrieve the same data using loc, we have to use the index label, which in this case is ‘Switzerland’.

df.loc[['Switzerland']]

This generates:

a pandas DataFrame

With that, we now know how differently iloc and loc behave when they retrieve elements.
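The contrast can be condensed into a short sketch on a hand-made DataFrame indexed by country (placeholder values). It also illustrates one further difference that is easy to miss: an iloc slice excludes its end position, while a loc slice includes its end label.

```python
import pandas as pd

# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {"Happiness Score": [7.5, 7.4, 6.7]},
    index=pd.Index(["Switzerland", "Denmark", "Panama"], name="Country"),
)

by_position = df.iloc[[0]]           # first row, whatever its label is
by_label = df.loc[["Switzerland"]]   # the row labelled 'Switzerland'
print(by_position.equals(by_label))  # True

# Position slices stop before the end; label slices include the end label.
position_slice = df.iloc[0:2]
label_slice = df.loc["Switzerland":"Panama"]

print(position_slice.index.tolist())  # ['Switzerland', 'Denmark']
print(label_slice.index.tolist())     # ['Switzerland', 'Denmark', 'Panama']
```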

We will now look at a few use cases for the updated DataFrame, which has Country as its index.

Let’s retrieve data for two countries, Denmark and Panama.

df.loc[['Denmark','Panama']]

This generates:

Two rows of a pandas DataFrame

If we are writing functions that look up data by a key, we can set that field as the index and retrieve rows with loc instead of writing an explicit query each time.
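As a rough sketch of that idea, the hypothetical helper below (the name lookup and its parameters are assumptions made for illustration) wraps loc so that callers pass index labels and, optionally, column names. The sample values are placeholders.

```python
import pandas as pd


def lookup(frame, labels, columns=None):
    """Hypothetical helper: return the rows with the given index labels.

    If columns is given, only those columns are returned.
    """
    if columns is None:
        return frame.loc[labels]
    return frame.loc[labels, columns]


# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {
        "Region": ["Western Europe", "Western Europe", "Latin America and Caribbean"],
        "Happiness Score": [7.5, 7.4, 6.7],
    },
    index=pd.Index(["Switzerland", "Denmark", "Panama"], name="Country"),
)

print(lookup(df, ["Denmark", "Panama"]))
print(lookup(df, ["Denmark", "Panama"], ["Happiness Score"]))
```

Passing the labels as a list keeps the return type a DataFrame even when only one row is requested.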

Let’s retrieve the Region and Happiness Score of Australia and Sweden.

df.loc[ ['Australia','Sweden'], ['Region', 'Happiness Score'] ]

This generates:

A subset of a pandas DataFrame

Instead of typing the column names, we can also look them up by position in df.columns. Before we do that, let’s get familiar with the df.columns property.

df.columns

This generates:

Index(['Happiness Rank', 'Region', 'Happiness Score', 'Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual'], dtype='object')

As you can see, this is a pandas Index object - which is similar to a Python list - that contains the column names from the pandas DataFrame.

We can access a column name by its position as follows.

df.columns[0]

This returns:

'Happiness Rank'

Let’s now retrieve the Region and Happiness Score for the same two countries as before, Australia and Sweden.

df.loc[ ['Australia','Sweden'], [df.columns[1],df.columns[2]] ]

This generates:

A subset of a pandas DataFrame
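Two related conveniences are worth a quick sketch: slicing df.columns to get several names at once, and using get_loc to go in the other direction, from a name to its position. The DataFrame below is hand-made with placeholder values.

```python
import pandas as pd

# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {
        "Happiness Rank": [10, 8],
        "Region": ["Australia and New Zealand", "Western Europe"],
        "Happiness Score": [7.3, 7.4],
    },
    index=pd.Index(["Australia", "Sweden"], name="Country"),
)

# Slicing the columns Index returns the names at positions 1 and 2.
wanted = df.columns[1:3]
print(wanted.tolist())  # ['Region', 'Happiness Score']

# loc accepts those names just like a hand-typed list.
subset = df.loc[["Australia", "Sweden"], wanted]
print(subset)

# get_loc converts a column name back into its integer position.
region_pos = df.columns.get_loc("Region")
print(region_pos)  # 1
```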

We will now look at how to retrieve data based on different conditions.

Condition-Based Data Retrieval

The obvious way to retrieve data elements based on their values is to loop over the rows with a for loop and test each one with an if statement.

Let’s say we are trying to find the rows with Health (Life Expectancy) above 0.95. Instead of using for loops, we can find them in a much more efficient way by passing a condition to the loc indexer.

df.loc[ df['Health (Life Expectancy)']>0.95 ]

This generates:

Conditional retrieval from a pandas DataFrame
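It may help to see what the condition inside the brackets actually is. In the sketch below (hand-made DataFrame, placeholder values), the comparison produces a boolean Series, and loc keeps the rows where it is True. A plain Python loop that yields the same labels is included for comparison.

```python
import pandas as pd

# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {"Health (Life Expectancy)": [0.94, 0.99, 0.80]},
    index=pd.Index(["Switzerland", "Japan", "Panama"], name="Country"),
)

# The comparison is evaluated for every row at once.
mask = df["Health (Life Expectancy)"] > 0.95
print(mask.tolist())  # [False, True, False]

selected = df.loc[mask]

# Equivalent loop-and-if version, shown only for comparison.
loop_labels = [
    label
    for label, value in df["Health (Life Expectancy)"].items()
    if value > 0.95
]

print(selected.index.tolist())  # ['Japan']
print(loop_labels)              # ['Japan']
```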

Let’s say we only want the Happiness Rank and the Generosity of these countries. We can add the required column names as follows.

df.loc[df['Health (Life Expectancy)']>0.95 , ['Happiness Rank', 'Generosity']]

This generates:

conditional retrieval pandas DataFrame

We can also combine multiple conditions to specify what data we are looking for. Let us retrieve the countries that have a Health (Life Expectancy) value above 0.95 and are also in the Eastern Asia region.

df.loc[(df['Health (Life Expectancy)'] > 0.95) & (df['Region'] == 'Eastern Asia')]

This outputs:

multiple conditions in a pandas DataFrame
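Conditions are combined with & (and), | (or), and ~ (not) rather than the Python keywords and, or, and not. Each comparison needs its own parentheses because & and | bind more tightly than comparison operators. The sketch below uses a hand-made DataFrame with placeholder values and names each mask to keep the expressions readable.

```python
import pandas as pd

# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {
        "Region": ["Eastern Asia", "Western Europe", "Eastern Asia"],
        "Health (Life Expectancy)": [0.99, 0.96, 0.80],
    },
    index=pd.Index(["Japan", "Switzerland", "Mongolia"], name="Country"),
)

high_health = df["Health (Life Expectancy)"] > 0.95
east_asia = df["Region"] == "Eastern Asia"

both = df.loc[high_health & east_asia]    # rows meeting both conditions
either = df.loc[high_health | east_asia]  # rows meeting at least one
not_east = df.loc[~east_asia]             # rows outside Eastern Asia

print(both.index.tolist())      # ['Japan']
print(either.index.tolist())    # ['Japan', 'Switzerland', 'Mongolia']
print(not_east.index.tolist())  # ['Switzerland']
```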

Now that we have covered the basics and are familiar with the loc and iloc indexers, we will take a look at the types of the objects they return.

It is easy to assume that whenever we receive a table as output, it must be a DataFrame. However, this is not always true.

In reality, although the results may look alike, their actual types can differ. We will now take a look at a few of the types of Pandas objects that the iloc and loc indexers produce.

Let’s consider the following.

df.iloc[0]

Notice that we passed the value directly instead of passing a list containing the value.

A horizontal pandas DataFrame

Here, instead of a one-row DataFrame printed as a horizontal table, we get a pandas Series whose individual values can be accessed by their index labels (the column names).

Let us confirm this by executing:

type(df.iloc[0])

This returns:

pandas.core.series.Series

Let us observe what happens if we do the same by passing a list with the index.

type(df.iloc[[0]])

This returns:

pandas.core.frame.DataFrame

Although these differences may seem trivial, it is in real-world applications of these indexers that such basics prove useful.
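As a compact summary of the return types, the sketch below runs a few selections on a hand-made DataFrame (placeholder values) and prints the type of each result. The rule of thumb it illustrates is that a scalar selector drops a dimension, while a list selector keeps it.

```python
import pandas as pd

# Hypothetical sample data indexed by country name.
df = pd.DataFrame(
    {
        "Region": ["Western Europe", "Western Europe"],
        "Happiness Score": [7.5, 7.4],
    },
    index=pd.Index(["Switzerland", "Denmark"], name="Country"),
)

row_series = df.iloc[0]                    # scalar row selector: Series
row_frame = df.iloc[[0]]                   # list row selector: DataFrame
col_series = df.loc[:, "Happiness Score"]  # scalar column selector: Series
cell = df.iloc[0, 1]                       # scalar row and column: single value

print(type(row_series).__name__)  # Series
print(type(row_frame).__name__)   # DataFrame
print(type(col_series).__name__)  # Series
print(cell)                       # 7.5

# A row Series is indexed by the column names.
print(row_series["Happiness Score"])  # 7.5
```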

Final Thoughts

The Pandas library is one of the most important components of the data science ecosystem. Within pandas, loc and iloc are two of the most important indexers.

This tutorial showed you how to use both indexers in Python. Feel free to use this as a reference in your future data science projects.


Written on August 8th, 2020