Pandas loc vs. iloc for Accessing Data in Python
Python is widely considered the best programming language for data science.
This is largely because of its rich ecosystem. Python has countless open-source libraries that make it quick and easy to integrate common functionality into your applications.
Pandas is one of these libaries. Known for its tabular data structure the Pandas DataFrame - which is similar to a spreadsheet that you can manipulate in Python - Pandas is the de-facto tool for working tabular data.
In this tutorial, we will get ourselves familiar with two of the widely used functions to select and filter data from Pandas dataframes: loc
and iloc
.
Table of Contents
You can skip to a specific section of this pandas loc
vs. iloc
tutorial using the table of contents below:
Installing Pandas and Accessing a Data Set
Pandas is an open-source Python library. This means that anyone can use (and contribute to) it.
In case you do not have Pandas, you can install it by running the following statement from your command line:
pip install pandas
We'll need a dataset in order to get ourselves familiar with the functions we will be going over today. In this tutorial, we'll be using the World Happiness Report 2015 dataset from Kaggle. Fee free to use your own dataset if you'd like.
Let’s download a dataset of your choice and import to our script.
import pandas as pd
df = pd.read_csv('dataset.csv')
If we look at the shape of the dataset, we can see that it has 158 rows and 12 columns.
df.shape
This generates:
(158, 12)
Let’s take a look at the first 5 rows of the dataset using pandas' head
method:
df.head()
Now we are ready to get started. We'll now look at the different ways of selecting and filtering data from the dataset.
Selecting Data from Dataframes: A Rudimentary Approach
Selecting Columns by Column Name(s)
Let’s retrieve the first five rows of the columns Country
and Happiness Score
. To do this, we'll first select the two columns using square bracket parsing and then call the head
method on those columns:
df[['Country','Happiness Score']].head()
This code generates:
In this code, we specify a list of the columns we want to retrieve. We can add more columns to the output DataFrame by appending the required column names to the list.
Selecting Columns by Data Type(s)
Let's now consider how we can retrieve columns by their datatype. To do this, we pass a list of the datatypes of the columns we require as the include parameter to the select_dtypes()
function.
Let’s retrieve all string values from our columns and print their first 5 rows using the head
method:
df.select_dtypes(include = ['object']).head()
This code generates:
Instead of ‘str’, we are using the datatype ‘object’ here. We will now retrieve both string and float values.
df.select_dtypes(include = ['object', 'float']).head()
This code generates:
Since the whole dataset only consists of the above two data types, it will output all of the data set's columns.
Selecting Columns by Filtering
We can also use the filter()
function to retrieve data from columns that have a substring of our choice in its column name.
df.filter(like='Score').head()
This generates:
We will also receive multiple columns if the substring of choice is contained in any of the other column names.
Now that we have a fair idea about how to retrieve data from a dataframe, we will next look at two of the most versatile functions built into Pandas: iloc
and loc
.
Selecting Data from Dataframes: iloc
iloc
in Pandas is used to make selections based on integer (denoted by i in iloc) positions or indices. That means we can retrieve data by using the position at which its rows and columns are present in the dataframe.
We can visualize that the rows and columns of a dataframe are numbered from 0. This is the logic used to retrieve data using iloc.
Let us go through an example first.
We will pick the first row of the dataset.
df.iloc[[0]]
This generates:
We can also pick multiple rows using the same method.
df.iloc[[0,25,30]]
This generates:
We can also pick the last row without knowing how many rows there really are. To do this, pass the index -1
into the pandas iloc
function:
df.iloc[[-1]]
This generates:
We can select columns from our DataFrame using similar logic.
If we look at the structure of iloc, it's designed so we specify the intended rows we plan to receive in the first list and the intended columns in the second list. Although this might look tricky and complex, selecting a column really is as simple as retrieving a row - just specific :
for your list of rows!
Here's an example:
df.iloc[ :, [0] ].head()
This generates:
We can also retrieve the last column by passing -1 to the list as we did before.
df.iloc[ :, [-1] ].head()
This generates:
We can combine both the row and columns lists we learned to retrieve data elements from various positions of the dataset. Let’s say we want to retrieve the first cell, which happens to be the data element in the first column and the first row.
We can retrieve the first cell using the following statement:
df.iloc[ [0], [0] ]
This generates:
We can play around with this trying to retrieve various data elements. Let’s retrieve the Happiness Rank and Happiness Score for the 0th, 25th, and the last observations of the dataset. We can identify that the Happiness Rank and the Happiness columns are the 2nd and the 3rd columns of our dataframe.
We can retrieve these data with the following code:
df.iloc[[ 0 , 25 , -1 ],[ 2 , 3 ]]
This generates:
If we are to select rows and columns between a range, we can treat the indexer the same way we treat arrays.
Let’s select Happiness Rank and Happiness Score data from the rows 30 to 34.
df.iloc[ 30:35, [ 2 , 3 ]]
This generates:
One of the important things to keep in mind is that when we specify the range, it always runs from the starting number to the number right before the interval ending number. Said differently, the first number is inclusive while the second number is exclusive. For instance, if we say [0, 5], it means that we are considering a range that runs from 0 to 4.
Now that we have familiarized ourselves with iloc
, let us look at the next indexer in line: loc
.
Selecting Data from Dataframes: loc
loc
in Pandas can be used in two main ways.
We can use loc
to retrieve data elements based on their indices or labels, and we can use loc
to retrieve data using conditional statements.
We'll look at each of these separately next.
Index-Based Data Retrieval
If we recall what we learned in the last section, we understand that we used iloc
to retrieve data based on their position or the location in the dataframe we are working with.
For instance, if we execute df.iloc[[0]]
, it will output the first row from the dataframe. Even if the index of that row was a different number than 0, it would have returned the first row from the dataframe as we would perceive it.
However, when it comes to loc indexer, it works a little differently than the iloc. It will always call the data elements by their specified index. Let’s consider an example.
df.loc[[0]]
This generates:
The result was the same result that was produced by iloc earlier in the previous section. This is because the DataFrame's index is equal to the integer position of each row in the DataFrame.
Let's change the index of the dataframe to a different field. More specifically, let's set it to Happiness Rank.
df.set_index('Happiness Rank', inplace=True)
df
This generates:
We can see that the index has been changed to Happiness Rank and now it starts from 1
opposed to the original dataframe that started with 0
.
Let us now run iloc
and loc
again.
iloc
df.iloc[[0]]
This generates:
As we can see, the result is the same as before except for the index, which is 1 now.
Let us run the same for loc
.
loc
df.loc[[0]]
This generates:
KeyError: "None of [Int64Index([0], dtype='int64', name='Happiness
Rank')] are in the [index]"
As we can see, running the same statement with the loc indexer gives us a KeyError exception.
If we go back to the current df version, we will see that this happens because there is no index with 0 in the updated dataframe since we changed the index field to Happiness Rank which starts from 1 instead of 0 as before.
We can run the following with the updated index instead to retrieve the same element as before.
df.loc[[1]]
This generates:
Let's change the index to the Country field now. However, we have to make sure that we reset the index so that the Happiness Rank is normal column, as before.
df.reset_index(level=0, inplace=True)
df
Now we can see that the Happiness Rank is back as a field:
We can change the place of the column to be the same as before. However, we will not spend time on that as it is not a necessity here when we are using loc. Let’s set the index to the Country field now.
df.set_index('Country', inplace=True)
df
This generates:
Let us run iloc to retrieve the first row.
df.iloc[[0]]
Output:
If we are to retrieve the same data using loc, we will have to use the index value. In this case, it will be ‘Switzerland’.
df.loc[['Switzerland']]
This generates:
With that, now we know how differently iloc and loc behaves when they are retrieving elements.
We will now look at a few use cases that we can perform on the updated dataframe with the index as the Country.
Let’s retrieve data for two countries, Denmark and Panama.
df.loc[['Denmark','Panama']]
This generates:
If we are writing functions for our custom programs to retrieve data based on the index, instead of querying explicitly, we can set the index as the field we want and retrieve the information using loc
this way.
Let’s retrieve the Region and Happiness Score of Australia and Sweden.
df.loc[ ['Australia','Sweden'], ['Region', 'Happiness Score'] ]
This generates:
Instead of typing the column names, we can also pass the column index from df.columns
. Before we do that, let’s get ourselves familiar with the df.columns property.
df.columns
This generates:
Index(['Happiness Rank', 'Region', 'Happiness Score', 'Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual'], dtype='object')
As you can see, this is a pandas Index object - which is similar to a Python list - that contains the column names from the pandas DataFrame.
We can access the column names using the index as follows.
df.columns[0]
This returns:
'Happiness Rank'
Let’s now retrieve the Region and Happiness score as earlier from the same two countries, Australia and Sweden.
df.loc[ ['Australia','Sweden'], [df.columns[1],df.columns[2]] ]
This generates:
We will now look at how to retrieve data based on different conditions.
Condition-Based Data Retrieval
The usual and logical approach for us to retrieve a certain data element depending on its value could be a problem that can be solved using for loops and if statements.
Let’s say we are trying to find the rows with Health (Life Expectancy) above 0.95. Instead of using for loops, we can find these in a much more optimized way using the loc
indexer.
df.loc[ df['Health (Life Expectancy)']>0.95 ]
This generates:
Let’s say we only want the Happiness Rank and the Generosity of these countries. We can easily include the required column parameters as follows.
df.loc[df['Health (Life Expectancy)']>0.95 , ['Happiness Rank', 'Generosity']]
This generates:
We can also have multiple conditions specifying what data we are looking for. Let us retrieve countries above 0.95 Happiness Scores and also are from the Eastern Asian region.
df.loc[(df['Health (Life Expectancy)'] > 0.95) & (df['Region'] == 'Eastern Asia')]
This outputs:
Now that we have covered the basics and are familiar with the functionality of loc and iloc indexers, we will now take a look at their data types.
Most Python developers tend to think that whenever we receive a table as an output, it necessarily has to be a DataFrame. However, this is not actually true.
In reality, when we are retrieving tabular data, although they may look alike, their actual type can be different. We will now take a look at a few of the types of Pandas objects iloc and loc indexers produce.
Let’s consider the following.
df.iloc[0]
Notice that we have used the value straightaway instead of passing a list with the value.
In this, we realize that instead of printing the dataframe as a horizontal table, we can also access it as a pandas Series and have a link to individual values based on indices.
Let us confirm this by executing,
type(df.iloc[0])
This returns:
pandas.core.series.Series
Let us observe what happens if we do the same by passing a list with the index.
type(df.iloc[[0]])
This returns:
pandas.core.frame.DataFrame
Although these may seem trivial, it is at the real-world implementations of these functions and their basics we learn that will be useful.
Final Thoughts
The Pandas library is one of the most important components of the data science ecosystem. Within pandas, loc
and iloc
are two of the most important functions.
This tutorial showed you how to use both functions in Python. Feel free to use this as a reference in your future data science projects.