Exploratory data analysis (EDA) is used to analyze a data set to assess its nature and characteristics. EDA primarily uses visualization techniques to do this.
The objectives of EDA are
To understand the distribution of the dataset
To identify potential patterns and trends
To identify outliers or anomalies
To produce a hypothesis
To validate assumptions
To generate a visually-summarized representation of the dataset
Why is EDA Important?
Although many tend to skip it, exploratory data analysis is an essential step in any data science project. It helps you understand whether the model you’re applying to the data will be logical, well-defined, and scoped within the boundaries of its intended application.
EDA also helps to find anomalies, errors, noise, loss of data, and the overall validity of the data set with respect to the application.
Performing EDA helps you in modeling outcomes, determining the best-suited algorithm, finding new data patterns, and even defining which variables are dependent and independent.
Using proper exploratory data analysis techniques can save you from building invalid or erroneous models, building models on unsuitable data, selecting unsuitable variables, and even unoptimized development of the model. EDA can be helpful in preparing data for the model as well (this is typically called data cleaning).
EDA can also save you from making bad predictions by revealing that your data set is not adequate for the task at hand. This means you’ll need to go back to the data collection stage.
To summarize, the main goal of EDA is to ensure that the dataset aligns with the model that you’re building. The rest of this article will explain a few methods you can use to perform exploratory data analyses.
Methods to Perform EDA
There are many methods that can be used for exploratory data analysis.
Note that as with everything else, it’s often best to use multiple EDA techniques. You can then compare the results of each analysis to see if they align with each other.
This method can be used to produce a statistical summary for each separate column of the unprocessed dataset.
This method is used to determine the connection between each independent variable and the intended dependent variable.
This method can be used to understand the relationship between different columns of the dataset.
In simple terms, dimensionality reduction can be used to reduce the number of dimensions in a dataset and transform to another dimensional space that is lower than the original while still sustaining the meaningful and resourceful characteristics of the dataset.
Dimensionality reduction methods can be used to develop questions to answer during your analysis or to develop a sense of how the results of this model should be interpreted. The usual procedure to perform exploratory data analysis in your code is as follows.
Acquire a suitable dataset and import it
Understand the nature of the dataset by observing data from different rows.
Try to further understand the nature of the dataset by querying the data
NumPy is one of the most versatile libraries intended towards mathematical and numerical computations on matrices of various dimensions. It’s NumPy array data structure is similar to a Python list, but is more well-suited towards high-performance computing.
The last line of this code block will output the data as follows.
If we look at the fields within this data set, we will see that there are 8 columns (in addition to the index column).
For this exercise, you don’t need to know what each column means from the start. Just know that they are variables. You will get familiar with each specific column as you work through this tutorial.
If you want to learn the shape of the data set, you can easily run the following command to get the number of columns and rows.
Here’s the output of this command:
This means the above dataset has 156 observations and 8 features.
In addition to that, by running the following head and tail statements, you can print the first five rows and the last five rows from the dataset.
Similarly, here’s how you can print the tail of the data set:
As we noticed in the first five observations, the data set contains null values. It’s important to understand how prevalent these null values are in the rest of the dataset.
The info method is an excellent tool for learning more about missing data in a pandas DataFrame.
The output of the info method tells us that except for the date field, all the other fields have float values.
Moreover, except for in the date column, every other field has NaN values.
We have a few options on how to deal with this missing data:
go ahead with the existing table
replace the NaN values with another value
discard all the observations that have NaN values
In this scenario, let’s choose the last option and drop all the observations with NaN fields. We’ll use the pandas dropna method to do this:
Here’s what the new DataFrame looks like:
To see how many rows have been removed from our data set, we can look at the DataFrame’s shape attribute:
Here’s the output of this code:
You can tell that our data set has dropped from 156 observations originally to 119 observations now. This shows that there were 37 observations with missing data in the original data set.
You’ll also notice that the DataFrame’s indices start from 37 and have lost their consistency. You can reset the indices of the DataFrame using the reset_index method, like this:
This will reset the indices of the DataFrame. Note that we’ve passed a drop = True argument into the reset_index method because we do not require the old indices to be included in the new DataFrame as another column.
We can now use the pandas.describe() function to look at a summary of all the fields using statistics like means, percentiles, standard deviation, minimums, and maximums.
If you want to know the median of a particluar column, it’s not explicitly labelled here. With that said, the median is just another name for the 50th percentile.
You can see from this output that the difference between the 50th percentile and the 75th percentile of txCount field is much greater than the difference between the 25th percentile and the 50th percentile.
The describe method also shows similar characteristics for some of the other fields. In many columns, the mean is significantly higher than its median.
Both these observations suggest that there are outliers in the dataset.
Provided that this dataset had lots of similar values or repeated values, we could have used the unique() function to observe all the unique values from each field. In addition, the value_counts() method shows the number of times each value in a column occurs.
Take your own time and try to observe more patterns in the dataset. There could be hundreds of different patterns that are just waiting to be found!
Now that we are a bit familiar with the nature of this dataset, we can produce a correlation matrix for this dataset using the corr() function.
Correlation matrices are highly useful in summarizing massive amounts of data and to identify patterns among the different fields. They also allow us to perform diagnostics to find if certain variables are extremely correlated to each other. This indicates multicollinearity and would suggest that some types of modelling (such as linear regressions) would produce unreliable predictions.
Here’s the code to generate a correlation matrix from the DataFrame:
Here’s the output of this code:
You’ll also want to use the seaborn library we installed earlier to generate some visualizations. Let’s start by importing the library into our script:
One of my favorite seaborn visualizations is the heatmap. Here’s the code to generate a basic heatmap using seaborn that presents our correlation matrix in a visually-appealing manner:
This visualization reveals that txCount has a very strong positive correlation with activeAddresses.
Why is this?
If we consider what txCount actually is, it is a field that denotes the number of blockchain transactions that a specific user has completed. The field activeAddresses represents the number of individual addresses that have engaged in blockchain transactions with that specific user. With this in mind, it makes sense that these two fields have a high correlation.
There are many more such patterns in this data set. Try and identify more of the. It’s a great way to practice your exploratory data analysis skills!
You can also look at how the price has changed over the available dates. Let’s first see if there are any duplicate dates. As you already know, there are 119 observations. Let’s produce the array of unique values in the date field and find its length:
Since then number of unique dates is equal to the number of total dates, this shows that all dates are unique.
It also shows that the dates are in chronological order even with the NaN values dropped.
From the description we produced earlier, we know that the mean for active addresses is approximately 204.
We also observed that on some dates, there are numbers as high as 4270 and numbers as low as 16.
The charts we just created allow us to observe those outliers in a visual way.
There are many more patterns in this data set that are best observed visually. Feel free to produce more plots involving other fields that might be useful in inferring important information in this exploratory data analysis step.
Exploratory data analysis is used to summarize and analyze data using techniques like tabulation and visualization.
Using EDA in statistical modeling can be extremely useful in determining the best-suited model for the problem at hand. Moreover, it also helps in identifying unseen patterns in a dataset.
This tutorial provided a broad overview of exploratory data analysis techniques. Please feel free to use it as a reference when approaching new data science problems in the future.