The first type of machine learning that you will encounter in this course is supervised machine learning.
This lesson serves as a broad overview of supervised machine learning. Later on in this course, we’ll dive into the technical applications of this machine learning method in Python.
Introduction to Supervised Machine Learning
Supervised machine learning algorithms are algorithms that we train using labelled examples.
This means that the data set contains both input variables and output variables, and we know each output variable’s value in advance.
Here are a few examples of problems where every example carries a known label:
- Spam emails vs. legitimate emails
- Photos that contain cats vs. photos that do not contain cats
You can use supervised machine learning techniques to solve these problems because each example in the data pairs an input (such as an email) with a known output (such as whether that email is spam). In other problem types, the output is not known in advance. You’ll learn more about this later.
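To make the idea of input-output pairs concrete, here is a minimal sketch in plain Python. The email texts and labels are made-up toy values, and the variable names are chosen only for illustration.

```python
# Hypothetical toy data: each example pairs an input (email text)
# with an output (label) that we already know.
labelled_emails = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Claim your free reward", "spam"),
    ("Lunch tomorrow?", "legitimate"),
]

# Separate the inputs from the outputs.
# Both lists stay aligned: outputs[i] is the known label for inputs[i].
inputs = [text for text, label in labelled_emails]
outputs = [label for text, label in labelled_emails]

for text, label in zip(inputs, outputs):
    print(f"{label:>10}: {text}")
```

A supervised algorithm would be trained on pairs like these so that it can later predict the label of an email it has not seen before.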
There are two broad types of supervised machine learning problems:
- Classification problems
- Regression problems
We will explore both problem types in detail in the next two lessons.
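As a rough preview, the two problem types differ in the kind of output they predict: classification predicts a category, while regression predicts a continuous number. The sketch below uses made-up toy data (the house sizes and prices are purely hypothetical) to show the difference.

```python
# Hypothetical toy data sets; all values are invented for illustration.

# Classification: the output is a category from a fixed set of labels.
classification_data = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Claim your free reward", "spam"),
]

# Regression: the output is a continuous number.
# Each pair is (house size in square feet, sale price).
regression_data = [
    (1200, 250000.0),
    (1800, 340000.0),
    (2400, 455000.0),
]

# A classification target has a small set of distinct labels...
distinct_labels = sorted({label for _, label in classification_data})
print("Possible labels:", distinct_labels)

# ...while a regression target can take any value in a numeric range.
prices = [price for _, price in regression_data]
print("Price range:", min(prices), "to", max(prices))
```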
How is Supervised Machine Learning Used?
Data scientists use supervised machine learning in applications where historical data is likely to predict future events.
Supervised machine learning is a multi-step process.
Here are the broad steps involved in building a supervised machine learning model:
- Collect your data: Gather examples that include both the input variables and the known output values.
- Clean your data: The data we collect is often in a form that is difficult to work with. Data cleaning is the process of modifying our data so that it exists in a format that our machine learning algorithm can work with. Data cleaning is surprisingly difficult and can often take longer than the original data collection process.
- Split the data: We divide the data set into at least two categories: training data and test data. You will learn more about this process later. For now, it is sufficient for you to know that the training data set is usually much larger than the test data set.
- Model training: We use the training data set from the previous step to train our machine learning algorithm.
- Model testing: We use our test data set to test the accuracy of our model. Depending on the outcome of this step, we may move back a few steps to modify our model.
This is a simplified version of the process that we’ll use to build supervised machine learning models, but having a broad understanding is useful before we examine each step in detail.
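The five steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the data is a made-up toy set, the 80/20 split ratio is an arbitrary choice, and the "model" is a simple nearest-centroid classifier swapped in for brevity, not the method used later in this course.

```python
import random

# Step 1: collect. Hypothetical toy data:
# (hours_studied, hours_slept) -> "pass" or "fail".
raw_data = [
    ((8.0, 7.0), "pass"), ((7.5, 8.0), "pass"), ((9.0, 6.5), "pass"),
    ((8.5, 7.5), "pass"), ((7.0, 7.0), "pass"),
    ((1.0, 4.0), "fail"), ((2.0, 5.0), "fail"), ((1.5, 3.5), "fail"),
    ((2.5, 4.5), "fail"), ((0.5, 5.0), "fail"),
    ((None, 6.0), "pass"),  # a row with a missing input value
]

# Step 2: clean. Here we simply drop rows with missing input values.
clean_data = [(x, y) for x, y in raw_data if None not in x]

# Step 3: split. Shuffle, then hold out 20% of the rows as test data.
shuffled = clean_data[:]
random.Random(0).shuffle(shuffled)
split_index = round(len(shuffled) * 0.8)
training_data = shuffled[:split_index]
test_data = shuffled[split_index:]

# Step 4: train. "Training" here means computing the mean point
# (centroid) of the training examples for each label.
def train(examples):
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(model, features):
    # Predict the label whose centroid is closest to the input.
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(training_data)

# Step 5: test. Measure accuracy on data the model never saw in training.
correct = sum(predict(model, x) == y for x, y in test_data)
accuracy = correct / len(test_data)
print(f"Test accuracy: {accuracy:.2f}")
```

If the test accuracy were poor, this is the point at which we would go back and revisit the cleaning or training steps.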
One important caveat is that practitioners often split the data set into more than two categories (although we will generally stick to the simple training-test split in this course). More precisely, there are generally three subsets of data:
- The Training Data: used to train the supervised machine learning algorithm initially.
- The Validation Data: used to decide which model settings to adjust to improve the model’s performance.
- The Test Data: used to calculate some conclusive performance metric about the model.
Note that you can test the model on the validation data (and make modifications afterwards) more than once. By contrast, you shouldn’t make any modifications after you run the model on the test data and calculate the model’s performance metrics, because the test result would then no longer reflect performance on unseen data.
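The three-way split can be sketched as follows. This is a minimal illustration, assuming a 60/20/20 split and a toy k-nearest-neighbours classifier whose setting `k` is tuned on the validation data. The helper names (`three_way_split`, `knn_predict`, `accuracy`) and the data are hypothetical and not part of any library.

```python
import random

def three_way_split(examples, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)
    return (shuffled[:train_end],
            shuffled[train_end:validation_end],
            shuffled[validation_end:])

def knn_predict(training_data, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(training_data, key=lambda example: abs(example[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def accuracy(training_data, evaluation_data, k):
    """Fraction of evaluation examples whose label is predicted correctly."""
    correct = sum(knn_predict(training_data, x, k) == y for x, y in evaluation_data)
    return correct / len(evaluation_data)

# Hypothetical toy data: one numeric input; the label is "high" when it is at least 50.
data = [(x, "high" if x >= 50 else "low") for x in range(0, 100, 2)]
training_data, validation_data, test_data = three_way_split(data)

# Use the validation data (as many times as needed) to choose the setting k.
candidate_ks = [1, 3, 5]
best_k = max(candidate_ks, key=lambda k: accuracy(training_data, validation_data, k))

# Use the test data once, at the very end, to report the final performance.
test_accuracy = accuracy(training_data, test_data, best_k)
print(f"Chosen k: {best_k}, test accuracy: {test_accuracy:.2f}")
```

Because `best_k` is chosen using only the validation data, the test accuracy remains an honest estimate of how the model behaves on data it has never influenced.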
In this tutorial, you had your first theoretical introduction to supervised machine learning.
Here is a broad summary of what you learned in this lesson:
- How the use of labelled input-output data is the defining characteristic of a supervised machine learning algorithm
- The types of problems that are typically tackled with supervised machine learning
- How supervised machine learning problems are broadly categorized as either classification problems or regression problems
- The broad steps required to implement a supervised machine learning algorithm
- Why supervised machine learning data sets are split into training data, validation data, and test data