In this brief tutorial, you will learn how to configure Python on your local computer so that you can build machine learning models throughout the rest of this course.
You will also be introduced to scikit-learn, which is the Python library that we will be using to build machine learning models through the rest of this course.
Table of Contents
You can skip to a specific section of this Python machine learning tutorial using the table of contents below:
- How to Install and Use Jupyter Notebooks
- How to Install scikit-learn
- Introduction to scikit-learn
- Final Thoughts
How to Install and Use Jupyter Notebooks
A Jupyter Notebook is a file (and corresponding application) that provides an interactive environment for writing and executing Python code. The Jupyter Notebook is arguably the most popular environment used by machine learning engineers.
The easiest way to install the Jupyter Notebook application is by downloading the Anaconda distribution of Python. Please follow the instructions in the following tutorial to do so:
If you've never worked with a Jupyter notebook before, you'll need to learn how to operate in this environment. The following tutorial will be useful for you:
If you're an experienced Python developer, please note that you do not necessarily need to work in a Jupyter Notebook to be successful in this course. However, all of the screenshots, examples, and practice problems will assume that you're working from a Jupyter Notebook. Keep this in mind before proceeding with a different Python editor or programming environment.
How to Install scikit-learn
scikit-learn is the Python library that we will be using to build machine learning models in this course. Accordingly, you'll need to install the scikit-learn library on your computer before proceeding.
If you installed Python using the Anaconda distribution, then scikit-learn will already be installed. If not, you can install scikit-learn by running the following command from the command line:
pip install scikit-learn
If you're working with the Anaconda distribution but scikit-learn isn't installed for some reason, you can install it by running the following command from the command line:
conda install scikit-learn
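To confirm that the installation worked, you can check for the package from Python. The snippet below is a minimal sketch that assumes Python 3.8 or newer; installed_version is a hypothetical helper name, not part of scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    # installed_version is a hypothetical helper written for this example.
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "scikit-learn" is the name used by pip and conda;
# the module you import in your code is called "sklearn".
sklearn_version = installed_version("scikit-learn")
if sklearn_version is None:
    print("scikit-learn is not installed in this environment")
else:
    print(f"scikit-learn {sklearn_version} is installed")
```

If the first message is printed, run one of the install commands above and try again.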
Introduction to scikit-learn
To conclude this tutorial, I wanted to provide a brief introduction to the scikit-learn library in Python.
Earlier in this course, you learned that building machine learning models generally follows this recipe:
- Data acquisition
- Data cleaning
- Splitting the data set into training data, validation data, and test data
- Training the model on the training data
- Validating and tweaking the model using the validation data
- Testing the model's final performance using the test data
scikit-learn provides tools for each step of this process. We will explore each of these tools quickly in this section.
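As a rough illustration of the splitting step in the recipe above, one possible approach is to call train_test_split twice: once to hold out the test data, and once more to carve validation data out of what remains. The data set, the 60/20/20 proportions, and the variable names below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 observations with 2 features each, plus 20 labels.
x = np.arange(40).reshape((20, 2))
y = np.arange(20)

# First, hold out 20% of the data as the final test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Then split the remaining 80% into training (75%) and validation (25%).
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(len(x_train), len(x_val), len(x_test))  # 12 4 4
```

Every observation ends up in exactly one of the three sets, so the test data stays unseen until the final evaluation.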
Before proceeding, please note that this tutorial is intended to be nothing but a quick introduction. Don't worry about understanding every concept introduced in this tutorial, because we'll be learning about each step in much more detail later.
First, let's discuss how we import models from scikit-learn. Every algorithm is exposed in scikit-learn using something called an estimator. In scikit-learn, an estimator is any object that learns from data. A scikit-learn estimator usually falls into one of three categories: classification, regression, or clustering.
The first step in using an estimator is importing its model class. The generalized Python command for importing a model is:
from sklearn.family import Model
where:
- family is the model family that the model you're importing is from
- Model is the name of the specific model you're importing.
As an example, the LinearRegression model is part of the linear_model family. Here is the command you would use to import this model into your Python script:
from sklearn.linear_model import LinearRegression
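The same import pattern applies to other families. The sketch below imports one model for each of the three estimator categories mentioned earlier and checks the category using the mixin classes in sklearn.base. The choice of LogisticRegression and KMeans is only for illustration.

```python
from sklearn.base import ClassifierMixin, ClusterMixin, RegressorMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Despite its name, LogisticRegression is a classification model.
examples = {
    "regression": (LinearRegression, RegressorMixin),
    "classification": (LogisticRegression, ClassifierMixin),
    "clustering": (KMeans, ClusterMixin),
}

for category, (model_class, mixin) in examples.items():
    # issubclass confirms which category each estimator belongs to.
    print(category, model_class.__name__, issubclass(model_class, mixin))
```

Note that a single family can contain models from more than one category, as linear_model does here.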
Next, you need to create an instance of the estimator and pass in any parameters you want to set. You can use Shift + Tab in the Jupyter Notebook to display the arguments that a specific model accepts.
As an example, here are some of the arguments accepted by the LinearRegression model. All of them are optional and have default values:
LinearRegression(copy_X = True, fit_intercept = True)
Most commonly, we'll create an instance of the LinearRegression object and assign it to a variable named model:
model = LinearRegression(copy_X = True, fit_intercept = True)
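If you are not working in a Jupyter Notebook, one way to see an estimator's arguments is its get_params() method. This is a small sketch; the exact set of parameters printed depends on which version of scikit-learn you have installed.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

# get_params() returns a dict mapping each constructor argument to its value.
params = model.get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value!r}")
```

Arguments you did not set explicitly appear with their default values.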
It is now time to fit this model on some training data! Remember that it is important to split our data set into both training data and test data. Let's see how to do this.
First, let's generate a fake data set:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
x = np.arange(10).reshape((5,2))
y = range(5)
After running this code, here is the value assigned to x:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
Similarly, here is the value assigned to y:
range(0, 5)
If you're wondering what this data means, let's explain. The x variable holds the observations in the data set: 5 observations, each with 2 features. The y variable contains the labels of the data set (the values that our model will attempt to predict), one for each observation.
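You can check this layout yourself through the shape attribute of the array. The variable names n_observations and n_features below are chosen for this sketch.

```python
import numpy as np

x = np.arange(10).reshape((5, 2))
y = range(5)

# The shape of x is (number of observations, number of features).
n_observations, n_features = x.shape
print(n_observations, n_features)  # 5 2

# Each row of x must have exactly one matching label in y.
print(len(y) == n_observations)  # True
```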
Now we'll split the data we generated into training data and test data using the train_test_split function contained in scikit-learn.
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.4)
The test_size parameter of the train_test_split function is important. When given as a decimal, it ranges from 0.0 to 1.0 and represents the proportion of the data set to include in the test data.
After running this code, here's what's contained in x_training_data (train_test_split shuffles the data before splitting, so your values may differ):
array([[0, 1],
       [2, 3],
       [8, 9]])
Similarly, here are the other three data sets:
x_test_data
array([[6, 7],
       [4, 5]])
y_training_data
[0, 1, 4]
y_test_data
[3, 2]
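Because train_test_split shuffles the data before splitting, the four data sets above will usually look different each time the code runs. If you need a repeatable split, you can pass the random_state argument. In the sketch below, the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5, 2))
y = range(5)

# random_state fixes the shuffle, so both calls return the same split.
first = train_test_split(x, y, test_size=0.4, random_state=42)
second = train_test_split(x, y, test_size=0.4, random_state=42)

# With 5 observations and test_size=0.4, 2 go to test and 3 to training.
x_train, x_test, y_train, y_test = first
print(len(x_train), len(x_test))  # 3 2

same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```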
Let's move on to actually fitting our model to our training data. This is done by calling the model.fit() method and passing in the training data.
Here's the code to do this:
model.fit(x_training_data, y_training_data)
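After fitting, the values the model learned are stored as attributes whose names end in an underscore. The sketch below writes out the training data shown earlier by hand so that it can run on its own; with a different random split the learned values would differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The training split shown earlier, written out by hand.
x_training_data = np.array([[0, 1], [2, 3], [8, 9]])
y_training_data = [0, 1, 4]

model = LinearRegression()
fitted = model.fit(x_training_data, y_training_data)

# fit() returns the estimator itself, so calls can be chained.
print(fitted is model)

# Learned parameters carry a trailing underscore by convention.
print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the constant term
```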
Our model has been trained and we can now use it to make predictions on our test data. We do this using the predict() method, passing in our x_test_data as the only argument.
model.predict(x_test_data)
Here's what this code returns:
array([3., 2.])
You can then compare these predicted values to the actual values in the data set to assess the performance of your model.
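One common way to make this comparison for a regression model is the mean squared error, available as mean_squared_error in sklearn.metrics. The sketch below reuses the split shown earlier, written out by hand, and is only one of several possible ways to measure performance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# The split shown earlier, written out by hand.
x_training_data = np.array([[0, 1], [2, 3], [8, 9]])
y_training_data = [0, 1, 4]
x_test_data = np.array([[6, 7], [4, 5]])
y_test_data = [3, 2]

model = LinearRegression().fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

# Mean squared error: the average squared gap between actual and predicted.
mse = mean_squared_error(y_test_data, predictions)
print(mse)
```

The error is essentially zero here only because the made-up labels are an exact linear function of the features. Real data sets will not fit this neatly.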
Final Thoughts
In this tutorial, we quickly discussed the tooling required for you to proceed through this course. You also had your first brief introduction to the Python library scikit-learn, which we will be using to build machine learning models through the rest of this course.
Here is a brief summary of what you learned in this lesson:
- How to download and run Jupyter Notebooks
- How to install scikit-learn (and why you don't need to if you installed Python using the Anaconda distribution)
- A brief summary of how machine learning models are built using the scikit-learn package
Please note that you need not understand every detail from this overview, since we will be revisiting every step in more detail later.