In this brief tutorial, you will learn how to configure Python on your local computer so that you can build machine learning models throughout the rest of this course.
You will also be introduced to
scikit-learn, which is the Python library that we will be using to build machine learning models through the rest of this course.
You can skip to a specific section of this Python machine learning tutorial using the table of contents below:
- How to Install and Use Jupyter Notebooks
- How to Install scikit-learn
- Introduction to scikit-learn
- Final Thoughts
A Jupyter Notebook is a file (and corresponding application) that provides a nice environment for you to write and execute Python code. The Jupyter Notebook is arguably the most popular environment used by machine learning engineers.
The easiest way to install the Jupyter Notebook application is by downloading the Anaconda distribution of Python. Please follow the instructions in the following tutorial to do so:
If you've never worked with a Jupyter notebook before, you'll need to learn how to operate in this environment. The following tutorial will be useful for you:
If you're an experienced Python developer, please note that you do not necessarily need to work in a Jupyter Notebook to be successful in this course. However, all of the screenshots, examples, and practice problems will assume that you're working from a Jupyter Notebook. Keep this in mind before proceeding with a different Python editor or programming environment.
scikit-learn is the Python library that we will be using to build machine learning models in this course. Accordingly, you'll need to install the
scikit-learn library on your computer before proceeding.
If you installed Python using the Anaconda distribution, then scikit-learn will already be installed. If not, you can install scikit-learn by running the following command from the command line:
pip install scikit-learn
If you're working with the Anaconda distribution but scikit-learn isn't installed for some reason, you can install it by running the following statement from the command line:
conda install scikit-learn
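To confirm that the installation worked, one quick check is to import the package and print its version. This is only a minimal sketch: the variable name installed_version is illustrative, and the version printed depends on your environment.

```python
# The package is installed as "scikit-learn" but imported as "sklearn"
import sklearn

# Every release exposes its version string through __version__
installed_version = sklearn.__version__
print(installed_version)
```

If the import fails with a ModuleNotFoundError, the library is not installed in the Python environment that your notebook is using.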
To conclude this tutorial, I wanted to provide a brief introduction to the
scikit-learn library in Python.
Earlier in this course, you learned that building machine learning models generally follows this recipe:
- Data acquisition
- Data cleaning
- Splitting the data set into training data, validation data, and test data
- Training the model on the training data
- Validating and tweaking the model using the validation data
- Testing the model's final performance using the test data
scikit-learn provides tools for each step of this process. We will explore each of these tools quickly in this section.
Before proceeding, please note that this tutorial is intended to be nothing but a quick introduction. Don't worry about understanding every concept introduced in this tutorial, because we'll be learning about each step in much more detail later.
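The recipe above can be sketched end to end in a few lines. This is a rough illustration rather than the workflow used later in the course: the synthetic data, the 60/20/20 split, the random_state values, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Data acquisition: synthetic data stands in for a real data set (assumption)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
labels = features @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Data cleaning: nothing to clean in synthetic data, so this step is skipped

# Splitting: hold out 20% as test data, then 25% of the remainder as validation data
x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

# Training: fit the model on the training data only
model = LinearRegression()
model.fit(x_train, y_train)

# Validating: score() returns R^2 for regressors; use it to compare model choices
validation_score = model.score(x_val, y_val)

# Testing: evaluate once on the untouched test data
test_score = model.score(x_test, y_test)

print(len(x_train), len(x_val), len(x_test))
print(validation_score, test_score)
```

Calling train_test_split twice is one common way to obtain three subsets, since the function itself only produces a two-way split.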
First, let's discuss how we import models from scikit-learn. Every algorithm is exposed in scikit-learn using something called an estimator. An estimator is any object that learns from data. A scikit-learn estimator usually falls into one of three categories: classification, regression, or clustering.
The first step in using an estimator is importing the model. The general form of the import statement is:
from sklearn.family import Model
- family is the model family that contains the model you're importing
- Model is the name of the specific model you're importing
As an example, the
LinearRegression model is part of the
linear_model family. Here is the command you would use to import this model into your Python script:
from sklearn.linear_model import LinearRegression
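The same import pattern applies to the other estimator categories. The sketch below imports one estimator each for classification, regression, and clustering and fits them on toy data. The data values and variable names are made up for illustration, and LogisticRegression and KMeans are simply convenient examples of their categories.

```python
import numpy as np
from sklearn.cluster import KMeans                   # clustering
from sklearn.linear_model import LinearRegression    # regression
from sklearn.linear_model import LogisticRegression  # classification

# Toy data (hypothetical): two well-separated groups along a single feature
features = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
class_labels = np.array([0, 0, 0, 1, 1, 1])
numeric_targets = np.array([0.0, 2.0, 4.0, 20.0, 22.0, 24.0])

# Every estimator learns from data through the same fit() method
classifier = LogisticRegression().fit(features, class_labels)
regressor = LinearRegression().fit(features, numeric_targets)
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

print(classifier.predict([[1.5], [10.5]]))  # predicted class for each new point
print(regressor.predict([[5.0]]))           # predicted numeric value
print(clusterer.labels_)                    # cluster assigned to each training row
```

Note that clustering takes no labels at all: KMeans groups the rows using only the features.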
Next, you need to create an instance of the estimator and pass in any parameters you want to set. You can use Shift + Tab in the Jupyter Notebook to display the arguments that a specific model accepts.
As an example, here are some of the arguments accepted by the LinearRegression model, shown with their default values:
LinearRegression(copy_X = True, fit_intercept = True)
Older scikit-learn releases also accepted a normalize argument. It was removed in version 1.2, so it is omitted here.
Most commonly, we'll create an instance of the LinearRegression object and assign it to a variable named model:
model = LinearRegression(copy_X = True, fit_intercept = True)
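Outside of a Jupyter Notebook, one way to inspect an estimator's arguments is its get_params() method, which returns the constructor arguments and their current values as a dictionary. A minimal sketch follows; the exact set of parameters printed depends on your scikit-learn version.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(copy_X=True, fit_intercept=True)

# get_params() maps each constructor argument name to its current value
params = model.get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value}")
```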
It is now time to fit this model on some training data! Remember that it is important to split our data set into both training data and test data. Let's see how to do this.
First, let's generate a fake data set:
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
x = np.arange(10).reshape((5,2))
y = range(5)
After running this code, here is the value assigned to x:
array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
Similarly, here is the value assigned to y:
range(0, 5)
Here is what this data means. The x variable holds the actual observations from the data set: 5 observations, each with 2 features. The y variable contains the labels of the data set, which our model will attempt to predict.
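A quick way to confirm this layout is to inspect the shape of x, where rows are observations and columns are features. The names n_observations and n_features below are illustrative only.

```python
import numpy as np

x = np.arange(10).reshape((5, 2))
y = range(5)

# shape is (rows, columns): one row per observation, one column per feature
n_observations, n_features = x.shape
print(n_observations, n_features)  # 5 2

# One label per observation
print(list(y))  # [0, 1, 2, 3, 4]
```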
Now we'll split the data we generated into training data and test data using the train_test_split function contained in scikit-learn's model_selection module:
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.4)
The test_size parameter of the train_test_split function is important. When given as a decimal, it ranges from 0.0 to 1.0 and represents the proportion of the data set to include in the test data.
After running this code, here's what's contained in x_training_data (the split is random, so your values may differ):
array([[0, 1], [2, 3], [8, 9]])
Similarly, here are the other three data sets:
x_test_data
array([[6, 7], [4, 5]])
y_training_data
[0, 1, 4]
y_test_data
[3, 2]
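Because train_test_split shuffles the rows before splitting, the exact rows you get will usually differ from run to run. One way to make the split repeatable is the function's random_state parameter. The sketch below assumes an arbitrary seed of 42 and stores the labels in a list.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5, 2))
y = list(range(5))

# The same random_state produces the same shuffle, so both calls agree
first_split = train_test_split(x, y, test_size=0.4, random_state=42)
second_split = train_test_split(x, y, test_size=0.4, random_state=42)

x_training_data, x_test_data, y_training_data, y_test_data = first_split

# With 5 observations and test_size=0.4, 2 rows go to test and 3 to training
print(x_training_data.shape, x_test_data.shape)
```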
Let's move on to actually fitting our model to our training data. This is done by calling the model.fit() method and passing in the training data.
Here's the code to do this:
model.fit(x_training_data, y_training_data)
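Fitting stores what the model learned on the estimator itself, in attributes whose names end with an underscore. The sketch below writes out the training split shown earlier explicitly so that it runs on its own; the name fitted_model is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The training split shown earlier in this tutorial, written out explicitly
x_training_data = np.array([[0, 1], [2, 3], [8, 9]])
y_training_data = [0, 1, 4]

model = LinearRegression()
fitted_model = model.fit(x_training_data, y_training_data)

# fit() returns the estimator itself, so calls can be chained
print(fitted_model is model)

# Learned parameters: one coefficient per feature, plus the intercept
print(model.coef_)
print(model.intercept_)
```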
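Our model has been trained and we can now use it to make predictions on our data set. We do this using the predict() method, passing in our x_test_data as the only parameter:
model.predict(x_test_data)
For the split shown above, this code returns the following (up to floating-point rounding), because the labels in this fake data set are an exact linear function of the features:
array([3., 2.])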
You can then compare these predicted values to the actual values in the data set to assess the performance of your model.
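One common way to make this comparison is an error metric such as mean squared error, available as mean_squared_error in sklearn.metrics. The sketch below reuses the split shown earlier in this tutorial, written out explicitly so that it runs on its own. The choice of metric is an assumption for illustration, not the only option.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# The split shown earlier in this tutorial, written out explicitly
x_training_data = np.array([[0, 1], [2, 3], [8, 9]])
y_training_data = [0, 1, 4]
x_test_data = np.array([[6, 7], [4, 5]])
y_test_data = [3, 2]

model = LinearRegression()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

# Mean squared error: average of the squared differences between
# the actual labels and the predicted values (lower is better)
test_error = mean_squared_error(y_test_data, predictions)
print(predictions)
print(test_error)
```

The error here is essentially zero only because the fake labels follow the features exactly. Real data sets will produce a nonzero error.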
In this tutorial, we quickly discussed the tooling required for you to proceed through this course. You also had your first brief introduction to the Python library
scikit-learn, which we will be using to build machine learning models through the rest of this course.
Here is a brief summary of what you learned in this lesson:
- How to download and run Jupyter Notebooks
- How to install scikit-learn (and why you don't need to if you installed Python using the Anaconda distribution)
- A brief summary of how machine learning models are built using the scikit-learn package. Please note that you need not understand every detail from this overview, since we will be revisiting every step in more detail later.