Machine Learning

Introduction

For this toolbox exercise you will learn how to teach your computer to learn! Machine learning is a field that sits at the intersection of statistics, data mining, and artificial intelligence.

Tom Mitchell defines what it means for a computer program to learn in the following way:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

This definition highlights a key difference between machine learning and classical statistical methods: machine learning is chiefly concerned with improving future performance based on prior experience. Another key difference is that machine learning focuses on the computational efficiency (in both time and space) of algorithms. For instance, an active area of machine learning research is designing algorithms whose time and space requirements scale to "big data".

This toolbox will familiarize you with doing basic supervised machine learning. If you want more detailed reading on the subject, check out “A few useful things to know about machine learning”.

Get set

GitHub Classroom invite

For this toolbox we will be learning how to do machine learning using the very powerful scikit-learn Python module. I really like this library for several reasons:

  1. It has a really clean API that lets you try many different machine learning algorithms with minimal changes to your code.
  2. It has a lot of great built-in algorithms / techniques.
  3. It automates the process of doing fair evaluations of a machine learning algorithm (this is the part that people often get wrong).

To install scikit-learn and related dependencies, execute the following command:

$ conda install matplotlib scikit-learn scipy

Classification Using Scikit-Learn

There are many different problem settings in machine learning. One of the most common is supervised classification. In the classification problem setting, the goal is to categorize a piece of data into one of several categories (or classes). The input data could be an image, a piece of text, or an audio clip; basically anything that can be described numerically can serve as the input to a classification algorithm. Here, we will investigate supervised classification, in which the computer learns how to classify new pieces of data by being shown a set of examples consisting of input data and their corresponding categories (or classes).

To make things more concrete, let’s look at one of the most well-studied classification problems: recognizing images of handwritten digits.

To load the digits and display 10 of the examples, run the display_digits() function in the starter code.
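If you are curious what such a function might look like, here is a minimal sketch (the actual starter code may differ; this version assumes matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

def display_digits(n=10):
    """Show the first n digit images along with their labels (illustrative sketch)."""
    data = load_digits()
    fig, axes = plt.subplots(1, n, figsize=(n, 1.5))
    for ax, image, label in zip(axes, data.images, data.target):
        ax.imshow(image, cmap='gray_r')  # each image is an 8x8 grayscale array
        ax.set_title(label)
        ax.axis('off')
    plt.show()

display_digits()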

The digit database built into scikit-learn has a total of 1797 examples (there are many databases that are much bigger; the most famous is the MNIST database of handwritten digits). Each digit is a grayscale 8x8 image. Our goal will be to train the computer to categorize 8x8 grayscale images as one of the digits 0 through 9 by leveraging this database.
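You can verify these numbers yourself; scikit-learn stores both the original 8x8 images and a flattened 64-dimensional version of each one:

from sklearn.datasets import load_digits

data = load_digits()
print(data.images.shape)  # (1797, 8, 8): 1797 grayscale 8x8 images
print(data.data.shape)    # (1797, 64): the same images flattened into 64-element vectors
print(data.target.shape)  # (1797,): the class label (0 through 9) for each image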

To get started, we will use a very simple classification algorithm called multinomial logistic regression. The basic idea is to partition our data into two sets. The first set, known as the training set, will be used to train our classifier (that is, we will use it to infer the relationship between the appearance of an 8x8 patch of pixels and the digit it represents). The second set, known as the testing set, will be used to evaluate our classifier on data it has not been trained on. We need two separate sets because performance measured on the training data itself can be misleadingly high: the classifier may be overfitting to the idiosyncrasies of that particular training set. Here is some code for splitting the data into two sets, training on one, testing on the other, and then reporting the classification accuracy on the testing set.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
# Use half of the data for training and hold out the other half for testing.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    train_size=0.5)
model = LogisticRegression(C=10**-10)
model.fit(X_train, y_train)
print("Train accuracy %f" % model.score(X_train, y_train))
print("Test accuracy %f" % model.score(X_test, y_test))

Learning Curves

Next, you will explore how the amount of training data influences the performance of the learned model. In the previous example we used 50% of the data for the training set and 50% for the testing set (we set train_size=0.5 when calling train_test_split). For this toolbox you will write code to systematically vary the training set size relative to the testing set size and plot the resulting learning curve. You will repeat each value of train_size 10 times to smooth out variability due to the random split. We have given you starter code for this in the train_model function in machine_learning/learning_curve.py.
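If you are unsure how to structure the experiment, here is one rough sketch of the kind of loop you might write (the variable names and plotting details here are illustrative, not the actual starter code; follow the structure of train_model in learning_curve.py):

import matplotlib.pyplot as plt
import numpy
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
num_trials = 10
train_percentages = range(5, 95, 5)
test_accuracies = []

for train_percent in train_percentages:
    scores = []
    for _ in range(num_trials):
        X_train, X_test, y_train, y_test = train_test_split(
            data.data, data.target, train_size=train_percent / 100)
        model = LogisticRegression(C=10**-10)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    # Average over the trials to smooth out variability from the random split.
    test_accuracies.append(numpy.mean(scores))

plt.plot(train_percentages, test_accuracies)
plt.xlabel('Percentage of Data Used for Training')
plt.ylabel('Accuracy on Test Set')
plt.show()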

Once you have produced this curve, answer the following questions and place them in a text file called questions.txt in your machine_learning folder:

  1. What is the general trend in the curve?
  2. Are there parts of the curve that appear to be noisier than others? Why?
  3. How many trials do you need to get a smooth curve?
  4. Try different values for C (by changing LogisticRegression(C=10**-10)); a sketch of one way to sweep C appears after this list. What happens? If you want to know why this happens, see this Wikipedia page as well as the documentation for LogisticRegression in scikit-learn.
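To get a feel for question 4, one quick way to compare several values of C is a loop like the following (a sketch, not part of the starter code; in scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so smaller C means stronger regularization):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    train_size=0.5)

for C in [10**-10, 10**-5, 10**-2, 1, 10**2]:
    # max_iter is raised to help convergence for the weakly regularized models.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    print("C = %g: test accuracy %f" % (C, model.score(X_test, y_test)))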

Turning in Your Toolbox Exercise

To turn in your toolbox, push your changes to learning_curve.py and your writeup (questions.txt) to your GitHub repo.

Further Explorations

If you want to explore more of the features of scikit-learn and of machine learning in general, I recommend you take a look at this Jupyter notebook. If you are still interested, please contact the teaching team… we will be happy to give more pointers.