{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 1: Why is Machine Learning difficult?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview \n", "\n", "In this notebook, we will get our hands dirty trying to gain intuition about why machine learning is difficult. \n", "\n", "Our task is going to be a simple one, fitting data with polynomials of different order. Formally, this goes under the name of polynomial regression. Here we will do a series of exercises that are intended to give the reader intuition about the major challenges that any machine learning algorithm faces.\n", "\n", "## Learning Goal\n", "\n", "We will explore how our ability to predict depends on the number of data points we have, the \"noise\" in the data, and our knowledge about relevant features. The goal is to build intuition about why prediction is difficult and discuss general strategies for overcoming these difficulties.\n", "\n", "\n", "## The Prediction Problem\n", "\n", "Consider a probabilistic process that gives rise to labeled data $(x,y)$. The data is generated by drawing samples from the equation\n", "$$\n", " y_i= f(x_i) + \\eta_i,\n", "$$\n", "where $f(x_i)$ is some fixed, but (possibly unknown) function, and $\\eta_i$ is a Gaussian, uncorrelate noise variable such that\n", "$$\n", "\\langle \\eta_i \\rangle=0 \\\\\n", "\\langle \\eta_i \\eta_j \\rangle = \\delta_{ij} \\sigma\n", "$$\n", "We will refer to the $f(x_i)$ as the **true features** used to generate the data. \n", "\n", "To make prediction, we will consider a family of functions $g_\\alpha(x;\\theta_\\alpha)$ that depend on some parameters $\\theta_\\alpha$. These functions respresent the **model class** that we are using to try to model the data and make predictions. The $g_\\alpha(x;\\theta_\\alpha)$ encode the class of **features** we are using to represent the data.\n", "\n", "To learn the parameters $\\boldsymbol{\\theta}$, we will train our models on a **training data set** and then test the effectiveness of the model on a different dataset, the **test data set**. The reason we must divide our data into a training and test dataset is that the point of machine learning is to make accurate predictions about new data we have not seen. As we will see below, models that give the best fit to the training data do not necessarily make the best predictions on the test data. This will be a running theme that we will encounter repeatedly in machine learning. \n", "\n", "\n", "For the remainder of the notebook, we will focus on polynomial regression. Our task is to model the data with polynomials and make predictions about the new data that we have not seen.\n", "We will consider two qualitatively distinct situations: \n", "
\n", "
• In the first case, the process that generates the underlying data is in the model class we are using to make predictions. For polynomial regression, this means that the functions $f(x_i)$ are themselves polynomials.\n", "
• In the second case, our data lies outside our model class. In the case of polynomial regression, this could correspond to the case where the $f(x_i)$ is a 10-th order polynomial but $g_\\alpha(x;\\theta_\\alpha)$ are polynomials of order 1 or 3.\n", "
\n", "\n", "In the exercises and discussion we consider 3 model classes:\n", "
\n", "
• the case where the $g_\\alpha(x;\\theta_\\alpha)$ are all polynomials up to order 1 (linear models),\n", "
• the case where the $g_\\alpha(x;\\theta_\\alpha)$ are all polynomials up to order 3,\n", "
• the case where the $g_\\alpha(x;\\theta_\\alpha)$ are all polynomials up to order 10.\n", "
\n", "\n", "To measure our ability to predict, we will learn our parameters by fitting our training dataset and then making predictions on our test data set. One common measure of predictive performance of our algorithm is to compare the predictions,$\\{y_j^\\mathrm{pred}\\}$, to the true values $\\{y_j\\}$. A commonly employed measure for this is the sum of the mean square-error (MSE) on the test set:\n", "$$\n", "MSE= \\frac{1}{N_\\mathrm{test}}\\sum_{j=1}^{N_\\mathrm{test}} (y_j^\\mathrm{pred}-y_j)^2\n", "$$\n", "We will return to this in later notebooks. For now, we will try to get a qualitative picture by examining plots on test and training data.\n", "\n", "## Fitting vs. predicting when the data is in the model class\n", "\n", "\n", "We start by considering the case:\n", "$$\n", "f(x)=2x.\n", "$$\n", "Then the data is clearly generated by a model that is contained within all three model classes we are using to make predictions (linear models, third order polynomials, and tenth order polynomials). \n", "\n", "\n", "Run the code for the following cases:\n", "
\n", "
• For $f(x)=2x$, $N_{\\mathrm{train}}=10$ and $\\sigma=0$ (noiseless case), train the three classes of models (linear, third-order polynomial, and tenth order polynomial) for a training set when $x_i \\in [0,1]$. Make graphs comparing fits for different order of polynomials. Which model fits the data the best?\n", "
• Do you think that the data that has the least error on the training set will also make the best predictions? Why or why not? Can you try to discuss and formalize your intuition? What can go right and what can go wrong?\n", "
• Check your answer by seeing how well your fits predict newly generated test data (including on data outside the range you fit on, for example $x \\in [0,1.2]$) using the code below. How well do you do on points in the range of $x$ where you trained the model? How about points outside the original training data set? \n", "
• Repeat the exercises above for $f(x)=2x$, $N_{\\mathrm{train}}=10$, and $\\sigma=1$. What changes?\n", "
• Repeat the exercises above for $f(x)=2x$, $N_{\\mathrm{train}}=100$, and $\\sigma=1$. What changes?\n", "
• Summarize what you have learned about the relationship between model complexity (number of parameters), goodness of fit on training data, and the ability to predict well.\n", "
\n", "\n", "\n", "## Fitting vs. predicting when the data is not in the model class\n", "Thus far, we have considered the case where the data is generated using a model contained in the model class. Now consider $f(x)=2x-10x^5+15x^{10}$. *Notice that the for linear and third-order polynomial the true model $f(x)$ is not contained in model class $g_\\alpha(x)$* .\n", "\n", "
\n", "
• Repeat the exercises above fitting and predicting for $f(x)=2x-10x^5+15x^{10}$ for $N_{\\mathrm{train}}=10,100$ and $\\sigma=0,1$. Record your observations.\n", "
• Do better fits lead to better predictions?\n", "
• What is the relationship between the true model for generating the data and the model class that has the most predictive power? How is this related to the model complexity? How does this depend on the number of data points $N_{\\mathrm{train}}$ and $\\sigma$?\n", "