{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 1: Why is Machine Learning difficult?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview \n", "\n", "In this notebook, we will get our hands dirty trying to gain intuition about why machine learning is difficult. \n", "\n", "Our task is going to be a simple one, fitting data with polynomials of different order. Formally, this goes under the name of polynomial regression. Here we will do a series of exercises that are intended to give the reader intuition about the major challenges that any machine learning algorithm faces.\n", "\n", "## Learning Goal\n", "\n", "We will explore how our ability to predict depends on the number of data points we have, the \"noise\" in the data, and our knowledge about relevant features. The goal is to build intuition about why prediction is difficult and discuss general strategies for overcoming these difficulties.\n", "\n", "\n", "## The Prediction Problem\n", "\n", "Consider a probabilistic process that gives rise to labeled data $(x,y)$. The data is generated by drawing samples from the equation\n", "$$\n", " y_i= f(x_i) + \\eta_i,\n", "$$\n", "where $f(x_i)$ is some fixed, but (possibly unknown) function, and $\\eta_i$ is a Gaussian, uncorrelate noise variable such that\n", "$$\n", "\\langle \\eta_i \\rangle=0 \\\\\n", "\\langle \\eta_i \\eta_j \\rangle = \\delta_{ij} \\sigma\n", "$$\n", "We will refer to the $f(x_i)$ as the **true features** used to generate the data. \n", "\n", "To make prediction, we will consider a family of functions $g_\\alpha(x;\\theta_\\alpha)$ that depend on some parameters $\\theta_\\alpha$. These functions respresent the **model class** that we are using to try to model the data and make predictions. The $g_\\alpha(x;\\theta_\\alpha)$ encode the class of **features** we are using to represent the data.\n", "\n", "To learn the parameters $\\boldsymbol{\\theta}$, we will train our models on a **training data set** and then test the effectiveness of the model on a different dataset, the **test data set**. The reason we must divide our data into a training and test dataset is that the point of machine learning is to make accurate predictions about new data we have not seen. As we will see below, models that give the best fit to the training data do not necessarily make the best predictions on the test data. This will be a running theme that we will encounter repeatedly in machine learning. \n", "\n", "\n", "For the remainder of the notebook, we will focus on polynomial regression. Our task is to model the data with polynomials and make predictions about the new data that we have not seen.\n", "We will consider two qualitatively distinct situations: \n", "\n", "\n", "In the exercises and discussion we consider 3 model classes:\n", "\n", "\n", "To measure our ability to predict, we will learn our parameters by fitting our training dataset and then making predictions on our test data set. One common measure of predictive performance of our algorithm is to compare the predictions,$\\{y_j^\\mathrm{pred}\\}$, to the true values $\\{y_j\\}$. A commonly employed measure for this is the sum of the mean square-error (MSE) on the test set:\n", "$$\n", "MSE= \\frac{1}{N_\\mathrm{test}}\\sum_{j=1}^{N_\\mathrm{test}} (y_j^\\mathrm{pred}-y_j)^2\n", "$$\n", "We will return to this in later notebooks. For now, we will try to get a qualitative picture by examining plots on test and training data.\n", "\n", "## Fitting vs. predicting when the data is in the model class\n", "\n", "\n", "We start by considering the case:\n", "$$\n", "f(x)=2x.\n", "$$\n", "Then the data is clearly generated by a model that is contained within all three model classes we are using to make predictions (linear models, third order polynomials, and tenth order polynomials). \n", "\n", "\n", "Run the code for the following cases:\n", "\n", "\n", "\n", "## Fitting vs. predicting when the data is not in the model class\n", "Thus far, we have considered the case where the data is generated using a model contained in the model class. Now consider $f(x)=2x-10x^5+15x^{10}$. *Notice that the for linear and third-order polynomial the true model $f(x)$ is not contained in model class $g_\\alpha(x)$* .\n", "\n", "