{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 1: Why is Machine Learning difficult?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview \n", "\n", "In this notebook, we will get our hands dirty trying to gain intuition about why machine learning is difficult. \n", "\n", "Our task is going to be a simple one, fitting data with polynomials of different order. Formally, this goes under the name of polynomial regression. Here we will do a series of exercises that are intended to give the reader intuition about the major challenges that any machine learning algorithm faces.\n", "\n", "## Learning Goal\n", "\n", "We will explore how our ability to predict depends on the number of data points we have, the \"noise\" in the data, and our knowledge about relevant features. The goal is to build intuition about why prediction is difficult and discuss general strategies for overcoming these difficulties.\n", "\n", "\n", "## The Prediction Problem\n", "\n", "Consider a probabilistic process that gives rise to labeled data $(x,y)$. The data is generated by drawing samples from the equation\n", "$$\n", " y_i= f(x_i) + \\eta_i,\n", "$$\n", "where $f(x_i)$ is some fixed, but (possibly unknown) function, and $\\eta_i$ is a Gaussian, uncorrelate noise variable such that\n", "$$\n", "\\langle \\eta_i \\rangle=0 \\\\\n", "\\langle \\eta_i \\eta_j \\rangle = \\delta_{ij} \\sigma\n", "$$\n", "We will refer to the $f(x_i)$ as the **true features** used to generate the data. \n", "\n", "To make prediction, we will consider a family of functions $g_\\alpha(x;\\theta_\\alpha)$ that depend on some parameters $\\theta_\\alpha$. These functions respresent the **model class** that we are using to try to model the data and make predictions. 
The $g_\\alpha(x;\\theta_\\alpha)$ encode the class of **features** we are using to represent the data.\n", "\n", "To learn the parameters $\\boldsymbol{\\theta}$, we will train our models on a **training data set** and then test the effectiveness of the model on a different dataset, the **test data set**. The reason we must divide our data into a training and a test dataset is that the point of machine learning is to make accurate predictions about new data we have not seen. As we will see below, models that give the best fit to the training data do not necessarily make the best predictions on the test data. This is a running theme that we will encounter repeatedly in machine learning. \n", "\n", "\n", "For the remainder of the notebook, we will focus on polynomial regression. Our task is to model the data with polynomials and make predictions about new data that we have not seen.\n", "We will consider two qualitatively distinct situations: \n", "