This notebook serves as an introduction to logistic regression as well as to the new, extremely powerful TensorFlow library for Machine Learning (ML) from Google. We will also learn to use the versatile Pandas package for handling data. For those of you familiar with R, Pandas DataFrame objects are very similar to the data frame objects in R.

Throughout, we will work with the SUSY dataset. It is available from the UCI Machine Learning Repository, a comprehensive and useful collection of datasets relevant to ML.

Here is the description of the SUSY dataset we will be playing around with for the rest of the semester:

The data has been produced using Monte Carlo simulations. The first 8 features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first 8 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks and the dropout algorithm are presented in the original paper. The last 500,000 examples are used as a test set.

This dataset comes from this interesting paper by the UCI group: Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014).

So far, we have largely focused on supervised learning tasks such as linear regression, where the goal is to make predictions about continuous labels. Often, we are also interested in classification tasks, where the goal is to assign each data point to one of a discrete set of categories. The training data consists of a set of features and discrete labels. This type of data is called categorical data (the data comes in different categories).

Initially, we will focus on a binary classification task. In the SUSY dataset, the goal is to decide whether a data point represents signal (a "potential collision", labeled 1) or "noise" (labeled 0). This is done by looking at 18 features: the first 8 are "low-level" features that can be directly measured, and the last 10 are "higher-order" features constructed using physics intuition. In more detail:

The first column is the class label (1 for signal, 0 for background), followed by the 18 features (8 low-level features then 10 high-level features): lepton 1 pT, lepton 1 eta, lepton 1 phi, lepton 2 pT, lepton 2 eta, lepton 2 phi, missing energy magnitude, missing energy phi, MET_rel, axial MET, M_R, M_TR_2, R, MT2, S_R, M_Delta_R, dPhi_r_b, cos(theta_r1)

Our goal will be to use either the first 8 features or the full 18 features to predict whether an event is signal or noise.

One of the best understood and canonical methods for performing such a task is logistic regression. We will see that a deep understanding of logistic regression will introduce us to many of the ideas and techniques at the forefront of modern Machine Learning. In logistic regression, each set of features $\mathbf{x}_i$ is associated with a category $C_i \in \{0,1\}$, with $i=1\ldots n$. It is helpful to re-define $\mathbf{x}$ to be an extended vector $\mathbf{x}\rightarrow (1,\mathbf{x})$ (which just accounts for an intercept). Then, the probability that example $i$ belongs to category 1 is given by the sigmoid (Fermi) function

$$ P(C_i=1)=1-P(C_i=0)= {1 \over 1+ e^{-\mathbf{w}\cdot \mathbf{x}_i}}, $$ where $\mathbf{w}$ are the weights that define the logistic regression. Notice that this is just the Fermi function with excitation energy $E=-\mathbf{w}\cdot \mathbf{x}_i$.
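As a minimal sketch of the formula above, the following NumPy snippet prepends the constant 1 to each feature vector and evaluates the sigmoid. The helper names (`sigmoid`, `add_intercept`, `predict_proba`) and the toy numbers are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(a):
    """Logistic (Fermi) function f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def add_intercept(X):
    """Prepend a column of ones so that each row x becomes (1, x)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict_proba(w, X):
    """P(C_i = 1) for each row of the extended feature matrix X."""
    return sigmoid(X @ w)

# Toy data (made up for illustration): 3 examples, 2 features
X = add_intercept(np.array([[0.5, -1.0], [2.0, 0.3], [-1.5, 0.8]]))
w = np.zeros(X.shape[1])   # with w = 0, w.x = 0 and every probability is 0.5
p = predict_proba(w, X)
print(p)
```

With all weights set to zero the model is maximally uncertain, which is a convenient sanity check before any fitting is done.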

As before, we will maximize the log-likelihood of the observed data. Let us define the function $$ f(a)={1 \over 1+ e^{-a}}. $$ Notice that the derivative with respect to $a$ is given by $$ {df \over da}= f(1-f). $$
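For completeness, the derivative identity follows from the chain rule in one line: $$ {df \over da} = {e^{-a} \over (1+e^{-a})^2} = {1 \over 1+e^{-a}} \cdot {e^{-a} \over 1+e^{-a}} = f(a)\,[1-f(a)], $$ where the last step uses $1-f(a) = e^{-a}/(1+e^{-a})$.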

Define $f_i \equiv f(\mathbf{w}\cdot \mathbf{x}_i)$. Then, the likelihood of the data $\{ \mathbf{x}_i, C_i \}$ is given by $$ P(Data|\mathbf{w})= \prod_{i=1}^n f_i^{C_i}(1-f_i)^{1-C_i} $$ and the log-likelihood is given by $$ \log{P(Data|\mathbf{w})}= \sum_{i=1}^n \left[ C_i \log f_i + (1-C_i)\log(1-f_i) \right]. $$

The negative of the log-likelihood gives us the cross-entropy error function $$ \mathrm{Cross\,Entropy}=E(\mathbf{w})= -\sum_{i=1}^n \left[ C_i \log f_i + (1-C_i)\log(1-f_i) \right]. $$
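The cross-entropy can be evaluated directly from this definition. The sketch below assumes an extended feature matrix `X` (first column of ones) and 0/1 labels `C`. The clipping by `eps` is an added numerical safeguard against `log(0)` and is not part of the formula, and the toy data are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, C, eps=1e-12):
    """E(w) = -sum_i [C_i log f_i + (1 - C_i) log(1 - f_i)]."""
    f = np.clip(sigmoid(X @ w), eps, 1.0 - eps)  # keep f away from exactly 0 or 1
    return -np.sum(C * np.log(f) + (1.0 - C) * np.log(1.0 - f))

# Made-up toy data: 4 examples, intercept column plus one feature
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.7], [1.0, 1.9]])
C = np.array([0.0, 0.0, 1.0, 1.0])

E_zero = cross_entropy(np.zeros(2), X, C)           # every f_i = 0.5, so E = 4 log 2
E_good = cross_entropy(np.array([0.0, 2.0]), X, C)  # weights aligned with the labels
print(E_zero, E_good)
```

Weights that agree with the labels give a lower error than the uninformative choice $\mathbf{w}=0$, for which each example contributes $\log 2$.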

Using the formula for $df/da$ above together with the chain rule, notice that $$ \nabla E(\mathbf{w})=\sum_{i=1}^n (f_i-C_i)\mathbf{x}_i. $$ In other words, the gradient is a sum of the training examples $\mathbf{x}_i$, each weighted by the difference between the predicted probability $f_i$ and the true label $C_i$.
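One way to convince yourself of the gradient formula is to compare it against a finite-difference approximation of $E(\mathbf{w})$. The snippet below is a sketch on randomly generated data. The data, the step size `h`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, C):
    f = sigmoid(X @ w)
    return -np.sum(C * np.log(f) + (1.0 - C) * np.log(1.0 - f))

def gradient(w, X, C):
    """Analytic gradient sum_i (f_i - C_i) x_i, written as X^T (f - C)."""
    return X.T @ (sigmoid(X @ w) - C)

# Random toy problem: 20 examples, intercept plus 3 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
C = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)

# Central finite differences along each coordinate direction
h = 1e-6
numeric = np.array([
    (cross_entropy(w + h * e, X, C) - cross_entropy(w - h * e, X, C)) / (2 * h)
    for e in np.eye(4)
])
max_diff = np.max(np.abs(numeric - gradient(w, X, C)))
print(max_diff)
```

The two gradients should agree up to the small error of the finite-difference approximation.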

Notice that Maximum-Likelihood Estimation (MLE) is the same as minimizing the cross-entropy. There is no closed-form solution for the minimizing $\mathbf{w}$. One strategy is to start with an arbitrary $\mathbf{w}$ and then update our estimate based on our error function. In particular, we would like to nudge $\mathbf{w}$ in the direction where the error is decreasing the fastest. This is the idea behind gradient descent. Furthermore, we can show that the cross-entropy error function used in logistic regression has a unique minimum. Thus, we can perform this procedure with relative ease. (As a word of caution, however, note that there is a generic instability in the MLE procedure for linearly separable data.)
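The gradient descent idea can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the method used later in the notebook: the data are synthetic, the gradient is divided by $n$ (which only rescales the learning rate), and `learning_rate` and `n_steps` are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_descent(X, C, learning_rate=0.5, n_steps=2000):
    """Minimize the cross-entropy by repeatedly stepping against the gradient."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])                    # arbitrary starting point
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ w) - C) / n   # mean of (f_i - C_i) x_i
        w -= learning_rate * grad               # nudge w downhill
    return w

# Synthetic data drawn from a logistic model with known (made-up) weights
rng = np.random.default_rng(1)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
C = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_fit = gradient_descent(X, C)
accuracy = np.mean((sigmoid(X @ w_fit) > 0.5) == (C == 1.0))
print(w_fit, accuracy)
```

Because the labels are noisy draws from the model, the fitted weights should only roughly match `w_true`, and the training accuracy stays below 100%.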

Theoretically, one nice method for doing this is the *Newton-Raphson* method. In this method, we iteratively update the weights according to
$$
\mathbf{w}^{new} \leftarrow \mathbf{w}^{old} - \mathbf{H}^{-1} \nabla E(\mathbf{w}),
$$
where $\mathbf{H}$ is the Hessian matrix, the matrix of second derivatives of the error function $E(\mathbf{w})$. For OLS linear regression, one can show that this procedure yields the exact solution in a single step, since the error function is quadratic.
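The Newton-Raphson update can be sketched for logistic regression as follows. The Hessian used here, $\mathbf{H}=\sum_i f_i(1-f_i)\,\mathbf{x}_i\mathbf{x}_i^T$, is not stated in the text above; it follows from differentiating the gradient using $df/da=f(1-f)$. The synthetic data and the fixed number of iterations are assumptions made for illustration, and no step-size control is included.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_raphson(X, C, n_steps=10):
    """Iterate w <- w - H^{-1} grad E(w) for the cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        f = sigmoid(X @ w)
        grad = X.T @ (f - C)                      # sum_i (f_i - C_i) x_i
        H = X.T @ (X * (f * (1.0 - f))[:, None])  # sum_i f_i (1 - f_i) x_i x_i^T
        w = w - np.linalg.solve(H, grad)          # solve H d = grad instead of inverting H
    return w

# Synthetic data drawn from a logistic model with known (made-up) weights
rng = np.random.default_rng(1)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
C = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_newton = newton_raphson(X, C)
grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ w_newton) - C))
print(w_newton, grad_norm)
```

Solving the linear system with `np.linalg.solve` avoids forming $\mathbf{H}^{-1}$ explicitly. On well-behaved data like this, a handful of iterations is typically enough for the gradient to vanish to numerical precision.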

More generally, a number of generalizations of this idea have been proposed. We will refer to these kinds of methods as generalized gradient descent methods and discuss them extensively in what follows.

**Exercise:** In what follows, use Pandas to import the first 10,000 examples and call that the training data, then import the next 1,000 examples and call that the test data.


```
# Import the SUSY dataset
import pandas as pd
import tensorflow as tf  # not used in this cell; needed for the TensorFlow exercise

# Download SUSY.csv.gz from the UCI ML archive before running this cell
filename = "SUSY.csv.gz"
# Column 0 is the class label, followed by the 18 features
columns = ["signal", "lepton 1 pT", "lepton 1 eta", "lepton 1 phi", "lepton 2 pT", "lepton 2 eta",
           "lepton 2 phi", "missing energy magnitude", "missing energy phi", "MET_rel",
           "axial MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos(theta_r1)"]
# Load the first 10,000 rows as training data
df_train = pd.read_csv(filename, names=columns, compression='gzip', nrows=10000)
# Load the next 1,000 rows as test data (skip the rows already used for training)
df_test = pd.read_csv(filename, names=columns, compression='gzip', nrows=1000, skiprows=10000)
# Check the size of the objects (print() calls work in both Python 2 and 3)
print(df_test.shape)
print(df_train.shape)
```
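To compare the 8 low-level features with the full set of 18, the label column and the feature columns need to be separated. The sketch below shows one way to do this with positional indexing on a toy stand-in DataFrame; the column names `f1` to `f18` and the random values are placeholders, not real SUSY data. The same `iloc` slices should apply to `df_train` and `df_test`, since their first column is the label.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SUSY table: a label column followed by 18 feature columns
rng = np.random.default_rng(0)
toy_columns = ["signal"] + ["f%d" % k for k in range(1, 19)]
toy = pd.DataFrame(rng.normal(size=(5, 19)), columns=toy_columns)
toy["signal"] = rng.integers(0, 2, size=5)   # replace column 0 with 0/1 labels

y = toy["signal"].to_numpy()          # class labels
X_low = toy.iloc[:, 1:9].to_numpy()   # first 8 (low-level) features
X_full = toy.iloc[:, 1:].to_numpy()   # all 18 features
print(y.shape, X_low.shape, X_full.shape)
```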

We will now run logistic regression using MLE.

**Exercise:** Using the TensorFlow tutorial here, run logistic regression on the SUSY data for both the simple features (first 8 features) and the full feature space. How well did you do?

**Exercise:** Do the same thing using the scikit-learn package.

**Exercise:** Now repeat the exercise using L1 regularization and L2 regularization. Does this help your out-of-sample prediction? What is the difference between the two forms of regularization?
