Let us now apply linear regression to an example that is familiar from Statistical Mechanics: the Ising model. The goal of this notebook is to revisit the concepts of in-sample and out-of-sample errors, as well as $L2$- and $L1$-regularization, in an example that is more intuitive to physicists.
Consider the 1D Ising model with nearest-neighbor interactions
$$H[\boldsymbol{S}]=-J\sum_{j=1}^L S_{j}S_{j+1}$$
on a chain of length $L$ with periodic boundary conditions and $S_j=\pm 1$ Ising spin variables. In one dimension, this paradigmatic model has no phase transition at finite temperature.
We invite the reader who is unfamiliar with the properties of the Ising model to solve the following problems.
For a more detailed introduction, we refer the reader to one of the many textbooks on the subject (see, for instance, Goldenfeld, Lubensky, or Baxter).
Suppose your boss set $J=1$, drew a large number of spin configurations, and computed their Ising energies. Then, without telling you about the above Hamiltonian, he or she handed you a data set of $i=1\ldots n$ points of the form $\{(H[\boldsymbol{S}^i],\boldsymbol{S}^i)\}$. Your task is to learn the Hamiltonian using linear regression techniques.
import numpy as np
import scipy.sparse as sp
np.random.seed(12)
import warnings
# Comment this to turn on warnings
warnings.filterwarnings('ignore')
### define Ising model params
# system size
L=40
# create 10000 random Ising states
states=np.random.choice([-1, 1], size=(10000,L))
def ising_energies(states):
    """
    This function calculates the energies of the states in the nn Ising Hamiltonian
    """
    L = states.shape[1]
    J = np.zeros((L, L),)
    for i in range(L):
        J[i,(i+1)%L]=-1.0 # interaction between nearest-neighbors
    # compute energies
    E = np.einsum('...i,ij,...j->...',states,J,states)
    return E
# calculate Ising energies
energies=ising_energies(states)
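As a quick sanity check (this step is our addition, not part of the original notebook), one can compare the vectorized einsum energies against a direct evaluation of the nearest-neighbor sum for a few configurations:
# sanity check (our addition): compare the einsum energies with an explicit
# evaluation of H = -sum_j S_j S_{j+1} for J=1 and periodic boundary conditions
for s, e in zip(states[:3], energies[:3]):
    e_direct = -np.sum(s*np.roll(s,-1)) # np.roll implements the periodic shift S_{j+1}
    assert np.isclose(e, e_direct)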
First of all, we have to decide on the model class (i.e., the set of possible Hamiltonians) we will use to fit the data. In the absence of any prior knowledge, one sensible choice is the all-to-all Ising model
$$ H_\mathrm{model}[\boldsymbol{S}^i] = - \sum_{j=1}^L \sum_{k=1}^L J_{j,k}S_{j}^iS_{k}^i. $$ Notice that this model is uniquely defined by the non-local coupling strengths $J_{jk}$ which we want to learn. Importantly, this model is linear in ${\mathbf J}$ which makes it possible to use linear regression.
To apply linear regression, we would like to recast this model in the form $$ H_\mathrm{model}^i \equiv \mathbf{X}^i \cdot \mathbf{J}, $$
where the vectors $\mathbf{X}^i$ represent all two-body interactions $\{S_{j}^iS_{k}^i \}_{j,k=1}^L$, and the index $i$ runs over the samples in the data set. To make the analogy complete, we can also represent the dot product by a single index $p = \{j,k\}$, i.e. $\mathbf{X}^i \cdot \mathbf{J}=X^i_pJ_p$. Note that the regression model does not include the minus sign, so we expect to learn negative $J$'s.
# reshape Ising states into RL samples: S_iS_j --> X_p
states=np.einsum('...i,...j->...ij', states, states)
shape=states.shape
states=states.reshape((shape[0],shape[1]*shape[2]))
# build final data set
Data=[states,energies]
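Before fitting, it is also worth verifying (again our addition) that the flattened two-body features reproduce the energies when contracted with the true couplings, i.e. a matrix with $-1$ on the nearest-neighbor entries and zeros elsewhere:
# consistency check (our addition): X . J_true should reproduce the energies,
# where J_true is the true coupling matrix with J_true[j,(j+1)%L] = -1
J_true = np.zeros((L,L))
for i in range(L):
    J_true[i,(i+1)%L] = -1.0
assert np.allclose(Data[0].dot(J_true.flatten()), Data[1])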
As we already mentioned a few times in the review, learning is not fitting: the subtle difference is that once we fit the data to obtain a candidate model, we expect it to generalize to unseen data that was not used in the fitting procedure. For this reason, we begin by specifying training and test data sets.
# define number of samples
n_samples=400
# define train and test data sets
X_train=Data[0][:n_samples]
Y_train=Data[1][:n_samples] #+ np.random.normal(0,4.0,size=X_train.shape[0])
X_test=Data[0][n_samples:3*n_samples//2]
Y_test=Data[1][n_samples:3*n_samples//2] #+ np.random.normal(0,4.0,size=X_test.shape[0])
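The slicing above keeps the split deterministic; equivalently, one could let scikit-learn shuffle and split the data (an alternative we sketch here, not what is used below):
# alternative split (our addition, not used below): shuffle and split with scikit-learn
from sklearn.model_selection import train_test_split
X_tr, X_te, Y_tr, Y_te = train_test_split(Data[0][:3*n_samples//2], Data[1][:3*n_samples//2],
                                           train_size=n_samples, random_state=12)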
In what follows, the model performance (in-sample and out-of-sample) is evaluated using the so-called coefficient of determination, given by
\begin{align}
R^2 &= 1-\frac{u}{v},\\
u &= \sum_i \left(y^i_\mathrm{pred}-y^i_\mathrm{true}\right)^2,\\
v &= \sum_i \left(y^i_\mathrm{true}-\langle y_\mathrm{true}\rangle\right)^2,
\end{align}
where the sums run over the samples of the corresponding (training or test) data set and $\langle y_\mathrm{true}\rangle$ denotes the mean of the true values.
The best possible score is $1.0$, but it can also be negative. A constant model that always predicts the expected value of $y$, $\langle y_\mathrm{true}\rangle$, disregarding the input features, would get an $R^2$ score of $0.0$.
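To make this definition concrete, here is a minimal sketch (our addition) of how $R^2$ can be computed by hand and checked against scikit-learn's r2_score, which is the quantity reported by the score method of scikit-learn regressors:
# minimal sketch (our addition): compute R^2 by hand and compare with scikit-learn
from sklearn.metrics import r2_score

def R2(y_true, y_pred):
    u = np.sum((y_pred - y_true)**2)          # residual sum of squares
    v = np.sum((y_true - np.mean(y_true))**2) # total sum of squares
    return 1.0 - u/v

y_true = np.array([2.0, -1.0, 3.0, 0.5])
y_pred = np.array([1.8, -1.2, 2.9, 0.7])
assert np.isclose(R2(y_true, y_pred), r2_score(y_true, y_pred))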
from sklearn import linear_model
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn
%matplotlib inline
# set up Lasso and Ridge Regression models
leastsq=linear_model.LinearRegression()
ridge=linear_model.Ridge()
lasso = linear_model.Lasso()
# define error lists
train_errors_leastsq = []
test_errors_leastsq = []
train_errors_ridge = []
test_errors_ridge = []
train_errors_lasso = []
test_errors_lasso = []
# set regularisation strength values
lmbdas = np.logspace(-4, 5, 10)
# initialize lists to store the fitted coefficients
coefs_leastsq = []
coefs_ridge = []
coefs_lasso = []

for lmbda in lmbdas:
    ### ordinary least squares
    leastsq.fit(X_train, Y_train) # fit model
    coefs_leastsq.append(leastsq.coef_) # store weights
    # use the coefficient of determination R^2 as the performance of prediction.
    train_errors_leastsq.append(leastsq.score(X_train, Y_train))
    test_errors_leastsq.append(leastsq.score(X_test,Y_test))

    ### apply RIDGE regression
    ridge.set_params(alpha=lmbda) # set regularisation parameter
    ridge.fit(X_train, Y_train) # fit model
    coefs_ridge.append(ridge.coef_) # store weights
    # use the coefficient of determination R^2 as the performance of prediction.
    train_errors_ridge.append(ridge.score(X_train, Y_train))
    test_errors_ridge.append(ridge.score(X_test,Y_test))

    ### apply LASSO regression
    lasso.set_params(alpha=lmbda) # set regularisation parameter
    lasso.fit(X_train, Y_train) # fit model
    coefs_lasso.append(lasso.coef_) # store weights
    # use the coefficient of determination R^2 as the performance of prediction.
    train_errors_lasso.append(lasso.score(X_train, Y_train))
    test_errors_lasso.append(lasso.score(X_test,Y_test))

    ### plot Ising interaction J for the current value of lambda
    J_leastsq=np.array(leastsq.coef_).reshape((L,L))
    J_ridge=np.array(ridge.coef_).reshape((L,L))
    J_lasso=np.array(lasso.coef_).reshape((L,L))

    cmap_args=dict(vmin=-1., vmax=1., cmap='seismic')

    fig, axarr = plt.subplots(nrows=1, ncols=3)

    axarr[0].imshow(J_leastsq,**cmap_args)
    axarr[0].set_title('OLS \n Train$=%.3f$, Test$=%.3f$'%(train_errors_leastsq[-1], test_errors_leastsq[-1]),fontsize=16)
    axarr[0].tick_params(labelsize=16)

    axarr[1].imshow(J_ridge,**cmap_args)
    axarr[1].set_title('Ridge $\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' %(lmbda,train_errors_ridge[-1],test_errors_ridge[-1]),fontsize=16)
    axarr[1].tick_params(labelsize=16)

    im=axarr[2].imshow(J_lasso,**cmap_args)
    axarr[2].set_title('LASSO $\lambda=%.4f$\n Train$=%.3f$, Test$=%.3f$' %(lmbda,train_errors_lasso[-1],test_errors_lasso[-1]),fontsize=16)
    axarr[2].tick_params(labelsize=16)

    divider = make_axes_locatable(axarr[2])
    cax = divider.append_axes("right", size="5%", pad=0.05)
    cbar=fig.colorbar(im, cax=cax)

    cbar.ax.set_yticklabels(np.arange(-1.0, 1.0+0.25, 0.25),fontsize=14)
    cbar.set_label('$J_{i,j}$',labelpad=15, y=0.5,fontsize=20,rotation=0)

    fig.subplots_adjust(right=2.0)

    plt.show()
To quantify learning, we also plot the in-sample and out-of-sample performance ($R^2$ score) as a function of the regularization strength $\lambda$:
# Plot our performance on both the training and test data
plt.semilogx(lmbdas, train_errors_leastsq, 'b',label='Train (OLS)')
plt.semilogx(lmbdas, test_errors_leastsq,'--b',label='Test (OLS)')
plt.semilogx(lmbdas, train_errors_ridge,'r',label='Train (Ridge)',linewidth=1)
plt.semilogx(lmbdas, test_errors_ridge,'--r',label='Test (Ridge)',linewidth=1)
plt.semilogx(lmbdas, train_errors_lasso, 'g',label='Train (LASSO)')
plt.semilogx(lmbdas, test_errors_lasso, '--g',label='Test (LASSO)')
fig = plt.gcf()
fig.set_size_inches(10.0, 6.0)
#plt.vlines(alpha_optim, plt.ylim()[0], np.max(test_errors), color='k',
# linewidth=3, label='Optimum on test')
plt.legend(loc='lower left',fontsize=16)
plt.ylim([-0.1, 1.1])
plt.xlim([min(lmbdas), max(lmbdas)])
plt.xlabel(r'$\lambda$',fontsize=16)
plt.ylabel('Performance',fontsize=16)
plt.tick_params(labelsize=16)
plt.show()
Let us make a few remarks:
(i) The regularization parameter $\lambda$ (called alpha in the Scikit-learn documentation) affects the Ridge and LASSO regressions at scales separated by a few orders of magnitude. Notice that this is different from the data considered in Notebook 3, Section VI: Linear Regression (Diabetes). It is therefore considered good practice to always check the performance of a given model and data set as $\lambda$ is varied over multiple decades.
(ii) At small $\lambda$, all three models overfit the data, as can be seen from the deviation of the test scores from unity (dashed lines) while the training curves stay at unity. At very large $\lambda$, Ridge and LASSO underfit instead: the regularization drives the couplings towards zero and both the training and test scores collapse (OLS, which has no regularizer, is unaffected by $\lambda$).
(iii) While the OLS and Ridge regression test curves are monotonic, the LASSO test curve is not -- suggesting the optimal LASSO regularization parameter is $\lambda\approx 10^{-2}$. At this sweet spot, the Ising interaction weights ${\bf J}$ contain only nearest-neighbor terms (as did the model the data was generated from).
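One way to check this claim (a sketch we add here, taking $\lambda=0.01$ as the sweet spot) is to refit LASSO at that value and inspect the learned weight on each nearest-neighbor bond, which should be close to $-1$:
# sketch (our addition): refit LASSO at the sweet spot and inspect the couplings
lasso_opt = linear_model.Lasso(alpha=0.01).fit(X_train, Y_train)
J_opt = lasso_opt.coef_.reshape((L,L))
# total weight on each bond (j,j+1), combining the two directed matrix elements
bond_weights = np.array([J_opt[j,(j+1)%L] + J_opt[(j+1)%L,j] for j in range(L)])
print('test R^2 at lambda=0.01:', lasso_opt.score(X_test, Y_test))
print('nearest-neighbor bond weights:', bond_weights)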
Gauge degrees of freedom: recall that the uniform nearest-neighbor interaction strength $J_{j,k}=J$ which we used to generate the data was set to unity, $J=1$. Moreover, $J_{j,k}$ was NOT defined to be symmetric (we only used the $J_{j,j+1}$ elements, never the $J_{j,j-1}$ ones). The colorbar on the matrix-element plots above suggests that OLS and Ridge regression learn uniform, symmetric weights $J_{j,j\pm 1}\approx -0.5$. There is no mystery here: this amounts to taking into account both the $J_{j,j+1}$ and the $J_{j,j-1}$ terms, with the weight of each bond distributed symmetrically between them. LASSO, on the other hand, can break this symmetry (see the matrix-element plots for $\lambda=0.001$ and $\lambda=0.01$). Thus, we see how different regularization schemes can lead to learning equivalent models, but in different gauges. Any information we have about the symmetry of the unknown model that generated the data has to be reflected in the definition of the model and in the choice of regularization.
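To make the gauge argument concrete, here is a short sketch (our addition; the regularization strengths are chosen purely for illustration) that refits the three models and compares their couplings in the symmetric gauge $J_\mathrm{sym}=(J+J^T)/2$, where all of them should place roughly $-0.5$ on each directed nearest-neighbor element:
# sketch (our addition): compare the learned couplings in the symmetric gauge
fits = {'OLS':   linear_model.LinearRegression().fit(X_train, Y_train),
        'Ridge': linear_model.Ridge(alpha=1.0).fit(X_train, Y_train),   # alpha chosen for illustration
        'LASSO': linear_model.Lasso(alpha=0.01).fit(X_train, Y_train)}  # alpha chosen for illustration
for name, model in fits.items():
    J_fit = model.coef_.reshape((L,L))
    J_sym = 0.5*(J_fit + J_fit.T) # symmetric gauge
    nn = [J_sym[j,(j+1)%L] for j in range(L)]
    print(name, ': mean symmetrized nearest-neighbor coupling =', np.mean(nn))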