# Notebook 9: Using Random Forests to classify phases in the Ising Model¶

## Learning Goal¶

The goal of this notebook is to show how one can employ ensemble methods such as Random Forests to classify the states of the 2D Ising model according to their phases. We discuss concepts like decision trees, extreme decision trees, and out-of-bag error. The notebook also introduces the powerful scikit-learn Ensemble class.

## Setting up the problem¶

The Hamiltonian for the classical Ising model is given by

$$H = -J\sum_{\langle ij\rangle}S_{i}S_j,\qquad \qquad S_j\in\{\pm 1\}$$

where the lattice site indices $i,j$ run over all nearest neighbors of a 2D square lattice of side $L$, and $J$ is some arbitrary interaction energy scale. We adopt periodic boundary conditions. Onsager proved that this model undergoes a phase transition in the thermodynamic limit from an ordered ferromagnet with all spins aligned to a disordered phase at the critical temperature $T_c/J=1/\log(1+\sqrt{2})\approx 2.26$. For any finite system size, this critical point is expanded to a critical region around $T_c$.

We will use the same basic idea as we did for logistic regression. An interesting question to ask is whether one can train a statistical model to distinguish between the two phases of the Ising model. In other words, given an Ising state, we would like to classify whether it belongs to the ordered or the disordered phase, without any additional information other than the spin configuration itself. This categorical machine learning problem is well suited for ensemble methods and in particular Random Forests.

To this end, we consider the 2D Ising model on a $40\times 40$ square lattice, and use Monte-Carlo (MC) sampling to prepare $10^4$ states at every fixed temperature $T$ out of a pre-defined set. Using Onsager's criterion, we can assign a label to each state according to its phase: $0$ if the state is disordered, and $1$ if it is ordered.

It is well-known that, near the critical temperature $T_c$, the ferromagnetic correlation length diverges which, among others, leads to a critical slowing down of the MC algorithm. Therefore, we expect identifying the phases to be harder in the critical region. With this in mind, consider the following three types of states: ordered ($T/J<2.0$), critical ($2.0\leq T/J\leq 2.5)$ and disordered ($T/J>2.5$). We use both ordered and disordered states to train the random forest and, once the supervised training procedure is complete, we shall evaluate the performance of our classifier on unseen ordered, disordered and critical states.

A link to the Ising dataset can be found at https://physics.bu.edu/~pankajm/MLnotebooks.html.

In [1]:
import numpy as np

np.random.seed() # shuffle random seed generator

# Ising model parameters
L=40 # linear system size
J=-1.0 # Ising interaction
T=np.linspace(0.25,4.0,16) # set of temperatures
T_c=2.26 # Onsager critical temperature in the TD limit

In [9]:
import pickle, os
from urllib.request import urlopen

# path to data directory (for testing)
#path_to_data=os.path.expanduser('~')+'/Dropbox/MachineLearningReview/Datasets/isingMC/'

url_main = 'https://physics.bu.edu/~pankajm/ML-Review-Datasets/isingMC/';

# The data consists of 16*10000 samples taken in T=np.arange(0.25,4.0001,0.25):
data_file_name = "Ising2DFM_reSample_L40_T=All.pkl"
# The labels are obtained from the following file:
label_file_name = "Ising2DFM_reSample_L40_T=All_labels.pkl"

#DATA
data = pickle.load(urlopen(url_main + data_file_name)) # pickle reads the file and returns the Python object (1D array, compressed bits)
data = np.unpackbits(data).reshape(-1, 1600) # Decompress array and reshape for convenience
data=data.astype('int')
data[np.where(data==0)]=-1 # map 0 state to -1 (Ising variable can take values +/-1)

#LABELS (convention is 1 for ordered states and 0 for disordered states)
labels = pickle.load(urlopen(url_main + label_file_name)) # pickle reads the file and returns the Python object (here just a 1D array with the binary labels)

In [10]:
###### define ML parameters
from sklearn.model_selection import train_test_split
train_to_test_ratio=0.8 # training samples

# divide data into ordered, critical and disordered
X_ordered=data[:70000,:]
Y_ordered=labels[:70000]

X_critical=data[70000:100000,:]
Y_critical=labels[70000:100000]

X_disordered=data[100000:,:]
Y_disordered=labels[100000:]

del data,labels

# define training and test data sets
X=np.concatenate((X_ordered,X_disordered))
Y=np.concatenate((Y_ordered,Y_disordered))

# pick random data points from ordered and disordered states
# to create the training and test sets
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,train_size=train_to_test_ratio,test_size=1.0-train_to_test_ratio)

print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print()
print(X_train.shape[0], 'train samples')
print(X_critical.shape[0], 'critical samples')
print(X_test.shape[0], 'test samples')

X_train shape: (104000, 1600)
Y_train shape: (104000,)

104000 train samples
30000 critical samples
26000 test samples

In [4]:
##### plot a few Ising states
%matplotlib inline

#import ml_style as style
import matplotlib as mpl
import matplotlib.pyplot as plt
#mpl.rcParams.update(style.style)

from mpl_toolkits.axes_grid1 import make_axes_locatable

# set colourbar map
cmap_args=dict(cmap='plasma_r')

# plot states
fig, axarr = plt.subplots(nrows=1, ncols=3)

axarr[0].imshow(X_ordered[20001].reshape(L,L),**cmap_args)
#axarr[0].set_title('$\\mathrm{ordered\\ phase}$',fontsize=16)
axarr[0].set_title('ordered phase',fontsize=16)
axarr[0].tick_params(labelsize=16)

axarr[1].imshow(X_critical[10001].reshape(L,L),**cmap_args)
#axarr[1].set_title('$\\mathrm{critical\\ region}$',fontsize=16)
axarr[1].set_title('critical region',fontsize=16)
axarr[1].tick_params(labelsize=16)

im=axarr[2].imshow(X_disordered[50001].reshape(L,L),**cmap_args)
#axarr[2].set_title('$\\mathrm{disordered\\ phase}$',fontsize=16)
axarr[2].set_title('disordered phase',fontsize=16)
axarr[2].tick_params(labelsize=16)

plt.show()


## Random Forests¶

Hyperparameters

We start by training with Random Forests. As discussed in Sec. VIII of the review, Random Forests are ensemble models. Here we will use the sci-kit learn implementation of random forests. There are two main hyper-parameters that will be important in practice for the performance of the algorithm and the degree to which it overfits/underfits: the number of estimators in the ensemble and the depth of the trees used. The former is controlled by the parameter n_estimators whereas the latter (the complexity of the trees used) can be controlled in many distinct ways (min_samples_split, min_samples_leaf, min_impurity_decrease, etc). For our simple dataset, it does not really make much difference which one of these we use. We will just use the min_samples_split parameter that dictates how many samples need to be in each node of the classification tree. The bigger this number, the more coarse our trees and data partitioning.

In the code below, we will just consider extremely fine trees (min_samples_split=2) or extremely coarse trees (min_samples_split=10000). As we will see, both of these tree complexities are sufficient to distinguish the ordered from the disordered samples. The reason for this is that the ordered and disordered phases are distinguished by the magnetization order parameter which is an equally weighted sum of all features. However, if we want to train deep in these simple phases, and then use our algorithm to distinguish critical samples it is crucial we use more complex trees even though the performance on the disordered and ordered phases is indistinguishable for coarse and complex trees.

Out of Bag (OOB) Estimates

For more complicated datasets, how can we choose the right hyperparameters? We can actually make use of one of the most important and interesting features of ensemble methods that employ Bagging: out-of-bag (OOB) estimates. Whenever we bag data, since we are drawing samples with replacement, we can ask how well our classifiers do on data points that are not used in the training. This is the out-of-bag prediction error and plays a similar role to cross-validation error in other ML methods. Since this is the best proxy for out-of-sample prediction, we choose hyperparameters to minimize the out-of-bag error.

In [5]:
# Apply Random Forest

#This is the random forest classifier
from sklearn.ensemble import RandomForestClassifier

#This is the extreme randomized trees
from sklearn.ensemble import ExtraTreesClassifier

#import time to see how perforamance depends on run time

import time

import warnings
#Comment to turn on warnings
warnings.filterwarnings("ignore")

#We will check

min_estimators = 10
max_estimators = 101
classifer = RandomForestClassifier # BELOW WE WILL CHANGE for the case of extremly randomized forest

n_estimator_range=np.arange(min_estimators, max_estimators, 10)
leaf_size_list=[2,10000]

m=len(n_estimator_range)
n=len(leaf_size_list)

#Allocate Arrays for various quantities

RFC_OOB_accuracy=np.zeros((n,m))
RFC_train_accuracy=np.zeros((n,m))
RFC_test_accuracy=np.zeros((n,m))
RFC_critical_accuracy=np.zeros((n,m))
run_time=np.zeros((n,m))

print_flag=True

for i, leaf_size in enumerate(leaf_size_list):
# Define Random Forest Classifier
myRF_clf = classifer(
n_estimators=min_estimators,
max_depth=None,
min_samples_split=leaf_size, # minimum number of sample per leaf
oob_score=True,
random_state=0,
warm_start=True # this ensures that you add estimators without retraining everything
)
for j, n_estimator in enumerate(n_estimator_range):

print('n_estimators: %i, leaf_size: %i'%(n_estimator,leaf_size))

start_time = time.time()
myRF_clf.set_params(n_estimators=n_estimator)
myRF_clf.fit(X_train, Y_train)
run_time[i,j] = time.time() - start_time

# check accuracy
RFC_train_accuracy[i,j]=myRF_clf.score(X_train,Y_train)
RFC_OOB_accuracy[i,j]=myRF_clf.oob_score_
RFC_test_accuracy[i,j]=myRF_clf.score(X_test,Y_test)
RFC_critical_accuracy[i,j]=myRF_clf.score(X_critical,Y_critical)
if print_flag:
result = (run_time[i,j], RFC_train_accuracy[i,j], RFC_OOB_accuracy[i,j], RFC_test_accuracy[i,j], RFC_critical_accuracy[i,j])
print('{0:<15}{1:<15}{2:<15}{3:<15}{4:<15}'.format("time (s)","train score", "OOB estimate","test score", "critical score"))
print('{0:<15.4f}{1:<15.4f}{2:<15.4f}{3:<15.4f}{4:<15.4f}'.format(*result))

n_estimators: 10, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
9.2898         1.0000         0.9935         1.0000         0.8046
n_estimators: 20, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
10.6592        1.0000         0.9998         1.0000         0.8190
n_estimators: 30, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
11.5031        1.0000         1.0000         1.0000         0.8251
n_estimators: 40, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
13.0026        1.0000         1.0000         1.0000         0.8279
n_estimators: 50, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
16.5471        1.0000         1.0000         1.0000         0.8311
n_estimators: 60, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
17.3193        1.0000         1.0000         1.0000         0.8326
n_estimators: 70, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
18.7276        1.0000         1.0000         1.0000         0.8323
n_estimators: 80, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
21.0295        1.0000         1.0000         1.0000         0.8328
n_estimators: 90, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
22.3827        1.0000         1.0000         1.0000         0.8321
n_estimators: 100, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
22.8116        1.0000         1.0000         1.0000         0.8326
n_estimators: 10, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
6.8829         0.9990         0.9901         0.9992         0.6869
n_estimators: 20, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
9.5803         0.9992         0.9984         0.9995         0.6870
n_estimators: 30, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
9.7861         0.9992         0.9990         0.9995         0.6866
n_estimators: 40, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
11.4167        0.9992         0.9992         0.9995         0.6878
n_estimators: 50, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
13.7088        0.9992         0.9992         0.9995         0.6869
n_estimators: 60, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
14.7469        0.9992         0.9992         0.9995         0.6871
n_estimators: 70, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
16.5755        0.9992         0.9992         0.9995         0.6866
n_estimators: 80, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
18.9235        0.9992         0.9992         0.9995         0.6863
n_estimators: 90, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
20.5959        0.9992         0.9992         0.9995         0.6861
n_estimators: 100, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
22.3674        0.9992         0.9992         0.9995         0.6865

In [6]:
plt.figure()
plt.plot(n_estimator_range,RFC_train_accuracy[1],'--b^',label='Train (coarse)')
plt.plot(n_estimator_range,RFC_test_accuracy[1],'--r^',label='Test (coarse)')
plt.plot(n_estimator_range,RFC_critical_accuracy[1],'--g^',label='Critical (coarse)')

plt.plot(n_estimator_range,RFC_train_accuracy[0],'o-b',label='Train (fine)')
plt.plot(n_estimator_range,RFC_test_accuracy[0],'o-r',label='Test (fine)')
plt.plot(n_estimator_range,RFC_critical_accuracy[0],'o-g',label='Critical (fine)')

#plt.semilogx(lmbdas,train_accuracy_SGD,'*--b',label='SGD train')

plt.xlabel('$N_\mathrm{estimators}$')
plt.ylabel('Accuracy')
plt.savefig("Ising_RF.pdf",bbox_extra_artists=(lgd,), bbox_inches='tight')

plt.show()

plt.plot(n_estimator_range, run_time[1], '--k^',label='Coarse')
plt.plot(n_estimator_range, run_time[0], 'o-k',label='Fine')
plt.xlabel('$N_\mathrm{estimators}$')
plt.ylabel('Run time (s)')

plt.legend(loc=2)
#plt.savefig("Ising_RF_Runtime.pdf")

plt.show()


## Extremely Randomized Trees¶

As discussed in the main text, the effectiveness of ensemble methods generally increases as the correlations between members of the ensemble decrease. This idea has been leveraged to make methods that introduce even more randomness into the ensemble by randomly choosing features to split on as well as randomly choosing thresholds to split on. See Section 4.3 of Louppe 2014 arxiv:1407.7502.

Here we will make use of the scikit-learn function ExtremeTreesClassifier and we will just rerun what we did above. Since there is extra randomization compared to random forests, one can imagine that the performance of the critical samples will be much worse. Indeed, this is the case.

In [7]:
#This is the extreme randomized trees
from sklearn.ensemble import ExtraTreesClassifier

#import time to see how perforamance depends on run time

import time

import warnings
#Comment to turn on warnings
warnings.filterwarnings("ignore")

#We will check

min_estimators = 10
max_estimators = 101
classifer = ExtraTreesClassifier # only changing this

n_estimator_range=np.arange(min_estimators, max_estimators, 10)
leaf_size_list=[2,10000]

m=len(n_estimator_range)
n=len(leaf_size_list)

#Allocate Arrays for various quantities

ETC_OOB_accuracy=np.zeros((n,m))
ETC_train_accuracy=np.zeros((n,m))
ETC_test_accuracy=np.zeros((n,m))
ETC_critical_accuracy=np.zeros((n,m))
run_time=np.zeros((n,m))

print_flag=True

for i, leaf_size in enumerate(leaf_size_list):
# Define Random Forest Classifier
myRF_clf = classifer(
n_estimators=min_estimators,
max_depth=None,
min_samples_split=leaf_size, # minimum number of sample per leaf
oob_score=True,
bootstrap=True,
random_state=0,
warm_start=True # this ensures that you add estimators without retraining everything
)
for j, n_estimator in enumerate(n_estimator_range):

print('n_estimators: %i, leaf_size: %i'%(n_estimator,leaf_size))

start_time = time.time()
myRF_clf.set_params(n_estimators=n_estimator)
myRF_clf.fit(X_train, Y_train)
run_time[i,j] = time.time() - start_time

# check accuracy
ETC_train_accuracy[i,j]=myRF_clf.score(X_train,Y_train)
ETC_OOB_accuracy[i,j]=myRF_clf.oob_score_
ETC_test_accuracy[i,j]=myRF_clf.score(X_test,Y_test)
ETC_critical_accuracy[i,j]=myRF_clf.score(X_critical,Y_critical)
if print_flag:
result = (run_time[i,j], ETC_train_accuracy[i,j], ETC_OOB_accuracy[i,j], ETC_test_accuracy[i,j], ETC_critical_accuracy[i,j])
print('{0:<15}{1:<15}{2:<15}{3:<15}{4:<15}'.format("time (s)","train score", "OOB estimate","test score", "critical score"))
print('{0:<15.4f}{1:<15.4f}{2:<15.4f}{3:<15.4f}{4:<15.4f}'.format(*result))

n_estimators: 10, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
6.9171         1.0000         0.9932         1.0000         0.8024
n_estimators: 20, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
8.8079         1.0000         0.9997         1.0000         0.8180
n_estimators: 30, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
10.4896        1.0000         0.9999         1.0000         0.8264
n_estimators: 40, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
13.4032        1.0000         1.0000         1.0000         0.8298
n_estimators: 50, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
14.5475        1.0000         0.9999         1.0000         0.8307
n_estimators: 60, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
15.7837        1.0000         1.0000         1.0000         0.8318
n_estimators: 70, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
18.3969        1.0000         1.0000         1.0000         0.8336
n_estimators: 80, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
20.1621        1.0000         1.0000         1.0000         0.8341
n_estimators: 90, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
21.5447        1.0000         1.0000         1.0000         0.8357
n_estimators: 100, leaf_size: 2
time (s)       train score    OOB estimate   test score     critical score
25.8101        1.0000         1.0000         1.0000         0.8357
n_estimators: 10, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
6.5247         0.9991         0.9900         0.9993         0.6834
n_estimators: 20, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
8.3836         0.9992         0.9983         0.9995         0.6851
n_estimators: 30, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
9.8547         0.9992         0.9990         0.9995         0.6859
n_estimators: 40, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
11.0481        0.9992         0.9992         0.9995         0.6870
n_estimators: 50, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
13.4941        0.9992         0.9992         0.9995         0.6866
n_estimators: 60, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
15.2564        0.9992         0.9992         0.9995         0.6869
n_estimators: 70, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
17.3889        0.9992         0.9992         0.9995         0.6860
n_estimators: 80, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
18.4751        0.9992         0.9992         0.9995         0.6858
n_estimators: 90, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
20.0756        0.9992         0.9992         0.9995         0.6859
n_estimators: 100, leaf_size: 10000
time (s)       train score    OOB estimate   test score     critical score
21.8532        0.9992         0.9992         0.9995         0.6862

In [8]:
plt.figure()
plt.plot(n_estimator_range,ETC_train_accuracy[1],'--b^',label='Train (coarse)')
plt.plot(n_estimator_range,ETC_test_accuracy[1],'--r^',label='Test (coarse)')
plt.plot(n_estimator_range,ETC_critical_accuracy[1],'--g^',label='Critical (coarse)')

plt.plot(n_estimator_range,ETC_train_accuracy[0],'o-b',label='Train (fine)')
plt.plot(n_estimator_range,ETC_test_accuracy[0],'o-r',label='Test (fine)')
plt.plot(n_estimator_range,ETC_critical_accuracy[0],'o-g',label='Critical (fine)')

#plt.semilogx(lmbdas,train_accuracy_SGD,'*--b',label='SGD train')

plt.xlabel('$N_\mathrm{estimators}$')
plt.ylabel('Accuracy')
plt.savefig("Ising_RF.pdf",bbox_extra_artists=(lgd,), bbox_inches='tight')

plt.show()

plt.plot(n_estimator_range, run_time[1], '--k^',label='Coarse')
plt.plot(n_estimator_range, run_time[0], 'o-k',label='Fine')
plt.xlabel('$N_\mathrm{estimators}$')
plt.ylabel('Run time (s)')

plt.legend(loc=2)
#plt.savefig("Ising_ETC_Runtime.pdf")

plt.show()


### Exercises:¶

• [Random Forest] Consider $B$ random variables, each with variance $\sigma^2$. Show that **1)** If these random variables are i.i.d., then their average has a variance $\frac{1}{B}\sigma^2$. **2)** If they are only i.d. (i.e. identically distributed but not necessarily independent) with positive pairwise correlation $\rho$, then the variance of their average is $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$. In this case, what does $B\gg 1$ imply? Justify that *by random selection of input features, random forest improves the variance reduction of bagging by reducing the correlation between the trees without dramatic increase of variance*.
• [Random Forest] Consider a random forest $G$ consisting of $K$ binary classification trees, $\{g_k| k=1,\cdots K\}$, where $K$ is an odd integer. Suppose each tree evaluates classification error based on binary counts (i.e. 0/1 error) with $E_{out}(g_k)=e_k$, prove or disprove that $\frac{2}{K+1}\sum_{k=1}^K e_k$ upper bounds $E_{out}(G)$.
• [OOB] For a data set with $N$ samples, what's the probability that one of samples, say $(\boldsymbol{x}_n,y_n)$, is never sampled after bootstrapping $N'$ times? Show that if $N'=N$ and $N\gg 1$, this probability is approximately $e^{-1}$.
• [OOB] Following the previous question, if $N'=pN$ and assuming $N\gg 1$, argue that $Ne^{-p}$ examples in the data set will never be sampled at all.
• [OOB- programming] We argued that OOB is a good proxy for out-of-sample prediction due to bootstrapping. However, in practice OOB tends to give overly pessimistic estimate. To explore this, now instead of using OOB, try to cross-validation and redo all the analysis. You may find this tutorial [this tutorial](http://scikit-learn.org/stable/modules/cross_validation.html) on Scikit useful.