{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 9: Using Random Forests to classify phases in the Ising Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Learning Goal\n", "\n", "The goal of this notebook is to show how one can employ ensemble methods such as Random Forests to classify the states of the 2D Ising model according to their phases. We discuss concepts like decision trees, extreme decision trees, and out-of-bag error. The notebook also introduces the powerful scikit-learn `ensemble` module.\n", "\n", "\n", "## Setting up the problem\n", "\n", "The Hamiltonian for the classical Ising model is given by\n", "\n", "$$H = -J\\sum_{\\langle ij\\rangle}S_{i}S_j,\\qquad \\qquad S_j\\in\\{\\pm 1\\}$$\n", "\n", "where the lattice site indices $i,j$ run over all nearest neighbors of a 2D square lattice of side $L$, and $J$ is some arbitrary interaction energy scale. We adopt periodic boundary conditions. Onsager proved that this model undergoes a phase transition in the thermodynamic limit from an ordered ferromagnet with all spins aligned to a disordered phase at the critical temperature $T_c/J=2/\\log(1+\\sqrt{2})\\approx 2.269$. For any finite system size, this critical point broadens into a critical region around $T_c$.\n", "\n", "We will use the same basic idea as we did for logistic regression. An interesting question to ask is whether one can train a statistical model to distinguish between the two phases of the Ising model. In other words, given an Ising state, we would like to classify whether it belongs to the ordered or the disordered phase, without any additional information other than the spin configuration itself. 
This categorical machine learning problem is well suited for ensemble methods and in particular Random Forests.\n", "\n", "To this end, we consider the 2D Ising model on a $40\\times 40$ square lattice, and use Monte Carlo (MC) sampling to prepare $10^4$ states at every fixed temperature $T$ out of a pre-defined set. Using Onsager's criterion, we can assign a label to each state according to its phase: $0$ if the state is disordered, and $1$ if it is ordered. \n", "\n", "It is well known that, near the critical temperature $T_c$, the ferromagnetic correlation length diverges, which, among other things, leads to a critical slowing down of the MC algorithm. Therefore, we expect identifying the phases to be harder in the critical region. With this in mind, consider the following three types of states: ordered ($T/J<2.0$), critical ($2.0\\leq T/J\\leq 2.5$) and disordered ($T/J>2.5$). We use both ordered and disordered states to train the random forest and, once the supervised training procedure is complete, we evaluate the performance of our classifier on unseen ordered, disordered and critical states. \n", "\n", "A link to the Ising dataset can be found at [https://physics.bu.edu/~pankajm/MLnotebooks.html](https://physics.bu.edu/~pankajm/MLnotebooks.html)." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.random.seed() # shuffle random seed generator\n", "\n", "# Ising model parameters\n", "L=40 # linear system size\n", "J=-1.0 # Ising interaction\n", "T=np.linspace(0.25,4.0,16) # set of temperatures\n", "T_c=2.26 # Onsager critical temperature in the TD limit" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import pickle, os\n", "from urllib.request import urlopen \n", "\n", "# path to data directory (for testing)\n", "#path_to_data=os.path.expanduser('~')+'/Dropbox/MachineLearningReview/Datasets/isingMC/'\n", "\n", "url_main = 'https://physics.bu.edu/~pankajm/ML-Review-Datasets/isingMC/';\n", "\n", "######### LOAD DATA\n", "# The data consists of 16*10000 samples taken in T=np.arange(0.25,4.0001,0.25):\n", "data_file_name = \"Ising2DFM_reSample_L40_T=All.pkl\" \n", "# The labels are obtained from the following file:\n", "label_file_name = \"Ising2DFM_reSample_L40_T=All_labels.pkl\"\n", "\n", "\n", "#DATA\n", "data = pickle.load(urlopen(url_main + data_file_name)) # pickle reads the file and returns the Python object (1D array, compressed bits)\n", "data = np.unpackbits(data).reshape(-1, 1600) # Decompress array and reshape for convenience\n", "data=data.astype('int')\n", "data[np.where(data==0)]=-1 # map 0 state to -1 (Ising variable can take values +/-1)\n", "\n", "#LABELS (convention is 1 for ordered states and 0 for disordered states)\n", "labels = pickle.load(urlopen(url_main + label_file_name)) # pickle reads the file and returns the Python object (here just a 1D array with the binary labels)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train shape: (104000, 1600)\n", "Y_train shape: (104000,)\n", "\n", "104000 train samples\n", "30000 critical samples\n", "26000 test samples\n" ] } ], "source": [ 
"###### define ML parameters\n", "from sklearn.model_selection import train_test_split\n", "train_to_test_ratio=0.8 # training samples\n", "\n", "# divide data into ordered, critical and disordered\n", "X_ordered=data[:70000,:]\n", "Y_ordered=labels[:70000]\n", "\n", "X_critical=data[70000:100000,:]\n", "Y_critical=labels[70000:100000]\n", "\n", "X_disordered=data[100000:,:]\n", "Y_disordered=labels[100000:]\n", "\n", "del data,labels\n", "\n", "# define training and test data sets\n", "X=np.concatenate((X_ordered,X_disordered))\n", "Y=np.concatenate((Y_ordered,Y_disordered))\n", "\n", "# pick random data points from ordered and disordered states \n", "# to create the training and test sets\n", "X_train,X_test,Y_train,Y_test=train_test_split(X,Y,train_size=train_to_test_ratio,test_size=1.0-train_to_test_ratio)\n", "\n", "print('X_train shape:', X_train.shape)\n", "print('Y_train shape:', Y_train.shape)\n", "print()\n", "print(X_train.shape[0], 'train samples')\n", "print(X_critical.shape[0], 'critical samples')\n", "print(X_test.shape[0], 'test samples')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "