The goal of this notebook is to familiarize readers with various energy-based generative models, including Restricted Boltzmann Machines (RBMs) with Gaussian and Bernoulli units and Deep Boltzmann Machines (DBMs), as well as techniques for training these models, including contrastive divergence (CD) and persistent contrastive divergence (PCD). We will also discuss how to generate new examples (commonly called fantasy particles in the ML literature). The notebook also introduces the Paysage package from UnLearn.AI for quickly building and training these models.
In this notebook, we study the MNIST dataset using generative models.
Let us adopt a different perspective on the MNIST dataset. Generative models, as the name suggests, are used to generate brand new data (images) by learning to imitate those in a given dataset. If you wish, they can be regarded as a kind of "creative ML". This is achieved by extracting/learning a set of features that form the backbone of the entire dataset and are thus characteristic of it. These features are encoded in the weights of the model.
Mathematically speaking, generative models are designed and trained to learn an approximation for the probability distribution that generated the data. Intuitively, this task is much more complex and sophisticated than the image recognition task. Therefore, we shall approach the problem in a number of steps of increasing complexity. For a more detailed discussion on the theory, we invite the reader to check out Secs. XV and XVI of the review.
Below, we analyze four different generative models: the Hopfield model, a Restricted Boltzmann Machine (RBM), a regularized RBM with sparse weights, and a Deep Boltzmann Machine (DBM). In all these cases, we first set up and train the models. After that, we compare the results. Last, we open up the black box of each model and visualise a set of features it learned.
In this notebook, we use an open-source Python package for energy-based models called paysage. Paysage requires python>3.5; we recommend using the package within an Anaconda environment.
To install paysage, run the following from the root directory of the cloned repository:
pip install .
Documentation for paysage is available at https://github.com/drckf/paysage/tree/master/docs.
By default, computations in paysage are performed using numpy/numexpr/numba on the CPU. If you have installed PyTorch, you can switch to the pytorch backend by changing the setting in paysage/backends/config.json to pytorch.
Let us import the packages required for this notebook, in particular paysage.
from __future__ import print_function, division
import os
import paysage
import numpy as np
import pandas as pd
# for Boltzmann machines
from paysage import preprocess as pre
from paysage.layers import BernoulliLayer, GaussianLayer
from paysage.models import BoltzmannMachine
from paysage import batch
from paysage import fit
from paysage import optimizers
from paysage import samplers
from paysage import backends as be
from paysage import schedules
from paysage import penalties as pen
# fix random seed to ensure deterministic behavior
np.random.seed(137)
be.set_seed(137)
As we mentioned in the introduction, we use the MNIST dataset of handwritten digits to study the Hopfield model and various variants of RBMs.
The MNIST dataset comprises $70000$ handwritten digits, each of which comes as a square image divided into a $28\times 28$ pixel grid. Every pixel takes one of $256$ shades of grey interpolating between white and black, so each pixel assumes a value in the set $\{0,1,\dots,255\}$. There are $10$ categories in the problem, corresponding to the ten digits. In previous notebooks, we formulated a classification task for the MNIST dataset and studied it using discriminative supervised learning: Logistic Regression and Deep Neural Networks.
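To fix conventions, here is a small illustrative sketch of how a single image is represented as a vector and binarized; the threshold of 127 below is purely illustrative and need not coincide with the cutoff used by the pre.binarize_color transform applied later.
image = np.random.randint(0, 256, size=(28, 28))  # a made-up greyscale "digit"
flat = image.reshape(784)                         # each data point is a 784-dimensional vector
binary = (flat > 127).astype(float)               # illustrative binarization to {0, 1}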
This dataset can be fetched from the web using paysage and stored as an HDF5 file. For this, go to the /examples/mnist/ directory, which contains the file download_mnist.py, and run:
python3 download_mnist.py
The resulting file contains the keys train/images, train/labels, test/images, and test/labels, and is compressed to about 15 Mb in size.
As our first step, we will set up the paths to the data and shuffle the dataset. The shuffled dataset will not be compressed (for faster reading during training) so it will be about 56 Mb in size.
Why shuffle the data? - Training with stochastic gradient descent means we will be using small minibatches of data (maybe 50 examples) to compute the gradient at each step. If the data have an order, then the estimates for the gradients computed from the minibatches will be biased. Shuffling the data ensures that the gradient estimates are unbiased (though still noisy).
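To make this concrete, here is a toy illustration (with made-up labels, not the MNIST file) of how an ordered dataset biases the composition of the first minibatch, while a shuffled one does not:
# 1000 toy labels sorted by class, as if the file were stored in order
labels_sorted = np.repeat(np.arange(10), 100)
labels_shuffled = np.random.permutation(labels_sorted)
# the first ordered minibatch contains only the digit 0; the shuffled one mixes all classes
print("ordered minibatch mean label: ", labels_sorted[:50].mean())    # 0.0
print("shuffled minibatch mean label:", labels_shuffled[:50].mean())  # ~4.5, up to minibatch noise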
paysage_path = os.path.dirname(os.path.dirname(paysage.__file__))
mnist_path = os.path.join(paysage_path, "examples", "mnist", "mnist.h5")
shuffled_mnist_path = os.path.join(paysage_path, "examples", "mnist", "shuffled_mnist.h5")
print("path to mnist data:")
print(mnist_path)
if not os.path.exists(mnist_path):
raise IOError("{} does not exist. run mnist/download_mnist.py to fetch from the web".format(mnist_path))
if not os.path.exists(shuffled_mnist_path):
batch.DataShuffler(mnist_path, shuffled_mnist_path, complevel=0).shuffle()
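As a quick, optional sanity check, we can list the keys stored in the downloaded HDF5 file with pandas (already imported above); they should match the keys described earlier.
with pd.HDFStore(mnist_path, mode='r') as store:
    print(store.keys())  # e.g. ['/test/images', '/test/labels', '/train/images', '/train/labels']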
Next, we create a data generator, which splits the data into training and validation sets and separates them into minibatches of size batch_size. Before we begin training, we set data into training mode.
To monitor the progress of performance metrics during training, we define the variable performance, which tells Paysage to measure the reconstruction error on the validation set. Possible metrics include the reconstruction error (used in this example) and metrics related to the difference in energy between random samples and samples from the model (see metrics.md in the Paysage documentation for a complete list).
##### set up minibatch data generator
# batch size
batch_size=100
transform = pre.Transformation(pre.binarize_color)
# create data generator object with minibatches
samples = be.float_tensor(pd.read_hdf(shuffled_mnist_path, key='train/images').values)
data = batch.in_memory_batch(samples, batch_size, train_fraction=0.95, transform=transform)
# reset the data generator in training mode
data.reset_generator(mode='train')
Having loaded and preprocessed the data, we now move on to construct a Hopfield model. To do this, we use the BoltzmannMachine class with a visible BernoulliLayer and a hidden GaussianLayer. Note that the visible layer has the same size as the input data points, which can be read off data.ncols. The number of hidden units is num_hidden_units. We also fix the mean and variance of the Gaussian layer to zero and unity, respectively (the notation here is inspired by the terminology used for Variational Autoencoders).
We choose to train the model with the ADAM optimizer. To ensure convergence, we attenuate the learning_rate hyperparameter according to a PowerLawDecay schedule: learning_rate$(t)=$ initial$/(1+$ coefficient $\times t)$. It will prove convenient to define the function ADAM_optimizer() for this purpose.
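As a quick check of the decay formula (a standalone sketch, not Paysage's internal implementation):
# learning_rate(t) = initial / (1 + coefficient * t)
def power_law_decay(t, initial=1e-2, coefficient=1.0):
    return initial / (1 + coefficient * t)
print([round(power_law_decay(t), 5) for t in range(5)])  # [0.01, 0.005, 0.00333, 0.0025, 0.002]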
##### create hopfield model
# hidden units
num_hidden_units=200
# set up the model
vis_layer = BernoulliLayer(data.ncols)
hid_layer = GaussianLayer(num_hidden_units)
hopfield = BoltzmannMachine([vis_layer, hid_layer])
# fix the mean and standard deviation of the hidden layer at 0 and 1, respectively
hopfield.layers[1].set_fixed_params(['loc', 'log_var'])
# set up an optimizer method (ADAM in this case)
def ADAM_optimizer(initial,coefficient):
# define learning rate attenuation schedule
learning_rate=schedules.PowerLawDecay(initial=initial,coefficient=coefficient)
# return optimizer object
return optimizers.ADAM(stepsize=learning_rate)
Next, we compile and train the model. First, we initialize the model using the initialize method, which accepts the data as a required argument. We choose the glorot_normal initialization routine (see the discussion in the review). Second, we define an optimizer by calling the function ADAM_optimizer() defined above, and store the object under the name opt. To define a Monte Carlo sampler, one uses the from_batch method of the SequentialMC class, passing the model and the data. Then, we create an SGD object called trainer to train the model using Persistent Contrastive Divergence (pcd) with a fixed number of monte_carlo_steps; the reconstruction error can be monitored during training. Finally, we train the model in epochs (see the variable num_epochs) by calling the train() method of trainer. These steps are universal for shallow generative models, so it is convenient to combine them in the function train_model().
# define function to compile and train model
num_epochs=20 # training epochs
monte_carlo_steps=1 # number of MC sampling steps
def train_model(model,num_epochs,monte_carlo_steps):
# make a simple guess for the initial parameters of the model
model.initialize(data,method='glorot_normal')
# set optimizer
opt=ADAM_optimizer(1E-2,1.0)
trainer = fit.SGD(model, data)
trainer.train(opt, num_epochs, method=fit.pcd, mcsteps=monte_carlo_steps)
# train hopfield model
train_model(hopfield,num_epochs,monte_carlo_steps)
We can easily create a Bernoulli RBM and train it using the functions defined above as follows:
##### Bernoulli RBM
vis_layer = BernoulliLayer(data.ncols)
hid_layer = BernoulliLayer(num_hidden_units)
rbm = BoltzmannMachine([vis_layer, hid_layer])
# train Bernoulli RBM
train_model(rbm,num_epochs,monte_carlo_steps)
Constructing a Bernoulli RBM with L1 regularization is also straightforward in Paysage, using the add_penalty method, which accepts a dictionary as input.
##### Bernoulli RBM with L1 regularizer
vis_layer = BernoulliLayer(data.ncols)
hid_layer = BernoulliLayer(num_hidden_units)
rbm_L1 = BoltzmannMachine([vis_layer, hid_layer])
rbm_L1.connections[0].weights.add_penalty({'matrix': pen.l1_penalty(1e-3)})
# train Bernoulli RBM with L1 regularizer
train_model(rbm_L1,num_epochs,monte_carlo_steps)
To define a deep Boltzmann machine (DBM), we just add more layers, and an L1 penalty for every layer.
Recalling the essential trick of layer-wise pre-training to prepare the weights of the DBM, we define a pretrainer as an object of the LayerwisePretrain class (see the code snippet below). This results in a slight modification of the function train_model, which we call train_deep_model.
##### Deep Boltzmann Machine
# set up the model
dbm = BoltzmannMachine([BernoulliLayer(data.ncols), # visible layer
BernoulliLayer(num_hidden_units), # hidden layer 1
BernoulliLayer(num_hidden_units) # hidden layer 2
])
# add an L1 penalty to the weights
for conn in dbm.connections:
conn.weights.add_penalty({'matrix':pen.l1_penalty(1e-3)})
# add pre-training
def train_deep_model(model,num_epochs,monte_carlo_steps):
# make a simple guess for the initial parameters of the model
model.initialize(data,method='glorot_normal')
# set SGD pre-train optimizer
opt=ADAM_optimizer(1E-2,1.0)
# pre-train model
pretrainer=fit.LayerwisePretrain(model,data)
pretrainer.train(opt, num_epochs, method=fit.pcd, mcsteps=monte_carlo_steps, init_method="glorot_normal")
# set SGD train optimizer
opt=ADAM_optimizer(1E-3,1.0)
# train model
trainer=fit.SGD(model,data)
trainer.train(opt,num_epochs,method=fit.pcd,mcsteps=monte_carlo_steps)
# train DBM
train_deep_model(dbm,num_epochs,monte_carlo_steps)
Let us look at a few random examples to get an idea of what the data we are dealing with actually looks like. To do this, we define the function plot_image_grid().
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.gridspec as gs
import matplotlib.cm as cm
import seaborn as sns
# make sure the plots are shown in the notebook
def plot_image_grid(image_array, shape, vmin=0, vmax=1, cmap=cm.gray_r, row_titles=None):
array = be.to_numpy_array(image_array)
nrows, ncols = array.shape[:-1]
f = plt.figure(figsize=(2*ncols, 2*nrows))
grid = gs.GridSpec(nrows, ncols)
axes = [[plt.subplot(grid[i,j]) for j in range(ncols)] for i in range(nrows)]
for i in range(nrows):
for j in range(ncols):
sns.heatmap(np.reshape(array[i][j], shape),
ax=axes[i][j], cmap=cmap, cbar=False, vmin=vmin, vmax=vmax)
axes[i][j].set(yticks=[])
axes[i][j].set(xticks=[])
if row_titles is not None:
for i in range(nrows):
axes[i][0].set_ylabel(row_titles[i], fontsize=36)
plt.tight_layout()
plt.show()
plt.close(f)
image_shape = (28, 28) # 28x28 = 784 pixels in every image
num_to_plot = 8 # number of data points to plot
examples = data.get(mode='train') # shape (batch_size, 784)
example_plot = plot_image_grid(np.expand_dims(examples[:num_to_plot], 0),
image_shape, vmin=0, vmax=1)
Having trained our models, let us see how they perform by computing some reconstructions from the validation data.
Recall that a reconstruction ${\bf v'}$ of a given data point ${\bf x}$ is computed in two steps: (i) we fix the visible layer ${\bf v}={\bf x}$ to be the data, and use MC sampling to find the state of the hidden layer ${\bf h}$ which maximizes the probability distribution $p({\bf h}\vert{\bf v})$, (ii) fixing the same obtained state ${\bf h}$, we find the reconstruction of the visible layer ${\bf v'}$ which maximizes the probability $p({\bf v'}\vert{\bf h})$. In the case of a DBM, the forward pass continues until we reach the last of the hidden layers, and the backward pass goes in reverse.
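For intuition, here is a minimal sketch of one such up-down pass for a Bernoulli-Bernoulli RBM with made-up parameters W, a, b; this is a conceptual illustration only, not the Paysage implementation used below.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n_vis, n_hid = 784, 200
W = 0.01 * np.random.randn(n_vis, n_hid)   # toy weights
a, b = np.zeros(n_vis), np.zeros(n_hid)    # toy visible/hidden biases
v = np.random.randint(0, 2, size=n_vis)    # a binary "data" vector
# up pass: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij)
h = (np.random.rand(n_hid) < sigmoid(b + v @ W)).astype(float)
# down pass: the reconstruction is the mean of p(v'_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j)
v_recon = sigmoid(a + W @ h)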
To compute reconstructions, we define a MC sampler based on the trained model. The starting point for the MC sampler is set using the set_state() method. To compute reconstructions, we need to keep the probability distribution encoded in the model fixed, which is done with the help of the deterministic_iteration method; it takes the number of weights num_weights in the model and the state of the sampler sampler.state as required arguments. We combine these steps in the function compute_reconstructions.
##### compute reconstructions
def compute_reconstructions(model, data):
"""
Computes reconstructions of the input data.
Input v -> h -> v' (one pass up one pass down)
Args:
model: a model
data: a tensor of shape (num_samples, num_visible_units)
Returns:
tensor of shape (num_samples, num_visible_units)
"""
recons = model.compute_reconstructions(data).get_visible()
return be.to_numpy_array(recons)
examples = data.get(mode='validate') # shape (batch_size, 784)
data.reset_generator(mode='validate') # reset the generator to the beginning of the validation set
hopfield_reconstructions = compute_reconstructions(hopfield, examples[:num_to_plot])
rbm_reconstructions = compute_reconstructions(rbm, examples[:num_to_plot])
rbm_L1_reconstructions = compute_reconstructions(rbm_L1, examples[:num_to_plot])
dbm_reconstructions = compute_reconstructions(dbm, examples[:num_to_plot])
reconstruction_plot = plot_image_grid(
np.array([examples[:num_to_plot],
hopfield_reconstructions,
rbm_reconstructions,
rbm_L1_reconstructions,
dbm_reconstructions]),
image_shape, vmin=0, vmax=1, row_titles=["Data", "Hopfield", "RBM", "RBM (L1)", "DBM"])
Once we have the trained models ready, we can use MC to draw samples from the corresponding probability distributions; such samples are called "fantasy particles". To this end, let us draw a random_sample from the validation data and compute the model_state. Next, we define a MC sampler based on the model and set its state to model_state. To compute the fantasy particles, we do layer-wise Gibbs sampling for a total of num_steps equilibration steps. The last step (controlled by the boolean mean_field) is a final mean-field iteration.
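Conceptually, generating a fantasy particle amounts to running an alternating Gibbs chain starting from a random visible state. The toy sketch below (with made-up Bernoulli-Bernoulli RBM parameters, not the Paysage sampler used next) illustrates the idea:
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n_vis, n_hid = 784, 200
W, a, b = 0.01 * np.random.randn(n_vis, n_hid), np.zeros(n_vis), np.zeros(n_hid)  # toy parameters
v = (np.random.rand(n_vis) < 0.5).astype(float)  # random initial visible state
for _ in range(100):                             # equilibration steps
    h = (np.random.rand(n_hid) < sigmoid(b + v @ W)).astype(float)  # sample h ~ p(h|v)
    v = (np.random.rand(n_vis) < sigmoid(a + W @ h)).astype(float)  # sample v ~ p(v|h)
fantasy_particle = sigmoid(a + W @ h)            # optional final mean-field step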
def compute_fantasy_particles(model,num_fantasy,num_steps,mean_field=True):
"""
Draws samples from the model using Markov chain Monte Carlo (Gibbs sampling).
Starts from randomly initialized points.
Args:
model: a model
num_fantasy (int): the number of fantasy particles to generate
num_steps (int): the number of update steps
mean_field (bool; optional): run a final mean field step to compute probabilities
Returns:
tensor of shape (num_samples, num_visible_units)
"""
schedule = schedules.Linear(initial=1.0, delta = 1 / (num_steps-1))
fantasy = samplers.SequentialMC.generate_fantasy_state(model,
num_fantasy,
num_steps,
schedule=schedule,
beta_std=0.0,
beta_momentum=0.0)
if mean_field:
fantasy = model.mean_field_iteration(1, fantasy)
fantasy_particles = fantasy.get_visible()
return be.to_numpy_array(fantasy_particles)
examples = data.get(mode='validate') # shape (batch_size, 784)
data.reset_generator(mode='validate') # reset the generator to the beginning of the validation set
hopfield_fantasy = compute_fantasy_particles(hopfield, num_to_plot, 100, mean_field=False)
rbm_fantasy = compute_fantasy_particles(rbm, num_to_plot, 100, mean_field=False)
rbm_L1_fantasy = compute_fantasy_particles(rbm_L1, num_to_plot, 100, mean_field=False)
dbm_fantasy = compute_fantasy_particles(dbm, num_to_plot, 100, mean_field=False)
fantasy_plot = plot_image_grid(
np.array([hopfield_fantasy,
rbm_fantasy,
rbm_L1_fantasy,
dbm_fantasy]),
image_shape, vmin=0, vmax=1, row_titles=["Hopfield", "RBM", "RBM (L1)", "DBM"])
One can use generative models to reduce the noise in images (de-noising). Let us randomly flip a fraction, fraction_to_flip, of the black & white bits in the validation data, and use the models defined above to reconstruct (de-noise) the digit images:
##### denoise MNIST images
# get validation data
examples = data.get(mode='validate') # shape (batch_size, 784)
# reset data generator to beginning of the validation set
data.reset_generator(mode='validate')
# add some noise to the examples by randomly flipping some pixels 0 -> 1 and 1 -> 0
fraction_to_flip=0.15
# create flipping mask
flip_mask=be.rand_like(examples) < fraction_to_flip
# compute noisy data
noisy_data=(1-flip_mask) * examples + flip_mask * (1 - examples)
# define number of digits to display
num_to_display=8
# compute de-noised images
hopfield_denoised=compute_reconstructions(hopfield,noisy_data[:num_to_display])
rbm_denoised=compute_reconstructions(rbm,noisy_data[:num_to_display])
rbm_L1_denoised=compute_reconstructions(rbm_L1,noisy_data[:num_to_display])
dbm_denoised=compute_reconstructions(dbm,noisy_data[:num_to_display])
denoising_plot = plot_image_grid(
np.array([examples[:num_to_plot],
noisy_data[:num_to_plot],
hopfield_denoised,
rbm_denoised,
rbm_L1_denoised,
dbm_denoised]),
image_shape, vmin=0, vmax=1, row_titles=["Data", "Noisy", "Hopfield", "RBM", "RBM (L1)", "DBM"])
Let us now open up the black box of our generative models. Below, we visualise the features learned by each model, as encoded in its weights.
# plot the weights of the hopfield model
hopfield_weights = plot_image_grid(
be.reshape(hopfield.connections[0].weights.W(trans=True)[:25], (5,5,784)),
image_shape,
vmin=be.tmin(hopfield.connections[0].weights.W()),
vmax=be.tmax(hopfield.connections[0].weights.W()),
cmap=cm.seismic)
# plot the weights of the RBM
rbm_weights = plot_image_grid(
be.reshape(rbm.connections[0].weights.W(trans=True)[:25], (5,5,784)),
image_shape,
vmin=be.tmin(rbm.connections[0].weights.W()),
vmax=be.tmax(rbm.connections[0].weights.W()),
cmap=cm.seismic)
# plot the weights of the L1 regularized RBM
rbmL1_weights = plot_image_grid(
be.reshape(rbm_L1.connections[0].weights.W(trans=True)[:25], (5,5,784)),
image_shape,
vmin=be.tmin(rbm_L1.connections[0].weights.W()),
vmax=be.tmax(rbm_L1.connections[0].weights.W()),
cmap=cm.seismic)
# plot the weights of the first layer of the dbm
dbm_weights = plot_image_grid(
be.reshape(dbm.connections[0].weights.W(trans=True)[:25], (5,5,784)),
image_shape,
vmin=be.tmin(dbm.connections[0].weights.W()),
vmax=be.tmax(dbm.connections[0].weights.W()),
cmap=cm.seismic)
data.close() # close the HDF5 store with the MNIST dataset