The goal of this notebook is to introduce deep neural networks (DNNs) using the high-level Keras package. The reader will become familiar with how to choose an architecture, cost function, and optimizer in Keras. We will also learn how to train neural networks.
We will once again work with the MNIST dataset of handwritten digits introduced in Notebook 7: Logistic Regression (MNIST). The goal is to find a statistical model which recognizes and distinguishes between the ten handwritten digits (0-9).
The MNIST dataset comprises $70000$ handwritten digits, each of which comes as a square image divided into a $28\times 28$ pixel grid. Every pixel can take on $256$ shades of gray, interpolating between white and black, and hence each pixel assumes a value in the set $\{0,1,\dots,255\}$. Since there are $10$ categories in the problem, corresponding to the ten digits, this problem represents a generic classification task.
In this notebook, we show how to use the Keras Python package to tackle the MNIST problem with the help of deep neural networks.
The following code is a slight modification of a Keras tutorial, see https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py. We invite the reader to read Sec. IX of the review to acquire a broad understanding of what the separate parts of the code do.
from __future__ import print_function
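import os
# allow duplicate OpenMP runtimes (avoids a crash on some systems)
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
# suppress tensorflow compilation warnings; set before tensorflow is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
import keras, sklearn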
import numpy as np
seed=0
np.random.seed(seed) # fix random seed
tf.set_random_seed(seed) # TensorFlow 1.x API; in TensorFlow 2.x use tf.random.set_seed(seed)
import matplotlib.pyplot as plt
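Constructing a deep neural network to solve ML problems is a multi-stage process. Quite generally, one can identify the key steps as follows:
1. Load and process the data.
2. Define the model and its architecture.
3. Choose the optimizer and the cost function.
4. Train the model.
5. Evaluate the model performance on the unseen test data.
6. Modify the hyperparameters to optimize performance for the specific data set.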
We would like to emphasize that, while it is always possible to view steps 1-5 as independent of the particular task we are trying to solve, it is only when they are put together in step 6 that the real gain of using Deep Learning is revealed, compared to less sophisticated methods such as the regression models or bagging, described in Secs. VII and VIII of the review. With this remark in mind, we shall focus predominantly on steps 1-5 below. We show how one can use grid search methods to find optimal hyperparameters in step 6.
Keras can conveniently download the MNIST data from the web. All we need to do is import the mnist module and call its load_data() function, and it will create the training and test data sets for us.
The MNIST set has pre-defined test and training sets, in order to facilitate the comparison of the performance of different models on the data.
Once we have loaded the data, we need to format it in the correct shape. This differs from one package to the other and, as we see in the case of Keras, it can even be different depending on the backend used.
While choosing the correct datatype can help improve the computational speed, we emphasize the rescaling step, which is necessary to avoid large variations in the minimal and maximal possible values of each feature. In other words, we want to make sure a feature is not being over-represented just because it is "large".
Last, we cast the label vectors $y$ to binary class matrices (a.k.a. one-hot format), as explained in Sec. VII on SoftMax regression.
from keras.datasets import mnist
# input image dimensions
num_classes = 10 # 10 digits
img_rows, img_cols = 28, 28 # number of pixels
# the data, shuffled and split between train and test sets
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# flatten each 28x28 image into a 1d vector of 784 pixels
X_train = X_train.reshape(X_train.shape[0], img_rows*img_cols)
X_test = X_test.reshape(X_test.shape[0], img_rows*img_cols)
# cast the integer pixel values to single-precision floats
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# rescale data in interval [0,1]
X_train /= 255
X_test /= 255
# look at an example of data point
print('an example of a data point with label', Y_train[20])
plt.matshow(X_train[20,:].reshape(28,28),cmap='binary')
plt.show()
# convert class vectors to binary class matrices
Y_train = keras.utils.to_categorical(Y_train, num_classes)
Y_test = keras.utils.to_categorical(Y_test, num_classes)
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print()
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
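The rescaling and one-hot steps above can also be written out by hand. The following is a minimal NumPy sketch on a tiny made-up batch; the arrays pixels and labels and the helper one_hot are hypothetical names introduced only for illustration. It is meant to mirror what the division by 255 and keras.utils.to_categorical do, not to replace them.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype='float32')
    # put a 1 in the column given by each label
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# hypothetical toy data: 3 "images" of 4 pixels each, stored as 8-bit integers
pixels = np.array([[0, 51, 102, 255],
                   [255, 255, 0, 0],
                   [10, 20, 30, 40]], dtype='uint8')
labels = np.array([3, 0, 9])

# cast to single precision, then rescale to the interval [0, 1]
pixels_scaled = pixels.astype('float32') / 255

labels_one_hot = one_hot(labels, num_classes=10)
print(pixels_scaled.min(), pixels_scaled.max())
print(labels_one_hot)
```

Taking the argmax of each one-hot row recovers the original integer label, which is how predicted class probabilities are usually turned back into digits.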
We can now move on to construct our deep neural net. We shall use Keras' Sequential() class to instantiate a model, and will add different deep layers one by one.
At this stage, we refrain from using convolutional layers. This is done further below.
Let us create an instance of Keras' Sequential() class, called model. As the name suggests, this class allows us to build DNNs layer by layer. We use the add() method to attach layers to our model. For the purposes of our introductory example, it suffices to focus on Dense layers. Every Dense() layer accepts as its first required argument an integer which specifies the number of neurons. The type of activation function for the layer is set with the optional activation argument, which takes the name of the activation function as a string. Examples include relu, tanh, elu, sigmoid, and softmax.
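For orientation, the activation functions named above can be sketched in a few lines of NumPy. These are textbook definitions written for illustration, not Keras' own implementations; the value alpha=1.0 for elu and the test vector z are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # exponential branch for z <= 0; np.minimum keeps exp() from overflowing
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the maximum for numerical stability; the result sums to 1
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
for name, f in [('relu', relu), ('tanh', np.tanh), ('elu', elu),
                ('sigmoid', sigmoid), ('softmax', softmax)]:
    print(name, f(z))
```

Unlike the other four, softmax acts on the whole vector at once, which is why it is reserved for the output layer, where the entries are read as class probabilities.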
In order for our DNN to work properly, we have to make sure that the numbers of input and output neurons for each layer match. Therefore, we specify the shape of the input in the first layer of the model explicitly using the optional argument input_shape=(N_features,). The sequential construction of the model then allows Keras to infer the correct input/output dimensions of all hidden layers automatically. Hence, we only need to specify the size of the softmax output layer to match the number of categories.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
def create_DNN():
    # instantiate model
    model = Sequential()
    # add a dense all-to-all relu layer
    model.add(Dense(400, input_shape=(img_rows*img_cols,), activation='relu'))
    # add a dense all-to-all relu layer
    model.add(Dense(100, activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # soft-max layer
    model.add(Dense(num_classes, activation='softmax'))
    return model

print('Model architecture created successfully!')
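The layer sizes chosen in create_DNN() fix the number of trainable parameters. As a rough sanity check, the sketch below counts weights and biases for each Dense layer; the helper dense_params is a hypothetical name used only here. The Dropout layer has no trainable parameters and is therefore left out.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# input, two hidden layers, softmax output (as in create_DNN)
layer_sizes = [28 * 28, 400, 100, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The numbers can be compared with the table printed by create_DNN().summary().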
Next, we choose the loss function according to which to train the DNN. For classification problems, this is the cross entropy, and since the output data was cast in categorical form, we choose the categorical_crossentropy defined in Keras' losses module. Depending on the problem of interest, one can pick any other suitable loss function. To optimize the weights of the net, we need an optimizer from Keras' optimizers module. A natural first choice is SGD(); the code below uses Adam() as its default, and any other built-in optimizer can be passed instead. The parameters of the optimizer, such as lr (learning rate) or momentum, are passed using the corresponding optional arguments of the SGD() function. All available arguments can be found in Keras' online documentation at https://keras.io/. While the loss function and the optimizer are essential for the training procedure, to test the performance of the model one may want to look at a particular metric of performance. For instance, in categorical tasks one typically looks at the accuracy, which is defined as the fraction of correctly classified data points. To complete the definition of our model, we use the compile() method, with optional arguments for the optimizer, loss, and the validation metric as follows:
def compile_model(optimizer=None):
    # create a fresh default optimizer per model rather than sharing one instance
    if optimizer is None:
        optimizer = keras.optimizers.Adam()
    # create the model
    model = create_DNN()
    # compile the model
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

print('Model compiled successfully and ready to be trained.')
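To make the loss and the metric concrete, here is a minimal NumPy sketch of the categorical cross entropy and the accuracy for one-hot labels. The toy arrays y_true and y_pred are made up, and the clipping constant eps is an assumption added for numerical safety; Keras' internal implementation may differ in such details.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# hypothetical toy example: 2 samples, 3 classes
y_true = np.array([[0., 1., 0.],
                   [1., 0., 0.]])
y_pred = np.array([[0.1, 0.8, 0.1],   # confident and correct
                   [0.3, 0.6, 0.1]])  # wrong: class 1 predicted, class 0 true

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

Only the probability assigned to the true class enters the loss, so the misclassified second sample contributes the larger term.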
We train our DNN in minibatches, the advantages of which were explained in Sec. IV.
Shuffling the training data during training improves the stability of the model. We therefore train over a number of epochs; by default, fit() reshuffles the training data before each epoch.
Training the DNN is a one-liner using the fit() method of the Sequential class. The first two required arguments are the training input and output data. As optional arguments, we specify the mini-batch size (batch_size), the number of training epochs (epochs), and the test or validation data (validation_data). To monitor the training procedure for every epoch, we set verbose=1.
# training parameters
batch_size = 64
epochs = 10
# create the deep neural net
model_DNN=compile_model()
# train DNN and store training info in history
history = model_DNN.fit(X_train, Y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(X_test, Y_test))
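What one epoch of shuffled minibatch training iterates over can be sketched as follows. The generator minibatch_indices is a hypothetical helper, not part of Keras, and the sketch only produces the index batches; the gradient update that fit() performs on each batch is omitted. The sample count 60000 is the size of the MNIST training set.

```python
import math
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield index arrays covering a freshly shuffled dataset once (one epoch)."""
    permutation = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield permutation[start:start + batch_size]

n_samples, batch_size = 60000, 64
rng = np.random.RandomState(0)

batches = list(minibatch_indices(n_samples, batch_size, rng))
steps_per_epoch = math.ceil(n_samples / batch_size)
print(len(batches), steps_per_epoch, len(batches[-1]))
```

Since 60000 is not a multiple of 64, the last batch of each epoch is smaller than the others.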
Next, we evaluate the model and read off the loss on the test data, and its accuracy, using the evaluate() method.
# evaluate model
score = model_DNN.evaluate(X_test, Y_test, verbose=1)
# print performance
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# look into training history
# summarize history for accuracy (newer Keras versions name these keys 'accuracy' and 'val_accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.ylabel('model accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='best')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('model loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='best')
plt.show()
Last, we show how to use the grid search option of scikit-learn to optimize the hyperparameters of our model. An excellent blog post on this by Jason Brownlee can be found at https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/.
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
# call Keras scikit wrapper
model_gridsearch = KerasClassifier(build_fn=compile_model,
                                   epochs=1,
                                   batch_size=batch_size,
                                   verbose=1)
# list of allowed optional arguments for the optimizer, see `compile_model()`
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
# define parameter dictionary
param_grid = dict(optimizer=optimizer)
# call scikit grid search module
grid = GridSearchCV(estimator=model_gridsearch, param_grid=param_grid, n_jobs=1, cv=4)
grid_result = grid.fit(X_train,Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
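Grid search with cross-validation is expensive, because every candidate is trained once per fold. The sketch below only counts the trainings triggered by the settings above; it uses itertools rather than scikit-learn, and the variable names are chosen for illustration.

```python
from itertools import product

param_grid = {'optimizer': ['SGD', 'RMSprop', 'Adagrad', 'Adadelta',
                            'Adam', 'Adamax', 'Nadam']}
cv = 4            # number of cross-validation folds
n_train = 60000   # size of the MNIST training set

# every combination of hyperparameter values (here there is only one hyperparameter)
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

n_cv_fits = len(candidates) * cv
# GridSearchCV refits the best candidate on the full training set by default (refit=True)
n_total_fits = n_cv_fits + 1
# each fold holds out 1/cv of the training data for validation
fold_validation_size = n_train // cv

print(len(candidates), n_cv_fits, n_total_fits, fold_validation_size)
```

Adding a second hyperparameter with, say, three values would triple the number of candidates, which is why the example above keeps epochs=1.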
We have so far considered each MNIST data sample as a 1d vector of length $28\times 28$. This approach neglects any spatial structure in the image. On the other hand, we do know that in every one of the handwritten digits there are local spatial correlations between the pixels, which we would like to take advantage of to improve the accuracy of our classification model. To this end, we first need to reshape the training and test input data as follows:
# reshape data, depending on Keras backend
if keras.backend.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print()
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
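The two branches of the reshape differ only in where the channel axis is placed. A small NumPy sketch on a made-up batch of two tiny 2x3 "images" (all names and sizes here are illustrative assumptions) shows the resulting shapes:

```python
import numpy as np

img_rows, img_cols = 2, 3
# hypothetical batch of 2 flattened images with 6 pixels each
flat = np.arange(2 * img_rows * img_cols, dtype='float32').reshape(2, img_rows * img_cols)

# channel axis right after the sample axis
channels_first = flat.reshape(flat.shape[0], 1, img_rows, img_cols)
# channel axis last
channels_last = flat.reshape(flat.shape[0], img_rows, img_cols, 1)

print(channels_first.shape, channels_last.shape)

# with a single channel both layouts hold the same pixels in the same order
same = np.array_equal(channels_first[:, 0, :, :], channels_last[:, :, :, 0])
print(same)
```

For grayscale images the choice affects only the shape expected by the first convolutional layer, which is why input_shape is set in the same branch as the reshape.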
One can ask whether a neural net can learn to recognize such local patterns. As we saw in Sec. X of the review, this can be achieved by using convolutional layers. Luckily, all we need to do is change the architecture of our DNN, i.e. introduce small changes to the function create_DNN(). We can also merge Step 2 and Step 3 for convenience:
def create_CNN():
    # instantiate model
    model = Sequential()
    # add first convolutional layer with 10 filters (dimensionality of output space)
    model.add(Conv2D(10, kernel_size=(5, 5),
                     activation='relu',
                     input_shape=input_shape))
    # add 2D pooling layer
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # add second convolutional layer with 20 filters
    model.add(Conv2D(20, (5, 5), activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # add 2D pooling layer
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # flatten data
    model.add(Flatten())
    # add a dense all-to-all relu layer
    model.add(Dense(20*4*4, activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # soft-max layer
    model.add(Dense(num_classes, activation='softmax'))
    # compile the model
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='Adam',
                  metrics=['accuracy'])
    return model
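The size 20*4*4 of the dense layer in create_CNN() matches the number of values left after the second pooling layer. The arithmetic can be traced with a short sketch, assuming the Keras defaults used above: no padding and stride 1 for Conv2D, and a pooling stride equal to the pool size. The helpers conv_out and pool_out are hypothetical names.

```python
def conv_out(size, kernel):
    """Output length of a 'valid' (no padding), stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output length of non-overlapping pooling (stride equal to the pool size)."""
    return size // pool

sizes = [28]                          # input images are 28x28
sizes.append(conv_out(sizes[-1], 5))  # first Conv2D, 5x5 kernel
sizes.append(pool_out(sizes[-1], 2))  # first MaxPooling2D, 2x2
sizes.append(conv_out(sizes[-1], 5))  # second Conv2D, 5x5 kernel
sizes.append(pool_out(sizes[-1], 2))  # second MaxPooling2D, 2x2

n_filters = 20                        # filters in the second Conv2D layer
flattened = n_filters * sizes[-1] * sizes[-1]
print(sizes, flattened)
```

Dropout and Flatten do not change the number of values, so the Flatten layer hands a vector of this length to the dense layer.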
Training the deep conv net (Step 4) and evaluating its performance (Step 5) proceed exactly as before:
# training parameters
batch_size = 64
epochs = 10
# create the deep conv net
model_CNN=create_CNN()
# train CNN
model_CNN.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=1,
              validation_data=(X_test, Y_test))
# evaluate model
score = model_CNN.evaluate(X_test, Y_test, verbose=1)
# print performance
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])