# Notebook 11: Introduction to Deep Neural Networks with Keras

## Learning Goals

The goal of this notebook is to introduce deep neural networks (DNNs) using the high-level Keras package. The reader will become familiar with how to choose an architecture, cost function, and optimizer in Keras. We will also learn how to train neural networks.

# MNIST with Keras

We will once again work with the MNIST dataset of handwritten digits introduced in Notebook 7: Logistic Regression (MNIST). The goal is to find a statistical model which recognizes and distinguishes between the ten handwritten digits (0-9).

The MNIST dataset comprises $70000$ handwritten digits, each of which comes in a square image, divided into a $28\times 28$ pixel grid. Every pixel can take on $256$ shades of gray, interpolating between white and black, and hence each pixel value (feature) lies in the set $\{0,1,\dots,255\}$. Since there are $10$ categories in the problem, corresponding to the ten digits, this problem represents a generic classification task.

In this notebook, we show how to use the Keras Python package to tackle the MNIST problem with the help of deep neural networks.

The following code is a slight modification of a Keras tutorial, see https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py. We invite the reader to read Sec. IX of the review to acquire a broad understanding of what the separate parts of the code do.

In [1]:
from __future__ import print_function
import os
# tolerate duplicate OpenMP runtime libraries
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
# suppress tensorflow compilation warnings (must be set before tensorflow is imported)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import keras, sklearn
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

seed = 0
np.random.seed(seed)      # fix NumPy random seed
tf.set_random_seed(seed)  # fix TensorFlow random seed (TensorFlow 1.x API)

Using TensorFlow backend.


## Structure of the Procedure

Constructing a Deep Neural Network to solve ML problems is a multiple-stage process. Quite generally, one can identify the key steps as follows:

• step 1: Load and process the data
• step 2: Define the model and its architecture
• step 3: Choose the optimizer and the cost function
• step 4: Train the model
• step 5: Evaluate the model performance on the unseen test data
• step 6: Modify the hyperparameters to optimize performance for the specific data set

We would like to emphasize that, while it is always possible to view steps 1-5 as independent of the particular task we are trying to solve, it is only when they are put together in step 6 that the real gain of using Deep Learning is revealed, compared to less sophisticated methods such as the regression models or bagging, described in Secs. VII and VIII of the review. With this remark in mind, we shall focus predominantly on steps 1-5 below. We show how one can use grid search methods to find optimal hyperparameters in step 6.

### Step 1: Load and Process the Data

Keras can conveniently download the MNIST data from the web. All we need to do is import the mnist module and call the load_data() function, and it will create the training and test data sets for us.

The MNIST set has pre-defined test and training sets, in order to facilitate the comparison of the performance of different models on the data.

Once we have loaded the data, we need to format it in the correct shape. This differs from one package to the other and, as we see in the case of Keras, it can even be different depending on the backend used.

While choosing the correct datatype can help improve the computational speed, we emphasize the rescaling step, which is necessary to avoid large variations in the minimal and maximal possible values of each feature. In other words, we want to make sure a feature is not being over-represented just because it is "large".

Last, we cast the label vectors $y$ to binary class matrices (a.k.a. one-hot format), as explained in Sec. VII on SoftMax regression.
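To make the one-hot format concrete, here is a minimal NumPy sketch of the conversion. The helper `one_hot` is a hypothetical name introduced for illustration only; in the notebook itself the conversion is done by `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (n_samples, num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    # put a 1 in the column given by each label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# three example labels, ten classes as in MNIST
example = one_hot([4, 0, 9], num_classes=10)
print(example)
```

Each row contains exactly one nonzero entry, and taking the `argmax` of a row recovers the original integer label.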

In [2]:
from keras.datasets import mnist

num_classes = 10 # 10 digits

# input image dimensions
img_rows, img_cols = 28, 28 # number of pixels

# the data, shuffled and split between train and test sets
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# flatten each 28x28 image into a vector of 784 features
X_train = X_train.reshape(X_train.shape[0], img_rows*img_cols)
X_test = X_test.reshape(X_test.shape[0], img_rows*img_cols)

# cast to single-precision floats
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# rescale data to the interval [0,1]
X_train /= 255
X_test /= 255

# look at an example data point
print('an example of a data point with label', Y_train[20])
plt.matshow(X_train[20,:].reshape(28,28),cmap='binary')
plt.show()

# convert class vectors to binary class matrices
Y_train = keras.utils.to_categorical(Y_train, num_classes)
Y_test = keras.utils.to_categorical(Y_test, num_classes)

print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print()
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 2s 0us/step
an example of a data point with label 4

X_train shape: (60000, 784)
Y_train shape: (60000, 10)

60000 train samples
10000 test samples


### Step 2: Define the Neural Net and its Architecture

We can now move on to construct our deep neural net. We shall use Keras's Sequential() class to instantiate a model, and will add different deep layers one by one.

At this stage, we refrain from using convolutional layers. This is done further below.

Let us create an instance of Keras' Sequential() class, called model. As the name suggests, this class allows us to build DNNs layer by layer. We use the add() method to attach layers to our model. For the purposes of our introductory example, it suffices to focus on Dense layers. Every Dense() layer accepts as its first required argument an integer which specifies the number of neurons. The type of activation function for the layer is defined using the optional activation argument, which takes the name of the activation function as a string. Examples include relu, tanh, elu, sigmoid, and softmax.
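As a rough illustration of two of the activation functions named above, the following NumPy sketch implements relu and softmax by hand. The function names and the test vector are chosen here for illustration; Keras provides its own implementations.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative inputs are set to zero."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Map each row of z to a probability distribution."""
    # subtracting the row maximum avoids overflow in exp without changing the result
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([[-1.0, 0.0, 2.0]])
hidden = relu(z)
probs = softmax(z)
print(hidden, probs)
```

The softmax output sums to one along each row, which is why it is the natural choice for the output layer of a classifier.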

In order for our DNN to work properly, we have to make sure that the numbers of input and output neurons for each layer match. Therefore, we specify the shape of the input in the first layer of the model explicitly using the optional argument input_shape=(N_features,). The sequential construction of the model then allows Keras to infer the correct input/output dimensions of all hidden layers automatically. Hence, we only need to specify the size of the softmax output layer to match the number of categories.

In [3]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

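def create_DNN():
    # instantiate model
    model = Sequential()
    # add a dense all-to-all relu layer
    # (the width of 400 neurons is an assumed value; the input is the flattened image)
    model.add(Dense(400, input_shape=(img_rows*img_cols,), activation='relu'))
    # add a dense all-to-all relu layer (the width of 100 neurons is an assumed value)
    model.add(Dense(100, activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # soft-max layer, one output neuron per class
    model.add(Dense(num_classes, activation='softmax'))

    return model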

print('Model architecture created successfully!')

Model architecture created successfully!
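To see how the input and output dimensions of consecutive Dense layers chain together, one can count the trainable parameters by hand. The sketch below does this in plain Python; the hidden-layer widths 400 and 100 are assumptions made for illustration, while 784 inputs and 10 outputs follow from the data.

```python
def dense_param_counts(layer_sizes):
    """Number of weights plus biases for each fully connected layer."""
    counts = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # n_in * n_out weights and one bias per output neuron
        counts.append(n_in * n_out + n_out)
    return counts

# 784 input features, two hidden layers (assumed widths), 10 output classes
layer_sizes = [784, 400, 100, 10]
counts = dense_param_counts(layer_sizes)
total = sum(counts)
print(counts, total)
```

The output size of one layer is the input size of the next, which is why only the first layer needs an explicit input shape.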


### Step 3: Choose the Optimizer and the Cost Function

Next, we choose the loss function according to which to train the DNN. For classification problems, this is the cross entropy, and since the output data was cast in categorical form, we choose the categorical_crossentropy defined in Keras' losses module. Depending on the problem of interest, one can pick any other suitable loss function.

To optimize the weights of the net, one can use SGD, which is available in Keras' optimizers module, or Adam() or any other built-in optimizer; the compile_model() function below defaults to Adam(). The parameters of the optimizer, such as lr (learning rate) or momentum, are passed using the corresponding optional arguments of the SGD() function. All available arguments can be found in Keras' online documentation at https://keras.io/.

While the loss function and the optimizer are essential for the training procedure, to test the performance of the model one may want to look at a particular metric of performance. For instance, in categorical tasks one typically looks at the accuracy, which is defined as the fraction of correctly classified data points.

To complete the definition of our model, we use the compile() method, with optional arguments for the optimizer, loss, and the validation metric as follows:

In [4]:
def compile_model(optimizer=keras.optimizers.Adam()):
    # create the model
    model = create_DNN()
    # compile the model
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

print('Model compiled successfully and ready to be trained.')

Model compiled successfully and ready to be trained.
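The loss and the metric chosen in Step 3 can be written out in a few lines of NumPy. This is a sketch of the standard definitions for one-hot labels, not Keras' internal code; the function names and the two-sample example are introduced here for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross entropy between one-hot labels and predicted probabilities."""
    # clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class equals the true class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# two samples, three classes
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.3, 0.4, 0.3]])
loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

Both samples are classified correctly, yet the loss is nonzero: the cross entropy also penalizes low confidence in the correct class, which makes it a smoother training signal than the accuracy.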


### Step 4: Train the Model

We train our DNN in minibatches, the advantages of which were explained in Sec. IV.

Shuffling the training data between epochs improves the stability of training, and the fit() method does this by default. We train over a number of epochs, where one epoch is a complete pass through the training set.
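The interplay of minibatches, shuffling, and epochs can be sketched with NumPy index arrays. The generator `iterate_minibatches` and the toy sizes are hypothetical and only illustrate the bookkeeping; Keras handles all of this inside fit().

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a freshly shuffled data set once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.RandomState(0)
n_samples, batch_size, epochs = 1000, 64, 3  # toy sizes for illustration
batches_per_epoch = []
for epoch in range(epochs):
    batches = list(iterate_minibatches(n_samples, batch_size, rng))
    batches_per_epoch.append(len(batches))
    # a gradient step on X[batch], Y[batch] would be taken here for each batch

print(batches_per_epoch)
```

Every sample is visited exactly once per epoch, and the last minibatch is smaller whenever the batch size does not divide the number of samples.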

Training the DNN is a one-liner using the fit() method of the Sequential class. The first two required arguments are the training input and output data. As optional arguments, we specify the batch_size, the number of training epochs, and the validation_data (here, the test set). To monitor the training procedure for every epoch, we set verbose=1.

In [5]:
# training parameters
batch_size = 64
epochs = 10

# create the deep neural net
model_DNN=compile_model()

# train DNN and store training info in history
history = model_DNN.fit(X_train, Y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 12s 207us/step - loss: 0.3138 - acc: 0.9072 - val_loss: 0.1330 - val_acc: 0.9578
Epoch 2/10
60000/60000 [==============================] - 12s 192us/step - loss: 0.1285 - acc: 0.9627 - val_loss: 0.0942 - val_acc: 0.9711
Epoch 3/10
60000/60000 [==============================] - 12s 193us/step - loss: 0.0889 - acc: 0.9738 - val_loss: 0.0772 - val_acc: 0.9765
Epoch 4/10
60000/60000 [==============================] - 12s 194us/step - loss: 0.0696 - acc: 0.9793 - val_loss: 0.0737 - val_acc: 0.9787
Epoch 5/10
60000/60000 [==============================] - 12s 195us/step - loss: 0.0571 - acc: 0.9828 - val_loss: 0.0745 - val_acc: 0.9776
Epoch 6/10
60000/60000 [==============================] - 12s 198us/step - loss: 0.0452 - acc: 0.9858 - val_loss: 0.0799 - val_acc: 0.9782
Epoch 7/10
60000/60000 [==============================] - 12s 195us/step - loss: 0.0382 - acc: 0.9878 - val_loss: 0.0839 - val_acc: 0.9779
Epoch 8/10
60000/60000 [==============================] - 12s 197us/step - loss: 0.0337 - acc: 0.9892 - val_loss: 0.0781 - val_acc: 0.9792
Epoch 9/10
60000/60000 [==============================] - 12s 201us/step - loss: 0.0295 - acc: 0.9909 - val_loss: 0.0755 - val_acc: 0.9818
Epoch 10/10
60000/60000 [==============================] - 12s 199us/step - loss: 0.0257 - acc: 0.9921 - val_loss: 0.0863 - val_acc: 0.9792


### Step 5: Evaluate the Model Performance on the Unseen Test Data

Next, we evaluate the model and read off the loss and the accuracy on the test data using the evaluate() method.

In [6]:
# evaluate model
score = model_DNN.evaluate(X_test, Y_test, verbose=1)

# print performance
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# look into training history

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.ylabel('model accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='best')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('model loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='best')
plt.show()

10000/10000 [==============================] - 0s 36us/step

Test loss: 0.07659573335081113
Test accuracy: 0.9816
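Besides plotting, the training history can be inspected numerically. The sketch below copies the loss and val_loss values from the training log printed in Step 4 into plain lists and locates the epoch with the lowest validation loss; with a live `history` object one would presumably read the same numbers from `history.history`.

```python
# values copied from the training log of Step 4
train_loss = [0.3138, 0.1285, 0.0889, 0.0696, 0.0571,
              0.0452, 0.0382, 0.0337, 0.0295, 0.0257]
val_loss = [0.1330, 0.0942, 0.0772, 0.0737, 0.0745,
            0.0799, 0.0839, 0.0781, 0.0755, 0.0863]

# epoch (counted from 1) at which the validation loss is smallest
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# gap between validation and training loss at the last epoch
final_gap = val_loss[-1] - train_loss[-1]

print('lowest validation loss at epoch', best_epoch)
print('final validation/training gap:', round(final_gap, 4))
```

The training loss keeps decreasing while the validation loss stops improving after a few epochs, which is the usual signature of the onset of overfitting.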


### Step 6: Modify the Hyperparameters to Optimize Performance of the Model

Last, we show how to use the grid search option of scikit-learn to optimize the hyperparameters of our model. An excellent blog post on this by Jason Brownlee can be found at https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/.

In [7]:
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

# call Keras scikit wrapper
model_gridsearch = KerasClassifier(build_fn=compile_model,
                                   epochs=1,
                                   batch_size=batch_size,
                                   verbose=1)

# list of allowed optional arguments for the optimizer, see compile_model()
# (assumed list: these are the names that appear in the results summary printed below)
optimizer = ['SGD', 'RMSprop', 'Adam', 'Adamax', 'Nadam']
# define parameter dictionary
param_grid = dict(optimizer=optimizer)
# call scikit grid search module
grid = GridSearchCV(estimator=model_gridsearch, param_grid=param_grid, n_jobs=1, cv=4)
grid_result = grid.fit(X_train, Y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Epoch 1/1
45000/45000 [==============================] - 5s 118us/step - loss: 1.1397 - acc: 0.6571
15000/15000 [==============================] - 1s 38us/step
45000/45000 [==============================] - 1s 23us/step
Epoch 1/1
45000/45000 [==============================] - 5s 117us/step - loss: 1.1162 - acc: 0.6668
15000/15000 [==============================] - 1s 39us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 5s 117us/step - loss: 1.1128 - acc: 0.6707
15000/15000 [==============================] - 1s 38us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 5s 120us/step - loss: 1.1454 - acc: 0.6564
15000/15000 [==============================] - 1s 42us/step
45000/45000 [==============================] - 1s 24us/step
Epoch 1/1
45000/45000 [==============================] - 8s 177us/step - loss: 0.3399 - acc: 0.8988
15000/15000 [==============================] - 1s 42us/step
45000/45000 [==============================] - 1s 25us/step
Epoch 1/1
45000/45000 [==============================] - 8s 177us/step - loss: 0.3430 - acc: 0.8996
15000/15000 [==============================] - 1s 44us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 8s 183us/step - loss: 0.3414 - acc: 0.8996
15000/15000 [==============================] - 1s 45us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 8s 180us/step - loss: 0.3402 - acc: 0.8990
15000/15000 [==============================] - 1s 45us/step
45000/45000 [==============================] - 1s 28us/step
Epoch 1/1
45000/45000 [==============================] - 8s 169us/step - loss: 0.3175 - acc: 0.9076
15000/15000 [==============================] - 1s 47us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 8s 169us/step - loss: 0.3346 - acc: 0.9031
15000/15000 [==============================] - 1s 51us/step
45000/45000 [==============================] - 1s 27us/step
Epoch 1/1
45000/45000 [==============================] - 8s 170us/step - loss: 0.3227 - acc: 0.9070
15000/15000 [==============================] - 1s 47us/step
45000/45000 [==============================] - 1s 27us/step
Epoch 1/1
45000/45000 [==============================] - 8s 171us/step - loss: 0.3497 - acc: 0.8977
15000/15000 [==============================] - 1s 52us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 10s 231us/step - loss: 0.3711 - acc: 0.8901
15000/15000 [==============================] - 1s 52us/step
45000/45000 [==============================] - 1s 26us/step
Epoch 1/1
45000/45000 [==============================] - 10s 225us/step - loss: 0.3731 - acc: 0.8888
15000/15000 [==============================] - 1s 45us/step
45000/45000 [==============================] - 1s 22us/step
Epoch 1/1
45000/45000 [==============================] - 10s 223us/step - loss: 0.3653 - acc: 0.8930
15000/15000 [==============================] - 1s 48us/step
45000/45000 [==============================] - 1s 23us/step
Epoch 1/1
45000/45000 [==============================] - 9s 203us/step - loss: 0.3721 - acc: 0.8891
15000/15000 [==============================] - 1s 44us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 8s 167us/step - loss: 0.3551 - acc: 0.8951
15000/15000 [==============================] - 1s 48us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 7s 166us/step - loss: 0.3561 - acc: 0.8954
15000/15000 [==============================] - 1s 48us/step
45000/45000 [==============================] - 1s 19us/step
Epoch 1/1
45000/45000 [==============================] - 7s 166us/step - loss: 0.3496 - acc: 0.8962
15000/15000 [==============================] - 1s 50us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 8s 171us/step - loss: 0.3621 - acc: 0.8937
15000/15000 [==============================] - 1s 52us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 6s 140us/step - loss: 0.4106 - acc: 0.8811
15000/15000 [==============================] - 1s 55us/step
45000/45000 [==============================] - 1s 22us/step
Epoch 1/1
45000/45000 [==============================] - 6s 143us/step - loss: 0.4125 - acc: 0.8780
15000/15000 [==============================] - 1s 56us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 6s 142us/step - loss: 0.3998 - acc: 0.8839
15000/15000 [==============================] - 1s 56us/step
45000/45000 [==============================] - 1s 21us/step
Epoch 1/1
45000/45000 [==============================] - 6s 142us/step - loss: 0.4139 - acc: 0.8779
15000/15000 [==============================] - 1s 62us/step
45000/45000 [==============================] - 1s 22us/step
Epoch 1/1
45000/45000 [==============================] - 9s 206us/step - loss: 0.2943 - acc: 0.9140
15000/15000 [==============================] - 1s 55us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 9s 206us/step - loss: 0.3047 - acc: 0.9111
15000/15000 [==============================] - 1s 61us/step
45000/45000 [==============================] - 1s 20us/step
Epoch 1/1
45000/45000 [==============================] - 9s 208us/step - loss: 0.2920 - acc: 0.9137
15000/15000 [==============================] - 1s 60us/step
45000/45000 [==============================] - 1s 21us/step
Epoch 1/1
45000/45000 [==============================] - 9s 209us/step - loss: 0.3015 - acc: 0.9112
15000/15000 [==============================] - 1s 62us/step
45000/45000 [==============================] - 1s 22us/step
Epoch 1/1
60000/60000 [==============================] - 12s 199us/step - loss: 0.2694 - acc: 0.9210
Best: 0.957033 using {'optimizer': 'Nadam'}
0.874133 (0.005897) with: {'optimizer': 'SGD'}
0.953867 (0.003900) with: {'optimizer': 'RMSprop'}
0.955000 (0.002867) with: {'optimizer': 'Adam'}
0.945917 (0.002984) with: {'optimizer': 'Adamax'}
0.957033 (0.001874) with: {'optimizer': 'Nadam'}
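The sample counts 45000 and 15000 in the log above are what a 4-fold split of the 60000 training samples produces. The following NumPy sketch shows a plain, unshuffled K-fold split; the generator `kfold_indices` is a hypothetical helper written for illustration and is a simplification of what scikit-learn does internally.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each fold serves as validation set once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train_idx, val_idx

splits = list(kfold_indices(60000, 4))
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in splits]
print(sizes)
```

For every hyperparameter value, the model is trained once per fold and the validation scores are averaged, which is where the mean and standard deviation in the summary come from.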


## Creating Convolutional Neural Nets with Keras

We have so far considered each MNIST data sample as a one-dimensional vector of length $28\times 28=784$. This approach neglects any spatial structure in the image. On the other hand, we do know that in every one of the handwritten digits there are local spatial correlations between the pixels, which we would like to take advantage of to improve the accuracy of our classification model. To this end, we first need to reshape the training and test input data as follows:

In [8]:
# reshape data, depending on Keras backend
if keras.backend.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print()
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

X_train shape: (60000, 28, 28, 1)
Y_train shape: (60000, 10)

60000 train samples
10000 test samples
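The reshape above only reinterprets the flattened vectors as images with one gray-scale channel; no pixel value changes. A small NumPy check on a toy array of three fake images illustrates this for both channel conventions (the array contents are arbitrary and chosen only for the demonstration).

```python
import numpy as np

# three fake "images", stored as flattened vectors of 784 features
flat = np.arange(3 * 28 * 28, dtype='float32').reshape(3, 28 * 28)

# the two layouts used by Keras backends
channels_last = flat.reshape(flat.shape[0], 28, 28, 1)
channels_first = flat.reshape(flat.shape[0], 1, 28, 28)

# pixel (row 5, column 7) of image 1 sits at position 5*28 + 7 in the flat vector
row, col = 5, 7
same_last = channels_last[1, row, col, 0] == flat[1, row * 28 + col]
same_first = channels_first[1, 0, row, col] == flat[1, row * 28 + col]
print(channels_last.shape, channels_first.shape, same_last, same_first)
```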


One can ask whether a neural net can learn to recognize such local patterns. As we saw in Sec. X of the review, this can be achieved by using convolutional layers. Luckily, all we need to do is change the architecture of our DNN, i.e. introduce small changes to the function create_DNN(). We can also merge Step 2 and Step 3 for convenience:

In [9]:
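def create_CNN():
    # instantiate model
    model = Sequential()
    # add first convolutional layer with 10 filters (dimensionality of output space)
    # (the 5x5 kernel size is an assumed value, chosen to match the second layer)
    model.add(Conv2D(10, kernel_size=(5, 5),
                     activation='relu',
                     input_shape=input_shape))
    # add 2D pooling layer (the 2x2 pool size is an assumed value)
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # add second convolutional layer with 20 filters
    model.add(Conv2D(20, (5, 5), activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # add 2D pooling layer (the 2x2 pool size is an assumed value)
    model.add(MaxPooling2D(pool_size=(2, 2)))
    # flatten data
    model.add(Flatten())
    # add a dense all-to-all relu layer (the width 20*4*4 is an assumed value)
    model.add(Dense(20*4*4, activation='relu'))
    # apply dropout with rate 0.5
    model.add(Dropout(0.5))
    # soft-max layer
    model.add(Dense(num_classes, activation='softmax'))

    # compile the model (the choice of Adam is assumed, following the default in compile_model())
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='Adam',
                  metrics=['accuracy'])

    return model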
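It can help to track how the image size shrinks as it passes through the convolutional and pooling layers. The sketch below does the arithmetic in plain Python, assuming convolutions without padding and with stride 1, and non-overlapping pooling. The 5x5 kernel and 20 filters of the second convolution come from the code above; the 5x5 kernel of the first convolution and the 2x2 pool size are assumptions.

```python
def conv_output_size(size, kernel):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1

def pool_output_size(size, pool):
    """Side length after non-overlapping pooling."""
    return size // pool

sizes = [28]                                   # input image is 28x28
sizes.append(conv_output_size(sizes[-1], 5))   # first convolution (assumed 5x5 kernel)
sizes.append(pool_output_size(sizes[-1], 2))   # first pooling (assumed 2x2)
sizes.append(conv_output_size(sizes[-1], 5))   # second convolution, 5x5 kernel
sizes.append(pool_output_size(sizes[-1], 2))   # second pooling (assumed 2x2)

# number of features after flattening the 20 feature maps of the second convolution
flattened = 20 * sizes[-1] * sizes[-1]
print(sizes, flattened)
```

Under these assumptions, the flattened vector fed into the dense layers has far fewer entries than the 784 raw pixels, while each entry summarizes a local patch of the image.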


Training the deep conv net (Step 4) and evaluating its performance (Step 5) proceed exactly as before:

In [10]:
# training parameters
batch_size = 64
epochs = 10

# create the deep conv net
model_CNN=create_CNN()

# train CNN
model_CNN.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=1,
              validation_data=(X_test, Y_test))

# evaluate model
score = model_CNN.evaluate(X_test, Y_test, verbose=1)

# print performance
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 39s 646us/step - loss: 0.2610 - acc: 0.9183 - val_loss: 0.0750 - val_acc: 0.9825
Epoch 2/10
60000/60000 [==============================] - 32s 533us/step - loss: 0.0932 - acc: 0.9709 - val_loss: 0.0551 - val_acc: 0.9876
Epoch 3/10
60000/60000 [==============================] - 32s 533us/step - loss: 0.0689 - acc: 0.9789 - val_loss: 0.0408 - val_acc: 0.9893
Epoch 4/10
60000/60000 [==============================] - 32s 533us/step - loss: 0.0622 - acc: 0.9808 - val_loss: 0.0378 - val_acc: 0.9907
Epoch 5/10
60000/60000 [==============================] - 32s 537us/step - loss: 0.0531 - acc: 0.9837 - val_loss: 0.0333 - val_acc: 0.9915
Epoch 6/10
60000/60000 [==============================] - 33s 554us/step - loss: 0.0495 - acc: 0.9854 - val_loss: 0.0339 - val_acc: 0.9909
Epoch 7/10
60000/60000 [==============================] - 39s 649us/step - loss: 0.0475 - acc: 0.9851 - val_loss: 0.0330 - val_acc: 0.9929
Epoch 8/10
60000/60000 [==============================] - 39s 648us/step - loss: 0.0423 - acc: 0.9870 - val_loss: 0.0289 - val_acc: 0.9931
Epoch 9/10
60000/60000 [==============================] - 38s 633us/step - loss: 0.0421 - acc: 0.9874 - val_loss: 0.0290 - val_acc: 0.9928
Epoch 10/10
60000/60000 [==============================] - 39s 646us/step - loss: 0.0374 - acc: 0.9884 - val_loss: 0.0253 - val_acc: 0.9922
10000/10000 [==============================] - 3s 288us/step

Test loss: 0.025273755778139458
Test accuracy: 0.9922