Training and generalization dynamics in deep linear neural networks

Speaker: Andrew Saxe, Harvard

When: April 4, 2017 (Tue), 11:00AM to 12:00PM
Location: SCI 328

This event is part of the Biophysics Seminars.

Anatomically, the brain is deep; computationally, deep learning is known to be hard. How might depth impact learning in the brain? To understand the specific ramifications of depth, I develop the theory of learning in deep linear neural networks. I will describe exact solutions to the dynamics of learning which specify how every weight in the network evolves over the course of training. The theory answers fundamental questions such as how learning speed scales with depth, how structured data sets are embedded into hidden neural representations, and why unsupervised pretraining accelerates learning. Turning to generalization error, we use random matrix theory to analyze the cognitively relevant "high-dimensional" regime, where the number of training examples is on the order of, or even less than, the number of adjustable synapses. We find that generalization error can diverge in certain instances if training is run forever, but that implicit regularization in the form of early stopping and small initial weights substantially improves performance. Next, we turn to the question of how complex a model should be for optimal generalization. We describe a counter-intuitive regime in which increasing the complexity of a model can lower both its approximation error and its estimation error, yielding substantial generalization benefits. This result may help explain the striking performance of even very large deep network models in practice, which often have more parameters than training samples. Finally, if time permits, I will describe an example of how these results may begin to inform our understanding of nonlinear networks. In particular, I will describe a setting in which a nonlinear network can be understood as a collection of linear networks learning in parallel, yielding learning dynamics dominated by the fastest linear network in the collection.
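
To make the first part of the abstract concrete, below is a minimal NumPy sketch of gradient descent in a two-layer deep linear network trained on a synthetic linear task. This is not the speaker's code or analysis; the network sizes, teacher singular values, learning rate, and initialization scale are illustrative assumptions. The sketch only shows the qualitative behavior described above: with small random initial weights, each input-output mode of the task is learned in its own roughly sigmoidal stage, strong modes before weak ones.

```python
# Minimal sketch (illustrative, not the speaker's code) of learning dynamics
# in a deep linear network y = W2 @ W1 @ x trained by gradient descent.
# It tracks the strength of each input-output mode of the composite map W2 @ W1
# over training; these mode strengths are the quantities whose exact
# trajectories the talk's theory characterizes.

import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 8, 8, 8
P = 200          # number of training examples (assumed)
lr = 0.02        # learning rate (assumed)
steps = 2000

# Synthetic "teacher" map with a few strong modes and several weak ones.
U, _ = np.linalg.qr(rng.standard_normal((n_out, n_out)))
V, _ = np.linalg.qr(rng.standard_normal((n_in, n_in)))
s_true = np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05])
W_teacher = U @ np.diag(s_true) @ V.T

X = rng.standard_normal((n_in, P))
Y = W_teacher @ X

# Small random initial weights, the regime the abstract highlights.
W1 = 1e-3 * rng.standard_normal((n_hidden, n_in))
W2 = 1e-3 * rng.standard_normal((n_out, n_hidden))

for t in range(steps):
    err = W2 @ W1 @ X - Y            # prediction error on the batch
    gW2 = err @ (W1 @ X).T / P       # gradient of squared error w.r.t. W2
    gW1 = W2.T @ err @ X.T / P       # gradient of squared error w.r.t. W1
    W2 -= lr * gW2
    W1 -= lr * gW1

    if t % 400 == 0:
        # Project the composite map onto the teacher's modes: each mode is
        # picked up in a separate, roughly sigmoidal stage, strongest first.
        strengths = np.diag(U.T @ (W2 @ W1) @ V)
        print(t, np.round(strengths, 2))
```

Running the sketch prints the projected mode strengths at several points during training; with these illustrative settings the strongest modes should approach their target values well before the weakest ones have moved appreciably, mirroring the stage-like dynamics the talk analyzes exactly.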