Benjamin Recht and his co-authors, after their revealing paper on generalization in deep learning, have turned their attention to the failures of adaptive gradient methods.
First of all, they constructed a linearly separable classification problem on which adaptive methods fail miserably, achieving a classification accuracy arbitrarily close to random guessing. Conversely, standard gradient descent, which converges to the minimum-norm solution, finds the correct solution with zero prediction error.
Despite its artificiality, this simple example clearly shows that adaptive and non-adaptive gradient methods can converge to very different solutions.
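The minimum-norm behavior behind this contrast can be seen on a toy problem. The numpy sketch below (my own illustration, not the paper's exact construction) fits an underdetermined linear model two ways: plain gradient descent started at zero, which stays in the row space of the data and so reaches the minimum-norm interpolating solution, and a sign-based update, a crude caricature of how the paper shows adaptive methods can behave, which ends up somewhere quite different.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined problem: more features than samples, so infinitely
# many weight vectors fit the data exactly.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))

def fit(update, lr=0.01, steps=5000):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # least-squares gradient
        w -= lr * update(grad)
    return w

w_gd   = fit(lambda g: g)           # plain gradient descent
w_sign = fit(lambda g: np.sign(g))  # sign-based caricature of adaptive updates

w_min = np.linalg.pinv(X) @ y       # the minimum-norm interpolating solution

print(np.linalg.norm(w_gd - w_min))    # essentially zero: GD from zero finds it
print(np.linalg.norm(w_sign - w_min))  # much larger: a different solution
```

The gradient is always a combination of the rows of `X`, so gradient descent from zero can never leave the row space; taking coordinatewise signs destroys that property, which is the mechanism the paper exploits.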
Then, the authors provide substantial experimental evidence that, given the same amount of tuning, adaptive methods do not generalize as well as non-adaptive ones on four machine learning tasks addressed with deep learning architectures:
- Image classification (C1) on the CIFAR-10 dataset with a deep convolutional network;
- Character-level language modeling (L1) on the *War and Peace* novel with a 2-layer LSTM;
- Discriminative (L2) and generative (L3) parsing on the Penn Treebank dataset with LSTMs.
The experiments show the following findings:
- “Adaptive methods find solutions that generalize worse than those found by non-adaptive methods.”
- “Even when the adaptive methods achieve the same training loss or lower than non-adaptive methods, the development or test performance is worse.”
- “Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development set.”
- “Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.”
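The last finding can be illustrated with a minimal sketch: a textbook Adam update (not the authors' implementation) on a toy least-squares problem, with a small log-spaced grid search over the initial learning rate. The problem, step count, and grid are illustrative assumptions; since the grid contains Adam's default rate of 1e-3, the tuned result can only match or improve on the default.

```python
import numpy as np

def adam(grad_fn, w, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Textbook Adam with bias correction."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        mhat = m / (1 - beta1 ** t)
        vhat = v / (1 - beta2 ** t)
        w = w - lr * mhat / (np.sqrt(vhat) + eps)
    return w

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
loss = lambda w: 0.5 * np.mean((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b) / len(b)

w0 = np.zeros(10)
default = loss(adam(grad, w0, lr=1e-3, steps=500))         # Adam's default rate
tuned = min(loss(adam(grad, w0, lr=10.0 ** k, steps=500))  # log-spaced grid
            for k in range(-4, 0))
print(default, tuned)  # tuned is at least as good as the default
```

A real experiment would tune the decay scheme as well (the paper also varies it), but even this bare grid search makes the point that "no tuning needed" is not a safe assumption for Adam.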
The plots below illustrate these findings for the image classification task.