The Marginal Value of Adaptive Gradient Methods in Machine Learning

After their revealing paper on the generalization of deep learning, Benjamin Recht and co-authors have delved into the failure modes of adaptive gradient methods such as AdaGrad, RMSProp, and Adam.

First, they construct a linearly separable classification problem on which adaptive methods fail miserably, achieving a classification accuracy arbitrarily close to random guessing. Conversely, standard gradient descent methods, which converge to the minimum-norm solution, succeed in finding the correct solution with zero prediction error.

Despite its artificiality, this simple example clearly shows that adaptive and non-adaptive gradient methods can converge to very different solutions.
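
The contrast can be sketched numerically. Below, gradient descent on an underdetermined least-squares problem (a stand-in for the paper's separable construction, not its exact example) is compared against a sign-gradient update, a caricature of the per-coordinate rescaling that adaptive methods perform; all names and hyperparameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined least squares: 5 samples, 20 features, so there are
# infinitely many zero-loss solutions.
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad(w):
    """Gradient of the loss 0.5 * ||Xw - y||^2."""
    return X.T @ (X @ w - y)

# Plain gradient descent from zero: every update lies in the row space
# of X, so the iterates converge to the minimum-norm solution.
w_gd = np.zeros(d)
for _ in range(20000):
    w_gd -= 0.01 * grad(w_gd)

# Sign-gradient update: its steps leave the row space of X, so it
# settles on a different interpolating solution.
w_sign = np.zeros(d)
for _ in range(20000):
    w_sign -= 0.001 * np.sign(grad(w_sign))

w_min = np.linalg.pinv(X) @ y  # the minimum-norm solution

print(np.linalg.norm(w_gd - w_min))    # close to zero
print(np.linalg.norm(w_sign - w_min))  # noticeably larger
```

Both methods drive the training loss to (near) zero, yet they end at different solutions; the paper's argument is that on its separable example this difference translates into drastically different test accuracy.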

Then, the authors provide substantial experimental evidence that, given the same amount of tuning, adaptive methods do not generalize as well as non-adaptive ones on four machine learning tasks addressed with deep learning architectures:

  1. Image classification (C1) on the CIFAR-10 dataset with a deep convolutional network;
  2. Character-level language modeling (L1) on the War and Peace novel with a 2-layer LSTM;
  3. Discriminative parsing (L2) on the Penn Treebank dataset with an LSTM;
  4. Generative parsing (L3) on the Penn Treebank dataset with an LSTM.

[Table: experimental results from Wilson et al.]

The experiments show the following findings:

  1. “Adaptive methods find solutions that generalize worse than those found by non-adaptive methods.”
  2. “Even when the adaptive methods achieve the same training loss or lower than non-adaptive methods, the development or test performance is worse.”
  3. “Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development set.”
  4. “Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.”
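
As a sketch of the last point, the snippet below runs a hand-rolled Adam on a toy quadratic, once with the common default learning rate of 1e-3 and once with a larger, step-decayed one under the same step budget; the objective and hyperparameters are illustrative and are not the paper's experimental setup:

```python
import numpy as np

def adam(grad_fn, w0, lr, steps, decay_every=None, decay=0.5,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with an optional step-decay schedule on the learning rate."""
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        if decay_every and t % decay_every == 0:
            lr *= decay
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy quadratic 0.5 * ||w - w_star||^2 (illustrative only).
rng = np.random.default_rng(0)
w_star = rng.standard_normal(10)
grad_fn = lambda w: w - w_star
loss = lambda w: 0.5 * np.sum((w - w_star) ** 2)

budget = 200  # the same small step budget for both runs
w_default = adam(grad_fn, np.zeros(10), lr=1e-3, steps=budget)
w_tuned = adam(grad_fn, np.zeros(10), lr=0.05, steps=budget, decay_every=50)

print(loss(w_default), loss(w_tuned))  # the tuned run ends far lower
```

Because Adam's effective per-coordinate step size is roughly the learning rate, the default 1e-3 simply cannot cover the distance to the optimum within the budget, while the larger decayed rate both makes fast progress and damps the final oscillation.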

The plots below illustrate these findings for the image classification task.

[Figure: CIFAR-10 training and test curves from Wilson et al.]

The paper can be found on arXiv.
