Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters?
Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion Androutsopoulos

TL;DR
This study compares optimizer choices for fine-tuning pre-trained NLP models and finds that hyperparameter tuning, especially of the learning rate, largely removes performance differences among optimizers, which simplifies optimizer selection.
Contribution
The paper shows that tuning only the learning rate of an adaptive optimizer gives, in most cases, results comparable to tuning all of its hyperparameters, which offers practical guidance for optimizer selection in NLP fine-tuning.
Findings
Tuning hyperparameters reduces differences among optimizers.
Adaptive optimizers with tuned learning rates perform similarly.
SGD with Momentum is best when hyperparameters are not tuned.
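To make "tuning only the learning rate" concrete, here is a minimal sketch in plain Python. It is not the paper's experimental setup: the toy quadratic objective, the learning-rate grid, the step count, and the helper names (`loss_and_grad`, `run_adam`, `tune_learning_rate`) are all assumptions chosen for illustration. It runs a hand-written Adam update with its other hyperparameters left at commonly used defaults and sweeps only the learning rate.

```python
import math

def loss_and_grad(w):
    # Hypothetical toy objective (not from the paper):
    # f(w) = 0.5 * (1 * w0^2 + 25 * w1^2), an ill-conditioned quadratic.
    scales = (1.0, 25.0)
    loss = 0.5 * sum(s * x * x for s, x in zip(scales, w))
    grad = [s * x for s, x in zip(scales, w)]
    return loss, grad

def run_adam(lr, steps=200, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam update with bias correction; only `lr` is varied by the
    # sweep below, while beta1, beta2 and eps stay at their default values.
    w = [1.0, 1.0]
    m = [0.0, 0.0]  # first-moment (mean) estimates
    v = [0.0, 0.0]  # second-moment (uncentered variance) estimates
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return loss_and_grad(w)[0]  # final loss after `steps` updates

def tune_learning_rate(run, grid):
    # Evaluate each candidate learning rate and keep the one with lowest loss.
    results = {lr: run(lr) for lr in grid}
    best_lr = min(results, key=results.get)
    return best_lr, results

LR_GRID = [1e-4, 1e-3, 1e-2, 1e-1]  # assumed grid, for illustration only
best_lr, results = tune_learning_rate(run_adam, LR_GRID)

for lr in LR_GRID:
    print(f"lr={lr:g}  final loss={results[lr]:.6f}")
print("selected learning rate:", best_lr)
```

In an actual fine-tuning run the selection criterion would be a validation metric on held-out data rather than the final training loss, and the sweep would wrap a full training loop instead of a toy objective.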
Abstract
NLP research has explored different neural model architectures and sizes, datasets, training objectives, and transfer learning techniques. However, the choice of optimizer during training has not been explored as extensively. Typically, some variant of Stochastic Gradient Descent (SGD) is employed, selected among numerous variants, using unclear criteria, often with minimal or no tuning of the optimizer's hyperparameters. Experimenting with five GLUE datasets, two models (DistilBERT and DistilRoBERTa), and seven popular optimizers (SGD, SGD with Momentum, Adam, AdaMax, Nadam, AdamW, and AdaBound), we find that when the hyperparameters of the optimizers are tuned, there is no substantial difference in test performance across the five more elaborate (adaptive) optimizers, despite differences in training loss. Furthermore, tuning just the learning rate is in most cases as good as tuning…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: SGD with Momentum · Stochastic Gradient Descent · AdamW · Adam · AdaMax
