Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models
Jared Lichtarge, Chris Alberti, Shankar Kumar

TL;DR
This paper introduces a simple gradient-based hyperparameter tuning method for sequence-to-sequence models, demonstrating efficiency and performance improvements in machine translation and language understanding tasks.
Contribution
The paper is the first to apply gradient-based hyper-parameter optimization to sequence-to-sequence models, reporting efficiency and performance gains over strong baselines in both Neural Machine Translation and Natural Language Understanding (via T5 pretraining).
Findings
Gradient-based tuning is more efficient than Bayesian hyper-parameter optimization.
Learned hyper-parameter schedules can outperform constant-valued tuning.
Learning hyper-parameters during pretraining improves downstream performance.
Abstract
Recent trends towards training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive, motivating the study of more efficient methods. Gradient-based hyper-parameter optimization offers the capacity to tune hyper-parameters during training, yet has not previously been studied in a sequence-to-sequence setting. We apply a simple and general gradient-based hyperparameter optimization method to sequence-to-sequence tasks for the first time, demonstrating both efficiency and performance gains over strong baselines for both Neural Machine Translation and Natural Language Understanding (NLU) tasks (via T5 pretraining). For translation, we show the method generalizes across language pairs, is more efficient than Bayesian hyper-parameter…
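To make the idea of tuning a hyper-parameter during training concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a toy one-parameter linear regression, a greedy one-step lookahead, and a small held-out batch for the hyper-parameter update. The names grad, tune_while_training, and meta_lr are hypothetical and chosen only for illustration. After each training step, the learning rate is nudged along the gradient of the held-out loss with respect to the learning rate.

```python
# Illustrative sketch only: greedy, one-step gradient-based tuning of a
# learning rate on a toy 1-D linear regression. This is an assumption-laden
# example, not the setup or code used in the paper.

def grad(w, data):
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y)**2 over data."""
    return sum((w * x - y) * x for x, y in data) / len(data)

def tune_while_training(train, held_out, w=0.0, lr=0.01, meta_lr=0.001, steps=200):
    """Train w with SGD while adapting lr by gradient descent on held-out loss."""
    for _ in range(steps):
        g_train = grad(w, train)
        w_next = w - lr * g_train  # ordinary training step

        # Chain rule: d L_held_out(w_next) / d lr
        #   = grad_held_out(w_next) * d w_next / d lr
        #   = grad_held_out(w_next) * (-g_train)
        hypergrad = -grad(w_next, held_out) * g_train

        # Hyper-parameter step; clamp so the learning rate stays non-negative.
        lr = max(lr - meta_lr * hypergrad, 0.0)
        w = w_next
    return w, lr

if __name__ == "__main__":
    # Toy data generated from y = 3 * x (hypothetical).
    train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
    held_out = [(1.5, 4.5), (2.5, 7.5)]
    w, lr = tune_while_training(train, held_out)
    print(f"learned weight: {w:.4f}, learned learning rate: {lr:.4f}")
```

In this toy setting the learning rate grows from its small initial value while the held-out loss is still falling quickly, then settles once the weight has converged. A real sequence-to-sequence setup would obtain the same kind of derivative through automatic differentiation rather than a hand-derived formula.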
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Adafactor · SentencePiece · Dropout · Dense Connections · Residual Connection · Layer Normalization
