Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models

Jared Lichtarge; Chris Alberti; Shankar Kumar

arXiv:2209.04683·cs.CL·September 13, 2022

Open Access

TL;DR

This paper introduces a simple gradient-based hyperparameter tuning method for sequence-to-sequence models, demonstrating efficiency and performance improvements in machine translation and language understanding tasks.

Contribution

This paper is the first to apply gradient-based hyper-parameter optimization to sequence-to-sequence models, showing gains over strong baselines, including Bayesian hyper-parameter optimization, on machine translation and natural language understanding tasks.

Findings

01

Gradient-based tuning outperforms Bayesian optimization.

02

Learned hyper-parameter schedules can outperform tuned constant values.

03

Learning hyper-parameters during pretraining improves downstream performance.

Abstract

Recent trends towards training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive, motivating the study of more efficient methods. Gradient-based hyper-parameter optimization offers the capacity to tune hyper-parameters during training, yet has not previously been studied in a sequence-to-sequence setting. We apply a simple and general gradient-based hyperparameter optimization method to sequence-to-sequence tasks for the first time, demonstrating both efficiency and performance gains over strong baselines for both Neural Machine Translation and Natural Language Understanding (NLU) tasks (via T5 pretraining). For translation, we show the method generalizes across language pairs, is more efficient than Bayesian hyper-parameter…
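
As a rough illustration of what tuning a hyper-parameter by gradient during training can look like, the sketch below applies a hypergradient-style update to the learning rate of plain SGD on a toy quadratic loss. It is not the paper's method or code. The function names, the toy loss, and the values of `lr` and `hyper_lr` are assumptions chosen only for illustration.

```python
def loss(theta, curvature):
    """Toy quadratic loss: 0.5 * sum_i a_i * theta_i^2 (illustrative only)."""
    return 0.5 * sum(a * t * t for a, t in zip(curvature, theta))


def grad(theta, curvature):
    """Gradient of the toy quadratic loss with respect to theta."""
    return [a * t for a, t in zip(curvature, theta)]


def hypergradient_sgd(theta, curvature, lr=0.01, hyper_lr=0.001, steps=100):
    """SGD whose learning rate is itself adjusted by a gradient step.

    After the update theta_t = theta_{t-1} - lr * g_{t-1}, the derivative of
    the loss at theta_t with respect to lr is -(g_t . g_{t-1}). Descending
    along that derivative gives lr <- lr + hyper_lr * (g_t . g_{t-1}).
    """
    prev_grad = [0.0] * len(theta)  # no previous gradient before step one
    lr_history = []
    for _ in range(steps):
        g = grad(theta, curvature)
        # Hyper-parameter update: raise lr while successive gradients agree,
        # lower it when they point in opposing directions.
        lr = lr + hyper_lr * sum(gi * pi for gi, pi in zip(g, prev_grad))
        # Ordinary parameter update using the freshly adjusted learning rate.
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        prev_grad = g
        lr_history.append(lr)
    return theta, lr_history


if __name__ == "__main__":
    curvature = [1.0, 3.0]
    final_theta, lrs = hypergradient_sgd([1.0, -2.0], curvature)
    print("final loss:", loss(final_theta, curvature))
    print("first and last learning rate:", lrs[0], lrs[-1])
```

The sequence of learning rates recorded in `lr_history` is a schedule produced in a single training run, rather than a constant chosen by repeated trials. That is the general appeal of tuning hyper-parameters during training that the abstract refers to.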

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Adafactor · SentencePiece · Dropout · Dense Connections · Residual Connection · Layer Normalization