Sharpness-Aware Minimization Improves Language Model Generalization

Dara Bahri; Hossein Mobahi; Yi Tay

arXiv:2110.08529 · cs.CL · March 17, 2022 · 1 citation

Open Access

TL;DR

This paper shows that Sharpness-Aware Minimization (SAM) improves the generalization of language models on SuperGLUE, GLUE, and several question-answering benchmarks, especially when training data is limited, without much computational overhead.

Contribution

The paper applies SAM, a recently proposed optimization procedure, to language model training and shows that it substantially improves generalization without much computational overhead.

Findings

01

SAM improves performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA.

02

The gains are particularly large when training data for these tasks is limited.

03

SAM encourages convergence to flatter minima, which improves generalization.

Abstract

The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
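
The abstract describes SAM only as a procedure that encourages convergence to flatter minima. As a rough illustration of what one update looks like, the sketch below follows the commonly stated two-step form of SAM: first perturb the weights by a step of L2 norm rho along the normalized gradient, then apply the gradient computed at that perturbed point to the original weights. This is a minimal sketch, not the authors' implementation. The names `sam_step`, `toy_loss`, and `toy_grad`, the plain SGD base optimizer, the values of `lr` and `rho`, and the toy quadratic loss are all assumptions made for illustration.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on a list of floats (hypothetical helper).

    Assumes plain SGD as the base optimizer; a real setup would pass the
    perturbed-point gradient to whatever optimizer is in use.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction, leave weights unchanged
    # Ascent step: first-order worst-case point inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: gradient at the perturbed point, applied to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_loss(w):
    # Toy quadratic with one sharp direction (w[0]) and one flat direction (w[1]).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def toy_grad(w):
    # Analytic gradient of toy_loss.
    return [10.0 * w[0], 0.1 * w[1]]


w = [1.0, 1.0]
initial_loss = toy_loss(w)
for _ in range(50):
    w = sam_step(w, toy_grad, lr=0.05, rho=0.05)
final_loss = toy_loss(w)
```

Each update needs two gradient evaluations instead of one, which is where SAM's extra computational cost comes from.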

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Adafactor · SentencePiece · Cosine Annealing · Inverse Square Root Schedule · Weight Decay