Sharpness-Aware Minimization Improves Language Model Generalization
Dara Bahri, Hossein Mobahi, Yi Tay

TL;DR
This paper shows that Sharpness-Aware Minimization (SAM) improves the generalization of language models across several benchmarks, especially when training data is limited, without much computational overhead.
Contribution
The paper applies SAM, a recently proposed optimization procedure, to language model training and shows substantial improvements in generalization without much computational overhead.
Findings
SAM improves performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA.
Gains are particularly large when training data for these tasks is limited.
SAM encourages convergence to flatter minima, which the paper links to better generalization.
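For readers unfamiliar with what "flatter minima" means in practice, the objective usually associated with SAM can be written as below. This is a sketch of the commonly stated formulation with an L2 neighborhood, not text taken from this page; the symbols w (weights), L (training loss), rho (neighborhood radius), and epsilon (weight perturbation) are notational assumptions.

```latex
% SAM objective: minimize the worst-case loss within a rho-ball around w
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)

% First-order approximation of the inner maximizer, and the gradient used for the update
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},
\qquad
g_{\mathrm{SAM}} = \nabla_w L(w) \,\Big|_{w + \hat{\epsilon}(w)}
```

A minimum is "flat" in this sense when the loss stays low everywhere in the rho-ball, so the worst-case loss is close to the loss at w itself.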
Abstract
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.
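To make the optimization procedure mentioned in the abstract concrete, here is a minimal NumPy sketch of one SAM-style update on a toy quadratic loss. It is an illustration under stated assumptions, not the authors' training code: the function names (`sam_step`, `toy_loss`, `toy_grad`), the toy loss, and the values of `lr` and `rho` are all hypothetical, and plain gradient descent stands in for whatever base optimizer the paper uses.

```python
import numpy as np

# Hypothetical toy problem: L(w) = 0.5 * w^T A w, with one sharp and one flat direction.
A = np.diag([1.0, 10.0])

def toy_loss(w):
    return 0.5 * float(w @ A @ w)

def toy_grad(w):
    return A @ w

def sam_step(w, grad_fn, lr=0.05, rho=0.05, eps=1e-12):
    """One SAM-style update (assumed L2 neighborhood of radius rho)."""
    g = grad_fn(w)                               # gradient at the current weights
    e = rho * g / (np.linalg.norm(g) + eps)      # step toward higher loss, norm ~ rho
    g_sam = grad_fn(w + e)                       # gradient at the perturbed weights
    return w - lr * g_sam                        # descend using the perturbed gradient

w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(200):
    w = sam_step(w, toy_grad)

initial_loss = toy_loss(w0)
final_loss = toy_loss(w)
```

Note that this sketch evaluates the gradient twice per step (once at `w`, once at `w + e`), which is where any extra cost of the procedure comes from in this simplified form.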
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Adafactor · SentencePiece · Cosine Annealing · Inverse Square Root Schedule · Weight Decay
