Transcending Scaling Laws with 0.1% Extra Compute

Yi Tay; Jason Wei; Hyung Won Chung; Vinh Q. Tran; David R. So; Siamak Shakeri; Xavier Garcia; Huaixiu Steven Zheng; Jinfeng Rao; Aakanksha Chowdhery; Denny Zhou; Donald Metzler; Slav Petrov; Neil Houlsby; Quoc V. Le; Mostafa Dehghani

arXiv:2210.11399·cs.CL·November 17, 2022·6 cites

Open Access · 3 Models

TL;DR

This paper introduces UL2R, a method that continues training an existing large language model for a few more steps with UL2's mixture-of-denoiser objective, improving downstream performance, scaling behavior, and emergent abilities at almost negligible extra compute.

Contribution

The paper presents UL2R, a simple continuation training approach that improves large language models' scaling properties without extra data or significant computational costs.

Findings

01. At 540B scale, U-PaLM matches the performance of the final PaLM 540B model at roughly half the training compute (an approximately 2x computational savings rate).

02. U-PaLM improves on PaLM across diverse NLP tasks and shows emergent abilities.

03. The method yields qualitative improvements in infilling capabilities.

Abstract

Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at…
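The abstract's key idea, continued training with a mixture-of-denoiser objective, can be illustrated with a minimal sketch of how one training example might be built. Everything specific here is an assumption made for illustration and is not taken from this page: the three denoiser modes (R for regular span corruption, S for sequential prefix-LM, X for extreme corruption) follow how UL2 is commonly described, while the mode tokens, sentinel format, span lengths, corruption rates, and the function names `corrupt_spans`, `prefix_lm`, and `mixture_of_denoisers_example` are hypothetical and are not the settings used for U-PaLM.

```python
import random


def corrupt_spans(tokens, span_len, corruption_rate, rng):
    """Replace random non-overlapping spans with sentinels.

    Returns (inputs, targets): inputs keep the uncorrupted tokens with one
    sentinel per removed span; targets list each sentinel followed by the
    tokens it replaced. Sentinel naming is an illustrative assumption.
    """
    n = len(tokens)
    n_spans = max(1, round(n * corruption_rate / span_len))
    # Candidate starts lie on a grid of span_len-sized blocks, so spans never overlap.
    blocks = list(range(0, n - span_len + 1, span_len))
    starts = sorted(rng.sample(blocks, min(n_spans, len(blocks))))
    inputs, targets, pos = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[pos:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + span_len]
        pos = start + span_len
    inputs += tokens[pos:]
    return inputs, targets


def prefix_lm(tokens, rng):
    """Split the sequence into a visible prefix and a suffix to predict."""
    split = rng.randint(1, len(tokens) - 1)
    return tokens[:split], tokens[split:]


def mixture_of_denoisers_example(tokens, rng):
    """Sample one denoiser mode and build an (inputs, targets) pair for it."""
    mode = rng.choice(["R", "S", "X"])
    if mode == "R":
        # Short spans, low corruption rate (assumed values).
        inputs, targets = corrupt_spans(tokens, 3, 0.15, rng)
    elif mode == "X":
        # Long spans, high corruption rate (assumed values).
        inputs, targets = corrupt_spans(tokens, 8, 0.5, rng)
    else:
        inputs, targets = prefix_lm(tokens, rng)
    # Prepend a mode token so the model can tell which task it is solving.
    return [f"[{mode}]"] + inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [f"t{i}" for i in range(32)]
    inputs, targets = mixture_of_denoisers_example(tokens, rng)
    print(inputs)
    print(targets)
```

In a continued-training setup of the kind the abstract describes, pairs like these would be fed to the existing model for a small number of additional steps; the sketch covers only example construction, not the training loop or the data pipeline.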

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Pathways Language Model