Efficient Training of Language Models with Compact and Consistent Next Token Distributions

Ashutosh Sathe; Sunita Sarawagi

arXiv:2407.02819·cs.CL·July 4, 2024

TL;DR

This paper shows that language models can be trained better and faster by pre-aggregating the corpus with a collapsed $n$-gram distribution, using a compact representation of the next-token distribution that avoids the cost of naively building and querying $n$-gram statistics.

Contribution

The authors propose a compact representation of the next-token distribution that, in expectation, aligns with the complete $n$-gram distribution while reducing variance across mini-batches relative to the standard next-token loss, enabling scalable, faster training of language models.

Findings

01. Significant improvements in model quality and convergence rate.

02. Enhanced scalability of training with larger datasets and models.

03. Efficient approximation of $n$-gram regularization benefits.

Abstract

Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; however, the construction and querying of such $n$-grams, if done naively, prove to be costly and significantly impede training speed, thereby limiting their application in modern large language model pre-training. We introduce an alternative compact representation of the next token distribution that, in expectation, aligns with the complete $n$-gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the $n$-gram regularized model and our approximation yield…
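To make the idea of a corpus-level $n$-gram next-token distribution concrete, the following is a minimal sketch on a toy corpus. It is not the authors' compact representation, whose details are not given here. It only illustrates the naive aggregation the abstract contrasts against, and the general fact that one-hot next-token targets average out to the $n$-gram distribution for a given context. The function names, the toy corpus, and the `predicted` probabilities are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_next_token_distributions(tokens, n):
    """Map each (n-1)-token context to its empirical next-token distribution."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    distributions = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        distributions[context] = {tok: c / total for tok, c in counter.items()}
    return distributions


def cross_entropy(target, predicted):
    """Cross-entropy of a predicted distribution against a (soft) target."""
    return -sum(p * math.log(predicted[tok]) for tok, p in target.items())


# Toy corpus (assumption); a real pipeline would use token ids from a tokenizer.
corpus = "the cat sat the cat ran the dog sat".split()
dists = ngram_next_token_distributions(corpus, n=2)
target = dists[("the",)]  # aggregated bigram target for the context "the"

# Hypothetical model output for the context ("the",).
predicted = {"cat": 0.6, "dog": 0.3, "sat": 0.05, "ran": 0.05}

# Standard next-token loss, averaged over every occurrence of the context.
observed_next = [
    corpus[i + 1] for i in range(len(corpus) - 1) if corpus[i] == "the"
]
mean_one_hot_loss = sum(
    cross_entropy({tok: 1.0}, predicted) for tok in observed_next
) / len(observed_next)

# Loss against the aggregated bigram distribution for the same context.
aggregated_loss = cross_entropy(target, predicted)

print(mean_one_hot_loss, aggregated_loss)
```

In this sketch the two printed losses agree up to floating-point error: each individual one-hot target is a noisy sample, while the aggregated target carries the same information for the context in a single distribution. The dictionary-of-counters table used here is the kind of naive construction that becomes costly at corpus scale.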

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis