Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Sanae Lotfi; Yilun Kuang; Brandon Amos; Micah Goldblum; Marc Finzi; Andrew Gordon Wilson

arXiv:2407.18158 · stat.ML · July 26, 2024

TL;DR

This paper uses properties of martingales to derive non-vacuous generalization bounds for large language models. Because the bounds benefit from the number of training tokens rather than the number of documents, they remain non-vacuous for models as large as LLaMA2-70B.

Contribution

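The paper presents a method that treats the tokens of a training set, rather than its documents, as the data points of the bound. This yields meaningful generalization bounds for large language models that generate high-quality text, where previous compression-based bounds were vacuous at the billion-parameter scale or applied only to compressed models that generate low-quality text.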

Findings

01

Achieved non-vacuous bounds for LLaMA2-70B.

02

Bound tightness benefits from the number of training tokens rather than the number of documents.

03

Demonstrated bounds for models generating high-quality text.
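
The second finding above can be illustrated with a generic concentration term. The sketch below is not the paper's bound: it uses the standard Hoeffding/Azuma-style slack, range * sqrt((C + ln(1/delta)) / (2n)), where Hoeffding's inequality covers IID samples and Azuma's inequality gives the same rate for bounded martingale differences. The function name `concentration_term`, the complexity value, and the dataset sizes are all hypothetical and chosen only to show how the slack shrinks when n counts tokens instead of documents.

```python
import math


def concentration_term(num_points, complexity_nats, delta=0.05, loss_range=1.0):
    """Generic Hoeffding/Azuma-style slack (an illustration, not the paper's bound).

    Returns loss_range * sqrt((complexity_nats + ln(1/delta)) / (2 * num_points)).
    """
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    numerator = complexity_nats + math.log(1.0 / delta)
    return loss_range * math.sqrt(numerator / (2.0 * num_points))


# Hypothetical numbers, chosen only for illustration.
num_documents = 1_000_000
tokens_per_document = 1_000
complexity_nats = 5e7  # hypothetical model complexity term, in nats

# Same complexity term, but n counts documents in one case and tokens in the other.
doc_level = concentration_term(num_documents, complexity_nats)
token_level = concentration_term(num_documents * tokens_per_document, complexity_nats)

print(f"document-level slack: {doc_level:.3f}")
print(f"token-level slack:    {token_level:.3f}")
print(f"improvement factor:   {doc_level / token_level:.2f}")
```

With everything else held fixed, the slack shrinks by the square root of the number of tokens per document, which is why counting tokens can turn a vacuous term into a non-vacuous one under these assumptions.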

Abstract

Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually…

Videos

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models · SlidesLive

Taxonomy

Methods: Sparse Evolutionary Training