TL;DR
This paper uses properties of martingales to derive non-vacuous generalization bounds for large language models. Because the bounds tighten with the number of training tokens rather than the number of documents, they remain non-vacuous for models as large as LLaMA2-70B.
Contribution
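It derives generalization bounds that draw on the non-IID tokens of a training set rather than only its IID documents. This yields non-vacuous bounds for large language models that generate high-quality text, where previous compression-based bounds were vacuous at the billion-parameter scale.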
Findings
Achieved non-vacuous bounds for LLaMA2-70B.
Bound tightness improves with the number of training tokens rather than the number of documents.
Demonstrated bounds for models generating high-quality text.
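To give a rough sense of why counting tokens instead of documents can help, the sketch below evaluates the deviation term of the Azuma-Hoeffding inequality, a standard concentration result for martingales with bounded differences. This is an illustration only, not the bound derived in the paper: the actual bound also involves a model-complexity term that this summary does not state. The function name `azuma_slack`, the bound `c`, and the corpus sizes are all hypothetical values chosen for the example.

```python
import math


def azuma_slack(n: int, c: float = 1.0, delta: float = 0.05) -> float:
    """Azuma-Hoeffding deviation term for n martingale differences with |D_i| <= c.

    With probability at least 1 - delta, the average of the differences is
    at most c * sqrt(2 * ln(1 / delta) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return c * math.sqrt(2.0 * math.log(1.0 / delta) / n)


# Hypothetical corpus sizes, chosen only for illustration.
n_documents = 10_000
tokens_per_document = 1_000
n_tokens = n_documents * tokens_per_document

# Same inequality, two choices of what counts as a data point.
slack_docs = azuma_slack(n_documents)
slack_tokens = azuma_slack(n_tokens)

print(f"slack with documents as data points: {slack_docs:.4f}")
print(f"slack with tokens as data points:    {slack_tokens:.4f}")
```

Under these assumptions the deviation term shrinks by a factor of the square root of the tokens-per-document ratio. In practice the per-token and per-document loss ranges differ, so the same `c` for both cases is a simplification.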
Abstract
Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually…
Taxonomy
Methods: Sparse Evolutionary Training
