Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review
Neha Prakriya, Jui-Nan Yen, Cho-Jui Hsieh, Jason Cong

TL;DR
This paper introduces the LFR pedagogy, a dynamic training method for large language models that tracks the model's learning progress and revisits challenging data regions, improving retention and reducing training cost.
Contribution
The paper proposes the Learn-Focus-Review paradigm, a novel adaptive training approach that enhances LLM pretraining efficiency by prioritizing difficult data regions based on model performance.
Findings
LFR reduces training tokens by up to 19% while maintaining performance.
LFR-pretrained models outperform baseline models on various downstream tasks.
LFR matches or exceeds industry-standard models with fewer training tokens.
Abstract
Traditional Large Language Model (LLM) pretraining relies on autoregressive language modeling with randomly sampled data from web-scale datasets. Inspired by human learning techniques like spaced repetition, we hypothesize that random sampling leads to high training costs, lower-quality models, and significant data forgetting. To address these inefficiencies, we propose the Learn-Focus-Review (LFR) paradigm -- a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset that are more prone to being forgotten, enabling better retention and more efficient learning. Using the LFR paradigm, we pretrained Llama and GPT models on the SlimPajama and OpenWebText datasets, respectively. These models were evaluated on downstream tasks…
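The abstract describes LFR as tracking per-block learning performance and preferentially revisiting the blocks the model finds hardest. The following is a minimal sketch of one way to read that description, not the authors' implementation. The three-pass structure, the use of the most recent loss as the difficulty signal, and the names `BlockTracker`, `run_lfr`, `focus_fraction`, `focus_passes`, and `toy_train_step` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlockTracker:
    """Remembers the most recent loss observed for each data block."""
    last_loss: dict = field(default_factory=dict)

    def record(self, block_id, loss):
        self.last_loss[block_id] = loss

    def hardest(self, fraction):
        """Return the ids of the highest-loss blocks (at least one)."""
        k = max(1, int(len(self.last_loss) * fraction))
        ranked = sorted(self.last_loss, key=self.last_loss.get, reverse=True)
        return ranked[:k]


def run_lfr(block_ids, train_step, focus_fraction=0.5, focus_passes=2):
    """Hypothetical Learn -> Focus -> Review loop.

    train_step(block_id) is assumed to train on one block and return its loss.
    """
    tracker = BlockTracker()
    visits = []

    # Learn: one pass over every block, recording how hard each one is.
    for b in block_ids:
        tracker.record(b, train_step(b))
        visits.append(("learn", b))

    # Focus: repeatedly revisit only the blocks that are currently hardest.
    for _ in range(focus_passes):
        for b in tracker.hardest(focus_fraction):
            tracker.record(b, train_step(b))
            visits.append(("focus", b))

    # Review: revisit every block so that easier ones are not forgotten.
    for b in block_ids:
        tracker.record(b, train_step(b))
        visits.append(("review", b))

    return tracker, visits


# Toy stand-in for a real training step (assumption): each block has a
# starting loss, and every visit halves it.
difficulty = {"a": 1.0, "b": 5.0, "c": 2.0, "d": 8.0}


def toy_train_step(block_id):
    loss = difficulty[block_id]
    difficulty[block_id] = loss * 0.5
    return loss


tracker, visits = run_lfr(list(difficulty), toy_train_step)
for phase, block_id in visits:
    print(phase, block_id)
```

In this toy run the Focus phase spends its visits on blocks `d` and `b`, the two with the highest recorded loss, while the final Review pass touches every block again. A real pretraining setup would replace `toy_train_step` with an actual optimizer step and would likely use a metric such as perplexity for ranking.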
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · GPT · LLaMA · Pythia · Linear Layer · Multi-Head Attention · Cosine Annealing · Byte Pair Encoding · Softmax
