ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining
Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach

TL;DR
ESLM is a risk-aware, token-level selection method for language model pretraining. By focusing training on high-risk, informative tokens, it reduces computational cost while maintaining or improving performance.
Contribution
This paper presents ESLM, a novel risk-averse token selection algorithm for efficient language model pretraining, with theoretical foundations and practical improvements demonstrated on GPT-2.
Findings
Reduces training FLOPs significantly.
Maintains or improves perplexity and downstream task performance.
Scales across model sizes and datasets.
Abstract
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to…
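The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the confidence level `alpha`, and the exact empirical-quantile rule are assumptions made for illustration. The sketch computes a value-at-risk threshold over per-token scores, keeps the tokens at or above it, and averages the loss over the retained tokens only. When the score is the token loss itself, that average is a simple empirical estimate of the tail mean that CVaR refers to.

```python
import math

def var_threshold(scores, alpha):
    """Empirical value-at-risk at level alpha (assumed quantile rule).

    Returns the score at sorted index ceil(alpha * n), clipped to the last
    index, so roughly a (1 - alpha) fraction of tokens lies at or above it.
    """
    ordered = sorted(scores)
    k = min(len(ordered) - 1, math.ceil(alpha * len(ordered)))
    return ordered[k]

def select_tokens(scores, alpha):
    """Boolean mask of tokens whose score is at or above the VaR threshold."""
    tau = var_threshold(scores, alpha)
    return [s >= tau for s in scores], tau

def selective_loss(losses, mask):
    """Mean loss over retained tokens only; dropped tokens contribute nothing."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)

# Hypothetical per-token losses for one batch (illustrative values only).
token_losses = [0.1, 2.0, 0.3, 3.5, 0.2, 2.5, 0.4, 3.0]

# Loss-based setting: the selection score is the token loss itself.
mask, tau = select_tokens(token_losses, alpha=0.5)
batch_loss = selective_loss(token_losses, mask)

print(tau)         # threshold used for this batch
print(mask)        # which tokens are retained
print(batch_loss)  # mean loss over the retained tokens
```

In an entropy-based variant, `scores` would hold per-token predictive entropies while `losses` stays the per-token training loss. In a real training loop the mask would be applied to the per-token loss tensor before the backward pass, so that only the retained tokens contribute gradients.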
Taxonomy
Topics: Speech and dialogue systems · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Dense Connections · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Weight Decay · Multi-Head Attention
