ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining

Melis Ilayda Bal; Volkan Cevher; Michael Muehlebach

arXiv:2505.19893 · cs.LG · May 27, 2025

TL;DR

ESLM introduces a risk-aware, token-level selection method for language model pretraining that reduces computational costs while maintaining or improving performance, by focusing on high-risk, informative tokens during training.

Contribution

This paper presents ESLM, a novel risk-averse token selection algorithm for efficient language model pretraining, with theoretical foundations and practical improvements demonstrated on GPT-2.

Findings

01. Reduces training FLOPs significantly.
02. Maintains or improves perplexity and downstream tasks.
03. Scales across model sizes and datasets.

Abstract

Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to…
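The thresholding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-token score is the loss, takes the value-at-risk threshold to be the empirical alpha-quantile of the scores in one batch, and averages the loss over the retained tokens only. The names `var_threshold`, `select_tokens`, `selective_loss`, and the parameter `alpha` are hypothetical and chosen for illustration.

```python
import math


def var_threshold(scores, alpha):
    """Empirical value-at-risk: the alpha-quantile of the per-token scores.

    Assumption: the lowest floor(alpha * n) scores fall below the threshold.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(scores)
    # Index of the smallest score that is still retained.
    k = min(int(math.floor(alpha * len(ordered))), len(ordered) - 1)
    return ordered[k]


def select_tokens(scores, alpha):
    """Return a 0/1 mask keeping tokens whose score is at or above the threshold."""
    tau = var_threshold(scores, alpha)
    return [1 if s >= tau else 0 for s in scores]


def selective_loss(losses, mask):
    """Average the loss over retained tokens only (a CVaR-style estimate)."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept)


# Illustrative per-token losses for one batch (made-up numbers).
losses = [0.1, 2.3, 0.4, 1.7, 0.2, 3.1, 0.9, 0.3]
mask = select_tokens(losses, alpha=0.5)
print(mask)
print(selective_loss(losses, mask))
```

With `alpha=0.5` the sketch keeps the upper half of the batch by loss, so only those tokens would contribute to the training objective. In an actual training loop the mask would be applied to the per-token loss tensor before the backward pass; how ESLM sets the risk level and handles the entropy-based variant is not covered by the truncated abstract and is left out here.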

Taxonomy

Topics: Speech and dialogue systems · Text Readability and Simplification · Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Dense Connections · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Weight Decay · Multi-Head Attention