Influence-driven Curriculum Learning for Pre-training on Limited Data

Loris Schoenegger; Lukas Thoma; Terra Blevins; Benjamin Roth

arXiv:2508.15475·cs.CL·September 29, 2025

Open Access

TL;DR

This paper proposes a new curriculum learning approach for pre-training language models by sorting data based on training data influence, leading to significant performance improvements over random data ordering.

Contribution

It introduces a model-centric difficulty metric, training data influence, for curriculum learning, demonstrating its effectiveness in pre-training language models on limited data.

Findings

01

Models trained on influence-based curricula outperform models trained in random order by over 10 percentage points on benchmarks.

02

Curriculum learning becomes competitive for language model pre-training when a model-centric difficulty metric replaces conventional human-centered ones.

03

The approach improves pre-training efficiency on benchmark datasets.

Abstract

Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of…
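
To make the idea of sorting examples by an influence score concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say which influence estimator or sort direction the paper uses. The sketch assumes a toy linear model with squared-error loss and a first-order influence proxy (the dot product between a training example's gradient and the mean gradient over a small reference set). All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def grad_squared_error(w: Sequence[float], x: Sequence[float], y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy linear model)."""
    err = dot(w, x) - y
    return [err * xi for xi in x]


def influence_scores(w: Sequence[float],
                     train: Sequence[Example],
                     reference: Sequence[Example]) -> List[float]:
    """Assumed proxy: gradient of each training example dotted with the
    mean gradient over the reference set. A larger value means a gradient
    step on that example is expected to reduce reference loss more."""
    ref_grads = [grad_squared_error(w, x, y) for x, y in reference]
    mean_ref = [sum(g[i] for g in ref_grads) / len(ref_grads)
                for i in range(len(w))]
    return [dot(grad_squared_error(w, x, y), mean_ref) for x, y in train]


def curriculum_order(scores: Sequence[float], descending: bool = False) -> List[int]:
    """Return training indices sorted by score. The direction is a
    parameter because the excerpt does not state which one the paper uses."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=descending)


w = [0.0, 0.0]
train = [([1.0, 1.0], 3.0), ([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
reference = [([1.0, 1.0], 1.0)]

scores = influence_scores(w, train, reference)
order_low_first = curriculum_order(scores)
order_high_first = curriculum_order(scores, descending=True)
```

In a real pre-training setup the scores would come from a language model and an established influence estimator. Only the final step, sorting the data by score before training, carries over directly from this sketch.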

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Natural Language Processing Techniques