A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

Junyu Luo; Bohan Wu; Xiao Luo; Zhiping Xiao; Yiqiao Jin; Rong-Cheng Tu; Nan Yin; Yifan Wang; Jingyang Yuan; Wei Ju; Ming Zhang

arXiv:2510.25817 · cs.CL · October 31, 2025

TL;DR

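This survey reviews data-centric methods for data-efficient post-training of large language models. It covers data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems as ways to address the high cost of manual annotation and the diminishing marginal returns of scaling data.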

Contribution

It provides the first systematic taxonomy of data-efficient LLM post-training methods and outlines future research directions from a data-centric perspective.

Findings

01. Taxonomy of data-efficient post-training methods

02. Summary of representative approaches in each category

03. Identification of open problems and future research avenues

Abstract

Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM…
