ZPD Detector: Data Selection via Capability-Difficulty Alignment for Large Language Models

Bo Yang; Yunkui Chen; Lanfei Feng; Yu Zhang; Shijian Li

arXiv:2601.10986 · cs.CL · January 19, 2026

Open Access

TL;DR

The paper introduces ZPD Detector, a dynamic data selection framework for large language models that aligns sample difficulty with model capability, enhancing training efficiency and offering new insights into training strategies.

Contribution

It proposes a novel data selection method based on the Zone of Proximal Development, modeling the evolving relationship between model capability and data difficulty.

Findings

01. Improves data utilization efficiency during training

02. Provides a dynamic sample selection strategy based on capability-difficulty alignment

03. Offers insights into training strategy design for large language models

Abstract

As the cost of training large language models continues to increase and high-quality training data become increasingly scarce, selecting high-value samples or synthesizing effective training data under limited data budgets has emerged as a critical research problem. Most existing data selection methods rely on static criteria, such as difficulty, uncertainty, or heuristics, and fail to model the evolving relationship between the model and the data. Inspired by the educational theory of the Zone of Proximal Development (ZPD), we propose ZPD Detector, a data selection framework that adopts a bidirectional perspective between models and data by explicitly modeling the alignment between sample difficulty and the model's current capability. ZPD Detector integrates difficulty calibration, model capability estimation based on Item Response Theory (IRT), and a capability-difficulty matching…
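
The abstract is truncated here, so the paper's exact formulation is not shown. As a rough illustration of the general idea only, the sketch below uses the standard one-parameter (Rasch) IRT model, where the probability of a correct response is sigmoid(theta - b) for ability theta and item difficulty b. It estimates theta from binary outcomes and then picks the samples whose difficulty lies closest to that estimate. The function names, the choice of the Rasch model, and the nearest-difficulty rule are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def rasch_prob(theta, b):
    """P(correct) under the 1PL (Rasch) model: sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(responses, difficulties, iters=50):
    """Maximum-likelihood estimate of ability theta via Newton's method.

    responses:    list of 0/1 outcomes (hypothetical: 1 = model solved the sample)
    difficulties: list of calibrated difficulty values b, one per sample
    """
    theta = 0.0
    for _ in range(iters):
        probs = [rasch_prob(theta, b) for b in difficulties]
        # Gradient and second derivative of the Rasch log-likelihood in theta.
        grad = sum(r - p for r, p in zip(responses, probs))
        hess = -sum(p * (1.0 - p) for p in probs)
        if abs(hess) < 1e-12:
            break
        step = grad / hess
        theta -= step
        # Clamp: the MLE diverges when all responses are 0 or all are 1.
        theta = max(-6.0, min(6.0, theta))
        if abs(step) < 1e-8:
            break
    return theta


def select_near_ability(difficulties, theta, k):
    """Indices of the k samples whose difficulty is closest to theta.

    Assumed matching rule for illustration; the paper's matching
    function may differ.
    """
    order = sorted(range(len(difficulties)),
                   key=lambda i: abs(difficulties[i] - theta))
    return order[:k]
```

In a training loop, one might re-estimate theta periodically from the model's recent outcomes and call `select_near_ability` again, so that the selected subset shifts as the model's estimated capability changes.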

Taxonomy

Topics: Language Development and Disorders · Topic Modeling · Domain Adaptation and Few-Shot Learning