Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Yuqing Wang; Shangding Gu

arXiv:2506.24120·cs.LG·September 30, 2025


Open Access · 3 Reviews

TL;DR

This paper shows that selecting more uniformly distributed data improves training efficiency and performance in neural networks, supported by a new convergence framework beyond the NTK regime applicable to various architectures.

Contribution

It introduces a theoretical framework for gradient descent convergence beyond NTK, linking data uniformity to training speed and accuracy, and validates findings with extensive experiments.

Findings

01

Uniform data selection accelerates training.

02

Smaller $h_{\min}$ slows down gradient descent.

03

Maximizing pairwise data distance improves performance.
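As an illustration of what selecting points to maximize pairwise distance can look like, here is a minimal sketch of greedy max-min (farthest-point) selection. This is a standard heuristic, not necessarily the authors' exact procedure; the function name `greedy_max_min_selection`, the `start` parameter, and the toy data are assumptions made for this example.

```python
import numpy as np


def greedy_max_min_selection(points: np.ndarray, k: int, start: int = 0) -> list[int]:
    """Pick k row indices by repeatedly taking the point farthest from the chosen set.

    Hypothetical helper for illustration; a standard farthest-point heuristic.
    """
    n = points.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")

    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    min_dist[start] = -np.inf  # never pick an already selected point again

    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -np.inf

    return selected


# Toy 1-D data: three points crowded near 0, plus two isolated points.
points = np.array([[0.0], [0.1], [0.2], [1.0], [5.0]])
chosen = greedy_max_min_selection(points, k=3)
print(chosen)  # [0, 4, 3]
```

Starting from index 0, the sketch skips the crowded neighbors at 0.1 and 0.2 and picks the isolated points at 5.0 and 1.0, so the selected subset has a larger minimum pairwise distance than the crowded region would give.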

Abstract

Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complicated tasks. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that a more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by $h_{\min}$, and prove that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation…
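To make the quantity $h_{\min}$ concrete, the following sketch computes the minimum pairwise Euclidean distance for two small point sets. It assumes Euclidean distance and uses a hypothetical helper name, `min_pairwise_distance`; the toy data are invented for illustration and are not from the paper.

```python
import numpy as np


def min_pairwise_distance(points: np.ndarray) -> float:
    """Return the smallest Euclidean distance between two distinct rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore zero self-distances
    return float(dists.min())


# Five evenly spaced points on [0, 1] versus five points crowded near 0.
uniform = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
clustered = np.array([[0.0], [0.01], [0.02], [0.03], [1.0]])

h_uniform = min_pairwise_distance(uniform)      # grid spacing, 0.25
h_clustered = min_pairwise_distance(clustered)  # spacing inside the cluster, about 0.01
print(h_uniform > h_clustered)  # True
```

Both sets span the same interval with the same number of points, but the evenly spaced set has the larger minimum pairwise distance, which matches the abstract's description of a more uniform sample.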

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper's key strength is its strong and practically relevant empirical result: that a small, uniformly-selected data subset can fine-tune LLMs significantly faster while matching the performance of the full dataset. Moreover, the paper is theoretically ambitious (I'm not sure if the results do actually imply what the authors claim, see weaknesses), tackling the important problem of data selection by attempting to build a convergence framework for non-linear architectures.

Weaknesses

**Presentation**: The paper is very densely written, making the theoretical arguments difficult to read and understand. The overall presentation could be significantly improved for clarity.

**Beyond NTK Claim**: The 'beyond NTK' claim is not fully convincing. In standard NTK analysis, a PL-like inequality is proven where the constant is the minimum eigenvalue of the kernel at initialization. This paper seems to follow a similar structure, proving a PL-like inequality (Figure 2) where the PL-con

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper provides a new perspective on how data uniformity helps with training, justified with theoretical analysis. The effectiveness of the proposed approach is validated through empirical results.
2. The proposed Poly-smoothness condition aligns better with neural networks used in practice, compared to standard Lipschitzness. This might be helpful for future analysis of deep neural networks.

Weaknesses

For the theoretical part:

1. It is unclear why the minimum pairwise distance $h_{\min}$ is a good characterization of data uniformity. Specifically, when the data distribution is fixed, $h_{\min}$ will decrease as the sample size increases. This means that the convergence speed in Theorem 2 becomes slower with more samples and becomes $0$ when the sample size tends to infinity. Is this an intended behaviour? What if we consider infinitely many data points sampled from a continuous distribution (
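The sample-size effect raised in this point can be checked numerically. The sketch below is an illustration added here, not part of the review: it draws points uniformly from the unit square (an assumed distribution) and evaluates the minimum pairwise distance on nested prefixes of one sample. Because each larger prefix contains the smaller one, the values cannot increase. The helper name `min_pairwise_distance` and the chosen sizes are hypothetical.

```python
import numpy as np


def min_pairwise_distance(points: np.ndarray) -> float:
    """Return the smallest Euclidean distance between two distinct rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore zero self-distances
    return float(dists.min())


rng = np.random.default_rng(0)
sample = rng.uniform(size=(400, 2))  # fixed distribution: uniform on [0, 1]^2

# Nested prefixes: every larger subset contains all pairs of the smaller one,
# so the minimum over pairs can only stay the same or shrink.
sizes = [10, 50, 200, 400]
h_values = [min_pairwise_distance(sample[:n]) for n in sizes]

for n, h in zip(sizes, h_values):
    print(f"n = {n:4d}   h_min = {h:.4f}")
```

The printed values are non-increasing in the sample size, which is the behaviour the reviewer describes for a fixed data distribution.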

Reviewer 03 · Rating 4 · Confidence 2

Strengths

* The paper tackles data selection, an increasingly important topic for efficient LLM training.
* Comprehensive theoretical analysis that provides a general convergence result beyond NTK assumptions and links data geometry to dynamics and approximation error.

Weaknesses

* The paper uses max-min distance sampling as the core uniformity criterion. However, pure maximum distance does not necessarily guarantee globally uniform coverage. For example, if the data contains two distant dense clusters, the greedy selection may oscillate between these clusters and ignore other regions of the space. Please correct me if this interpretation is incorrect.
* The proposed selection strategy is closely related to prior work on distance-based uniform sampling. For


Taxonomy

Topics: Economic Growth and Productivity