Influential Language Data Selection via Gradient Trajectory Pursuit
Zhiwei Deng, Tao Li, and Yang Li

TL;DR
This paper introduces Gradient Trajectory Pursuit (GTP), a novel data selection algorithm for large language models that jointly selects data points based on gradient trajectories, improving efficiency and performance over existing methods.
Contribution
The paper presents GTP, a joint data selection method using L0 regularization that enhances efficiency and deduplication in data curation for language models.
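For orientation, an L0-constrained selection problem of the kind this description suggests can be written in the standard sparse-approximation form below. This is a generic formulation and an assumption made here, not the paper's Equation 1. The symbols $g^{\ast}$ (a target gradient trajectory), $G$ (a matrix whose columns are per-sample gradient features), $w$ (selection weights), and $k$ (the selection budget) are introduced only for illustration.

```latex
\min_{w \in \mathbb{R}^{n}} \; \bigl\lVert g^{\ast} - G w \bigr\rVert_2^2
\quad \text{subject to} \quad \lVert w \rVert_0 \le k
```

Under this reading, the selected subset is the support of $w$. Independent top-k ranking would instead score each column of $G$ against $g^{\ast}$ in isolation, which is why it cannot account for redundancy between samples.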
Findings
GTP outperforms top-k selection and other algorithms in benchmarks.
GTP achieves full task performance using only 0.5% of data.
GTP is scalable with distributed computing.
Abstract
Curating a desirable dataset for training has been the core of building highly capable large language models (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2024). Gradient influence scores (Pruthi et al., 2020; Xia et al., 2024) are shown to be correlated with model performance and are commonly used as the criterion for data selection. However, existing methods are built upon either individual sample rankings or an inefficient matching process, leading to suboptimal performance or scaling issues. In this paper, we propose Gradient Trajectory Pursuit (GTP), an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective. The proposed algorithm highlights: (1) joint selection instead of independent top-k selection, which automatically de-duplicates samples; (2) higher efficiency with compressive sampling…
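The de-duplication claim in point (1) can be illustrated with a toy sketch. The code below is not the GTP algorithm. It uses plain matching pursuit, a simpler relative of the compressive-sampling pursuit of Needell & Tropp (2009), and compares it with independent top-k ranking. The function names, the two-dimensional "gradient features", and the target vector are all hypothetical and chosen only to make the behaviour easy to check by hand.

```python
# Illustrative sketch only: greedy residual pursuit vs. independent top-k.
# All names and the toy data are hypothetical, not taken from the paper.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_select(candidates, target, k):
    """Score each candidate independently against the target, keep the k best."""
    scores = [dot(c, target) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return sorted(order[:k])

def greedy_pursuit_select(candidates, target, k):
    """Matching pursuit: score candidates against the residual, not the target."""
    residual = list(target)
    chosen = []
    for _ in range(k):
        best, best_score = None, 0.0
        for i, c in enumerate(candidates):
            norm_sq = dot(c, c)
            if i in chosen or norm_sq == 0.0:
                continue
            # Residual energy this candidate would explain.
            score = dot(c, residual) ** 2 / norm_sq
            if score > best_score + 1e-12:
                best, best_score = i, score
        if best is None:  # nothing left that reduces the residual
            break
        c = candidates[best]
        step = dot(c, residual) / dot(c, c)
        residual = [r - step * x for r, x in zip(residual, c)]
        chosen.append(best)
    return sorted(chosen)

# Candidates 0 and 1 are exact duplicates; candidate 2 covers another direction.
candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [2.0, 1.0]

print(top_k_select(candidates, target, 2))           # [0, 1]
print(greedy_pursuit_select(candidates, target, 2))  # [0, 2]
```

Top-k keeps both copies of the duplicated sample because each scores highly on its own. The pursuit variant subtracts the first pick's contribution from the residual, so the duplicate no longer explains anything and the second slot goes to the complementary sample.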
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Data selection for effective model training has become a very important problem for large language models. Hence, this submission touches upon a timely issue. 2. The descriptions of the proposed solution and related work are largely clear and easy to follow. This helps readers to understand the essence of this submission.
1. It is well known that modern large language model training has several stages, e.g., pre-training and post-training. Given the submission is targeted at language model training, it is very unclear which stage the proposed method is designed for. Since different stages have distinct purposes, specific designs are needed, though their loss function might look similar (e.g., next token prediction). 2. The proposed solution is very similar to the frequently mentioned LESS algorithm (Xia et al. 20
S1: With the increasing model and dataset size, it is an important field to study effective data selection for LLMs. And the contribution of this paper seems non-trivial to the field. S2: The overall idea of performing joint selection based on gradient trajectory matching is intuitive and reasonable. S3: The empirical results based on ALFWorld and three instruction tuning evaluation sets are consistently better than the considered baselines.
W1: The writing and presentation of the paper need substantial improvement. It is very unclear how Equation 1 is derived from the main idea discussed in Section 3.1. The three algorithm blocks should also be described with further details in the content sections. In general, I find Section 3 very hard to follow. W2: The scope of the current experiments seems limited. It would be great to experiment with multiple LLMs from different model families and sizes for each baseline. W3: It would be gr
**Strengths:** Incorporating the combinatorial effect into data selection is intuitive, as a top-k selection approach may not fully capture the collaborative potential among data samples. The method demonstrates considerable performance improvement over the vanilla LESS method on the ALFWorld dataset.
**Weaknesses:** As noted in subsection 3.3, the main distinction from Xia et al. (2024) lies in the departure from the top-k selection paradigm by incorporating ideas from Needell & Tropp (2009), while most other components remain unchanged. The computational time analysis in subsection 3.3 is unclear. There is no analysis of computational complexity, nor are there empirical runtime results or GPU memory cost reports. Additionally, PCA may present a computational challenge in large-scale exper
-- This type of dataset selection technique is generally important and timely. -- The paper is clearly written (despite various small typos), and has enough detail that what's described should be more-or-less reproducible. -- The results appear fairly impressive, though they seem fairly modest compared to the strongest baseline. -- Reasonably strong methodology paper, though the idea is ultimately fairly simple and described in pretty clear terms.
-- The contribution compared to LESS seems not so large, and the performance improvements are also fairly incremental (though they are still significant) -- Experiments seem overall somewhat less thorough (e.g. in terms of models being compared) than in prior work (such as LESS), though the authors do have some explanations for this -- Hard to totally make sense of some of the experimental results in a few places
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Text and Document Classification Technologies
