Effective Data Pruning through Score Extrapolation
Sebastian Schmidt, Prasanga Dhungel, Christoffer Löffler, Björn Nieth, Stephan Günnemann, Leo Schwinn

TL;DR
This paper introduces an importance score extrapolation framework that predicts sample importance for an entire dataset from scores computed on a small subset, enabling data pruning without a full initial training pass. The framework applies across multiple datasets and training paradigms.
Contribution
The paper proposes a new importance score extrapolation method using k-nearest neighbors and graph neural networks to reduce training costs in data pruning.
Findings
Effective across multiple datasets and training paradigms
Reduces computational costs of data pruning techniques
Applicable to state-of-the-art pruning methods
Abstract
Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs. To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance. Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs. To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. We demonstrate the effectiveness of our approach for 2 state-of-the-art pruning methods (Dynamic Uncertainty and…
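The k-nearest-neighbor variant of score extrapolation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_extrapolate_scores`, the inverse-distance weighting, and the Euclidean metric are all assumptions made for the example. It takes embeddings and importance scores for the small scored subset and predicts a score for every remaining sample as a weighted average over its nearest scored neighbors.

```python
import numpy as np


def knn_extrapolate_scores(emb_scored, scores, emb_unscored, k=5, eps=1e-8):
    """Predict importance scores for unscored samples from their k nearest scored neighbors.

    Hypothetical helper: inverse-distance weighting and Euclidean distance
    are illustrative choices, not taken from the paper.
    """
    emb_scored = np.asarray(emb_scored, dtype=float)
    emb_unscored = np.asarray(emb_unscored, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Pairwise Euclidean distances, shape (n_unscored, n_scored).
    diffs = emb_unscored[:, None, :] - emb_scored[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # Never ask for more neighbors than there are scored samples.
    k = min(k, emb_scored.shape[0])

    # Indices of the k closest scored samples for each unscored sample.
    nn_idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)

    # Closer neighbors get larger weights; each row of weights sums to 1.
    weights = 1.0 / (nn_dist + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbors' known scores.
    return (weights * scores[nn_idx]).sum(axis=1)
```

Because the weights are non-negative and sum to one, every predicted score in this sketch lies between the smallest and largest scores of the neighbors used. The predicted scores would then be passed to whatever selection rule the chosen pruning method applies.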
Peer Reviews
Decision · Submitted to ICLR 2026
1. Using the Extrapolation method to refine the sample ranking is rather new.
1. The critical weakness is the overlooked training cost. The method requires:
   1) First training: Train on random 10-20% subset to compute initial scores.
   2) Embedding + Extrapolation: Extract features and extrapolate scores for remaining 80-90%.
   3) Second training: Train final model on the extrapolated-pruned subset.
   1.1 This means training happens TWICE on similar-sized subsets, plus embedding the full dataset.
   1.2 Did the authors account for the cost of BOTH training phases? The paper clai
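The cost question raised in this review can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only: the function names are hypothetical, and it assumes that training cost scales linearly with the number of samples, that both training phases run for the same number of epochs, and that the embedding cost is supplied by the caller as a fraction of one full-data training run. All costs are expressed in units of one training run on the full dataset.

```python
def _check_fraction(name, value):
    # Fractions of the dataset must lie in [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")


def relative_pipeline_cost(subset_frac, keep_frac, embed_cost=0.0):
    """Cost of the three-step extrapolation pipeline (hypothetical helper).

    subset_frac: fraction of the data used in the first training run.
    keep_frac:   fraction of the data kept for the final training run.
    embed_cost:  cost of embedding the full dataset (an assumed input).
    """
    _check_fraction("subset_frac", subset_frac)
    _check_fraction("keep_frac", keep_frac)
    # First training + embedding/extrapolation + second training.
    return subset_frac + embed_cost + keep_frac


def relative_baseline_cost(keep_frac):
    """Cost of pruning with a full initial training pass, then the final run."""
    _check_fraction("keep_frac", keep_frac)
    return 1.0 + keep_frac


# Illustrative numbers only; none of these values come from the paper.
pipeline = relative_pipeline_cost(subset_frac=0.2, keep_frac=0.5, embed_cost=0.05)
baseline = relative_baseline_cost(keep_frac=0.5)
print(f"pipeline: {pipeline:.2f}, baseline: {baseline:.2f}")
```

Under these assumptions the pipeline only saves compute relative to training once on the full dataset when `subset_frac + embed_cost + keep_frac` is below 1, which is the accounting the reviewer asks the authors to report.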
- The paper tackles a practically significant and under-addressed problem: how to make computationally expensive data pruning methods tractable for large-scale training by requiring only a small subset for direct score computation.
- The proposed score extrapolation framework is methodologically interesting and is instantiated with both a simple, transparent KNN approach and a more expressive, message-passing-based GNN, allowing for a clear analysis of trade-offs.
- Empirical validation is thoro
1. **Limited theoretical justification and over-reliance on local linearity assumptions:** The primary mathematical support for extrapolation is drawn from influence function and local linearity arguments (Section 3). Yet, there is insufficient theoretical development or empirical diagnosis regarding the validity of these assumptions for highly nonlinear, high-dimensional representation spaces found in deep learning. As such, generalizability of the approach to broader architectures/tasks re
In general, I found this work interesting: it tackles a less explored direction in data pruning and can indeed bring valuable computational gains.
- The notion of extrapolating importance scores from a small subset is simple but original (to my knowledge), and it provides a new angle on making pruning efficient.
- The KNN and GNN approaches effectively demonstrate that extrapolation can cut computation time with little performance loss and form a good proof of concept.
- The work evaluates multi
Overall, the paper presents an interesting idea, though I think it could be strengthened with some further development and analysis.
- I did not find the theoretical justification very convincing. It seems that the main point is about the smooth interpolation of the samples' influence that the authors use as a justification for their extrapolated scores (Eq. 6). But in the context of influence the extrapolated point is itself a convex interpolation of the reference points, and the weights are the
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)
Methods: Pruning
