Effective Data Pruning through Score Extrapolation

Sebastian Schmidt; Prasanga Dhungel; Christoffer Löffler; Björn Nieth; Stephan Günnemann; Leo Schwinn

arXiv:2506.09010 · cs.LG · June 23, 2025

TL;DR

This paper introduces a novel importance score extrapolation framework that predicts sample importance using minimal data, enabling efficient data pruning without full initial training, applicable across various datasets and training paradigms.

Contribution

The paper proposes a new importance score extrapolation method using k-nearest neighbors and graph neural networks to reduce training costs in data pruning.

Findings

01. Effective across multiple datasets and training paradigms

02. Reduces computational costs of data pruning techniques

03. Applicable to state-of-the-art pruning methods

Abstract

Training advanced machine learning models demands massive datasets, resulting in prohibitive computational costs. To address this challenge, data pruning techniques identify and remove redundant training samples while preserving model performance. Yet, existing pruning techniques predominantly require a full initial training pass to identify removable samples, negating any efficiency benefits for single training runs. To overcome this limitation, we introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework - k-nearest neighbors and graph neural networks - to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. We demonstrate the effectiveness of our approach for two state-of-the-art pruning methods (Dynamic Uncertainty and…
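The k-nearest-neighbor variant of score extrapolation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the inverse-distance weighting, the Euclidean metric, and the rule of keeping the highest-scoring samples are all assumptions made for illustration. It presumes that embeddings are available for every sample and that importance scores were computed only for a small subset.

```python
import numpy as np


def knn_extrapolate_scores(emb_scored, scores, emb_unscored, k=5, eps=1e-8):
    """Predict scores for unscored samples from their k nearest scored neighbors.

    Assumption: inverse-distance weighting over Euclidean distances.
    """
    # Pairwise Euclidean distances, shape (n_unscored, n_scored).
    diff = emb_unscored[:, None, :] - emb_scored[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Never ask for more neighbors than there are scored samples.
    k = min(k, emb_scored.shape[0])

    # Indices and distances of the k nearest scored samples per row.
    nn_idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Normalized inverse-distance weights: closer neighbors count more.
    weights = 1.0 / (nn_dist + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    return (weights * scores[nn_idx]).sum(axis=1)


def prune_by_score(scores, keep_fraction):
    """Return indices of the highest-scoring samples (hypothetical keep rule)."""
    n_keep = int(round(keep_fraction * len(scores)))
    return np.argsort(scores)[::-1][:n_keep]


# Synthetic data standing in for real embeddings and importance scores.
rng = np.random.default_rng(0)
emb_small = rng.normal(size=(20, 8))      # subset that was actually trained on
scores_small = rng.uniform(size=20)       # scores computed on that subset
emb_rest = rng.normal(size=(100, 8))      # remaining, unscored samples

scores_rest = knn_extrapolate_scores(emb_small, scores_small, emb_rest, k=5)
keep_idx = prune_by_score(scores_rest, keep_fraction=0.5)
print(scores_rest.shape, keep_idx.shape)
```

Which end of the score ranking should be kept depends on the underlying pruning method, so `prune_by_score` would need to be adapted accordingly.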

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Using the Extrapolation method to refine the sample ranking is rather new.

Weaknesses

1. The critical weakness is the overlooked training cost. The method requires:
   1) First training: Train on random 10-20% subset to compute initial scores.
   2) Embedding + Extrapolation: Extract features and extrapolate scores for remaining 80-90%.
   3) Second training: Train final model on the extrapolated-pruned subset.
   1.1 This means training happens TWICE on similar-sized subsets, plus embedding the full dataset.
   1.2 Did the authors account for the cost of BOTH training phases? The paper clai
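The reviewer's cost accounting can be made concrete with a back-of-the-envelope sketch. It assumes that training cost scales linearly with the number of samples at a fixed epoch count and expresses everything in units of one full-dataset training run. The function names and the example fractions (a 20% scoring subset, a 50% kept subset, zero embedding cost) are hypothetical and do not come from the paper.

```python
def extrapolated_pipeline_cost(subset_fraction, keep_fraction, embed_cost=0.0):
    """Cost of: train on a small subset, embed and extrapolate, train on the pruned set.

    All costs are in units of one full-dataset training run (assumed linear in
    dataset size). embed_cost is a free parameter for the embedding pass.
    """
    return subset_fraction + embed_cost + keep_fraction


def full_scoring_pipeline_cost(keep_fraction):
    """Cost of: one full training pass to compute scores, then train on the pruned set."""
    return 1.0 + keep_fraction


# Hypothetical setting: score on 20% of the data, keep 50% after pruning.
extrapolated = extrapolated_pipeline_cost(subset_fraction=0.2, keep_fraction=0.5)
full_scoring = full_scoring_pipeline_cost(keep_fraction=0.5)
no_pruning = 1.0

print(f"extrapolated: {extrapolated:.2f}")
print(f"full scoring: {full_scoring:.2f}")
print(f"no pruning:   {no_pruning:.2f}")
```

Under these assumptions both training phases are counted, and the comparison shifts as `embed_cost` or the kept fraction grows, which is the sensitivity the reviewer asks the authors to report.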

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper tackles a practically significant and under-addressed problem: how to make computationally expensive data pruning methods tractable for large-scale training by requiring only a small subset for direct score computation.
- The proposed score extrapolation framework is methodologically interesting and is instantiated with both a simple, transparent KNN approach and a more expressive, message-passing-based GNN, allowing for a clear analysis of trade-offs.
- Empirical validation is thoro

Weaknesses

1. **Limited theoretical justification and over-reliance on local linearity assumptions:** The primary mathematical support for extrapolation is drawn from influence function and local linearity arguments (Section 3). Yet, there is insufficient theoretical development or empirical diagnosis regarding the validity of these assumptions for highly nonlinear, high-dimensional representation spaces found in deep learning. As such, generalizability of the approach to broader architectures/tasks re

Reviewer 03 · Rating 4 · Confidence 3

Strengths

In general I found this work interesting; it tackles a less explored direction in data pruning and can indeed bring valuable computational gains.
- The notion of extrapolating importance scores from a small subset is simple but original (to my knowledge), and it provides a new angle on making pruning efficient.
- The KNN and GNN approaches effectively demonstrate that extrapolation can cut computation time with little performance loss and form a good proof of concept.
- The work evaluates multi

Weaknesses

Overall, the paper presents an interesting idea, though I think it could be strengthened with some further development and analysis.
- I did not find the theoretical justification very convincing. It seems that the main point is about the smooth interpolation of the samples' influence that the authors use as a justification for their extrapolated scores (eq 6). But in the context of influence the extrapolated point is itself a convex interpolation of the reference points, and the weights are the

Code & Models

Repositories

prasangadhungel/Data-Pruning-with-Extrapolated-Scores · PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)

Methods: Pruning