Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments
Danilo Brajovic, David A. Kreplin, Marco F. Huber

TL;DR
This paper introduces CDVM, a new data pruning method that improves model performance in low-data environments by effectively utilizing data attributions through constrained optimization.
Contribution
The paper proposes CDVM, a novel approach that enhances data pruning in low-data scenarios by optimizing data influence while controlling per-test contributions.
Findings
CDVM outperforms existing methods on the OpenDataVal benchmark.
It remains robust when only a small fraction of the data is retained.
CDVM achieves competitive runtime performance.
Abstract
Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values and then assessing the performance of a model trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal…
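To make the idea in the abstract concrete, here is a minimal sketch of why limiting per-test contributions can matter when few points are kept. It is not the authors' formulation: the abstract only states that CDVM maximizes total influence while penalizing excessive per-test contributions, so the hard cap, the greedy search, the function names, and the toy attribution matrix below are all assumptions made for illustration.

```python
# Illustrative sketch only: a hard per-test cap stands in for the paper's
# penalty term, and greedy selection stands in for its optimizer.


def capped_objective(attributions, selected, cap):
    """Sum over test points of the selected attribution mass, clipped at `cap`."""
    n_test = len(attributions[0])
    total = 0.0
    for j in range(n_test):
        contribution = sum(attributions[i][j] for i in selected)
        total += min(contribution, cap)
    return total


def greedy_capped_selection(attributions, k, cap):
    """Greedily keep k training points that maximize the capped objective."""
    selected = []
    remaining = list(range(len(attributions)))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda i: capped_objective(attributions, selected + [i], cap),
        )
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


def top_k_by_total_value(attributions, k):
    """Baseline: keep the k points with the largest summed attribution."""
    order = sorted(range(len(attributions)), key=lambda i: -sum(attributions[i]))
    return sorted(order[:k])


# Hypothetical attribution matrix: rows are training points, columns are test points.
toy = [
    [5.0, 0.0, 0.0],  # strongly helps test point 0 only
    [4.0, 0.0, 0.0],  # also helps only test point 0
    [0.0, 2.0, 0.0],  # helps test point 1
    [0.0, 0.0, 2.0],  # helps test point 2
]

baseline = top_k_by_total_value(toy, k=3)
capped = greedy_capped_selection(toy, k=3, cap=2.0)
print("top-k by total value:", baseline)
print("greedy with per-test cap:", capped)
```

In this toy case the total-value baseline keeps both points that serve the same test point and drops one test point entirely, while the capped selection keeps one supporter for each test point. Whether this matches the behavior of the actual method depends on details the abstract does not give.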
