SAVA: Scalable Learning-Agnostic Data Valuation
Samuel Kessler, Tam Le, Vu Nguyen

TL;DR
SAVA is a scalable data valuation method based on optimal transport. By processing data in batches rather than the entire dataset at once, it values large datasets efficiently without sacrificing accuracy.
Contribution
This paper proposes SAVA, a scalable variant of LAVA that applies stochastic optimal transport to batches of data for efficient large-scale data valuation.
Findings
SAVA scales to datasets with millions of points.
SAVA maintains valuation accuracy comparable to LAVA.
The paper provides a theoretical analysis of entropic regularization trade-offs.
Abstract
Selecting data for training machine learning models is crucial since large, web-scraped, real datasets contain noisy artifacts that affect the quality and relevance of individual data points. These noisy artifacts will impact model performance. We formulate this problem as a data valuation task, assigning a value to data points in the training set according to how similar or dissimilar they are to a clean and curated validation set. Recently, LAVA demonstrated the use of optimal transport (OT) between a large noisy training dataset and a clean validation set to value training data efficiently, without depending on model performance. However, the LAVA algorithm requires the entire dataset as an input, which limits its application to larger datasets. Inspired by the scalability of stochastic (gradient) approaches which carry out computations on batches of data points instead of the…
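The batch-wise idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`sinkhorn_log`, `batched_ot_values`), the squared Euclidean cost on features only (no label-aware term), the uniform weights, and the way per-batch scores are averaged with the batch-level plan are all assumptions made for illustration. Entropic OT is solved between every pair of training and validation batches, a second OT problem is solved over the batches themselves, and each training point receives a LAVA-style calibrated score from the dual potentials, where a higher score suggests a point that is further from the validation data.

```python
import numpy as np

def _logsumexp(M, axis):
    # Numerically stable log-sum-exp along one axis.
    m = M.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_log(a, b, C, eps=0.1, n_iters=200):
    """Log-domain Sinkhorn for entropic OT; returns the plan P and dual potential f."""
    f, g = np.zeros(len(a)), np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return P, f

def sq_dists(X, Y):
    # Pairwise squared Euclidean distances (assumed ground cost).
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

def batched_ot_values(X_train, X_val, batch_size=10, eps=0.1):
    n, m = len(X_train), len(X_val)
    train_batches = np.array_split(np.arange(n), int(np.ceil(n / batch_size)))
    val_batches = np.array_split(np.arange(m), int(np.ceil(m / batch_size)))
    k_t, k_v = len(train_batches), len(val_batches)

    batch_cost = np.zeros((k_t, k_v))
    grads = [[None] * k_v for _ in range(k_t)]
    # One small OT problem per (train batch, validation batch) pair: k_t * k_v in total.
    for i, ti in enumerate(train_batches):
        for j, vj in enumerate(val_batches):
            C = sq_dists(X_train[ti], X_val[vj])
            a = np.full(len(ti), 1.0 / len(ti))
            b = np.full(len(vj), 1.0 / len(vj))
            P, f = sinkhorn_log(a, b, C, eps)
            batch_cost[i, j] = (P * C).sum()
            # Calibrated score: potential of a point minus the mean potential of the others.
            grads[i][j] = f - (f.sum() - f) / max(len(f) - 1, 1)

    # Batch-level OT, using the pairwise OT costs as the ground cost between batches.
    a_b = np.array([len(t) for t in train_batches]) / n
    b_b = np.array([len(v) for v in val_batches]) / m
    P_batch, _ = sinkhorn_log(a_b, b_b, batch_cost, eps)

    # Assumed aggregation: average per-pair scores with normalized batch-level plan weights.
    values = np.zeros(n)
    for i, ti in enumerate(train_batches):
        w = P_batch[i] / P_batch[i].sum()
        values[ti] = sum(w[j] * grads[i][j] for j in range(k_v))
    return values
```

Only one batch-pair cost matrix is held in memory at a time, which is where the memory saving comes from; the price is that the number of small OT problems grows with the product of the two batch counts.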
Peer Reviews
Decision·ICLR 2025 Poster
The problem is well-contextualized and the motivation is clear. Structure of the paper is well balanced and the elaborations are coherent. It is straightforward for readers to understand the scope and target of the paper and the proposed technical approaches. The proposed method is plausible, leveraging the hierarchical OT framework to aggregate results from batch-wise OT computations and achieving favorable approximation results. Derivations are comprehensive and are paired with substantial el
I am still somewhat concerned about the computational overhead of SAVA. Even though it avoids directly solving large-scale OT problems and circumvents OOM issues, it now requires solving a quadratic number of OT problems between every pair of batches and aggregating their results. This could also take a significant amount of time if the number of batches is high. Are there results on actual time comparisons for the methods in empirical studies? The structure of the paper still has room to improve. Th
- The experimental results are convincing. The authors compared to SOTA methods for data valuation across various data corruption scenarios. The results demonstrate that SAVA is scalable to large datasets. Also, the results included a dataset of size larger than 1 million samples, in which the proposed method outperforms benchmarks. - The writing is good and easy to follow.
- The reviewer's biggest concern is related to novelty. Currently, SAVA seems a very natural extension of LAVA for data valuation on batches. The submission seems to be on the incremental side, unless the authors can clearly state the technical challenge when calculating on batches. - The choice of batch size is a key hyper-parameter in SAVA (and the key difference from LAVA). The authors are encouraged to include a formal theoretical analysis to quantify the tradeoff in choosing batch size between mem
[S1] An interesting approach leveraging the idea of batches to solve the memory bottleneck encountered in OT solver as optimizer in model training. [S2] Detailed theoretical proofs and descriptions of previous work are given. [S3] The article is well-organized and easy to read.
[W1] My biggest concern is that the proof of the upper bound does not adequately explain why this proxy can work. A detailed analysis of the upper bound and the practicability of the proxy should be provided. [W2] My second concern is that the paper lacks a time complexity analysis. Also, SAVA in Figure 2 seems to be no better than Batch-wise LAVA. In Figure 9 of the appendix, why not compare against Batch-wise LAVA on the running time metric? [W3] Typos: Line 417, "Batch-wise LAVA KNN Shapley and" -> "Batch-wise LAVA, KNN S
Taxonomy
Topics: Machine Learning in Healthcare
