Geometric Median (GM) Matching for Robust Data Pruning
Anish Acharya, Inderjit S Dhillon, Sujay Sanghavi

TL;DR
This paper introduces Geometric Median Matching, a robust data pruning method that effectively handles noisy datasets by approximating the geometric median, outperforming previous techniques especially under high corruption.
Contribution
We propose a novel greedy algorithm for robust data pruning that achieves the optimal breakdown point and scales better than uniform sampling.
Findings
Outperforms prior methods in noisy data scenarios
Achieves optimal breakdown point of 1/2 under arbitrary corruption
Shows significant improvements at high corruption and pruning rates
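For readers unfamiliar with the estimator behind these findings, the standard definition of the geometric median can be written next to that of the empirical mean. The notation below (points $x_1, \dots, x_n \in \mathbb{R}^d$, the symbol $\mu^{\mathrm{GM}}$) is assumed for illustration and may differ from the paper's own.

```latex
% Geometric median: minimizes the sum of (unsquared) Euclidean distances
\mu^{\mathrm{GM}} \;=\; \arg\min_{z \in \mathbb{R}^d} \sum_{i=1}^{n} \lVert z - x_i \rVert_2

% Empirical mean, for comparison: minimizes the sum of squared distances
\bar{x} \;=\; \arg\min_{z \in \mathbb{R}^d} \sum_{i=1}^{n} \lVert z - x_i \rVert_2^2
        \;=\; \frac{1}{n} \sum_{i=1}^{n} x_i
```

A single arbitrarily corrupted sample can move the mean arbitrarily far, whereas the geometric median stays bounded as long as fewer than half of the samples are corrupted. This is the sense in which a breakdown point of 1/2 is the best achievable.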
Abstract
Large-scale data collections in the wild are invariably noisy. Developing data pruning strategies that remain robust in the presence of corruption is therefore critical in practice. In this work, we propose Geometric Median (GM) Matching, a herding-style greedy algorithm that yields a k-subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset. Theoretically, we show that GM Matching enjoys an improved O(1/k) scaling over the O(1/√k) scaling of uniform sampling, while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across several popular deep learning benchmarks indicate that GM Matching consistently improves over the prior state of the art; the gains become more pronounced at high rates of corruption and aggressive pruning rates, making GM Matching a…
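The selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes raw coordinates stand in for learned embeddings and estimates the geometric median with Weiszfeld's iteration. It also swaps in a distance-based greedy rule, which coincides with the inner-product herding update only when all feature vectors have equal norm. The function names `dist`, `geometric_median` and `gm_matching_sketch` are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate argmin_z sum_i ||z - x_i|| with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    z = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Inverse-distance weight, clamped so a zero distance cannot divide by zero.
            w = 1.0 / max(dist(p, z), eps)
            den += w
            for d in range(dim):
                num[d] += w * p[d]
        z_new = [n / den for n in num]
        if dist(z_new, z) < eps:
            return z_new
        z = z_new
    return z


def gm_matching_sketch(points, k):
    """Greedily pick k distinct indices whose running mean tracks the geometric median.

    Assumption: a distance-based variant of herding, not the paper's exact rule.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    target = geometric_median(points)
    # theta equals (t + 1) * target minus the sum of the points chosen so far.
    theta = list(target)
    remaining = list(range(len(points)))
    chosen = []
    for _ in range(k):
        # Pick the unused point closest to theta, i.e. the point that keeps
        # the running sum closest to (t + 1) * target.
        best = min(remaining, key=lambda i: dist(points[i], theta))
        chosen.append(best)
        remaining.remove(best)
        theta = [t + g - x for t, g, x in zip(theta, target, points[best])]
    return chosen


if __name__ == "__main__":
    clean = [(float(x), float(y)) for x in (-1, 0, 1) for y in (-1, 0, 1)]
    corrupt = [(50.0, 50.0)] * 3  # 25% of the samples are gross outliers
    data = clean + corrupt
    print("geometric median:", geometric_median(data))
    print("selected indices:", gm_matching_sketch(data, 4))
```

On this toy dataset the mean of all points is dragged toward the outliers at (50, 50), while the geometric median stays inside the clean cluster and the selected indices all come from the clean points.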
Peer Reviews
Decision·Submitted to ICLR 2025
See the summarization part.
1. It is rather difficult for me to identify the innovative points of the proposed scheme in this paper compared to previous methods. For instance, the Moderate based on Geometric Median is also available in other approaches [1,2]. In the context of research progress, innovation is the key driving force. Without clear differentiating factors, it becomes challenging to justify the novelty and significance of this new scheme within the existing body of knowledge. There should be a distinct advanta
The paper is well written and easy to follow.
1. Some important references and comparisons are missing, e.g. [1] Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy (NeurIPS 2023), [2] Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection (AAAI 2024). The reported performance is not SOTA. 2. The main idea of the paper is to utilize the Geometric Median instead of the empirical mean objective in moment matching. However, as mentioned, both Geometric Median and Moment Matching hav
1. It is very meaningful to solve the problem of data pruning in the noisy label scene.
1. The coverage of related work is insufficient, which is inconsistent with the first claimed contribution. Many existing works study robust data pruning in noisy scenarios (e.g. [1] Prune4ReL, [2] FDMat). The performance of these methods is much higher than that of the proposed GM Matching. 2. The motivation of the paper is unclear and the difference from the baseline (Moderate) is not described. The contribution of the paper is unclear. 3. The paper is poorly written. The description of the formulas la
Taxonomy
Topics: Data Management and Algorithms · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
Methods: Pruning
