Geometric Median (GM) Matching for Robust Data Pruning

Anish Acharya; Inderjit S Dhillon; Sujay Sanghavi

arXiv:2406.17188·cs.LG·January 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Geometric Median Matching, a robust data pruning method that effectively handles noisy datasets by approximating the geometric median, outperforming previous techniques especially under high corruption.

Contribution

We propose a novel greedy algorithm for robust data pruning that achieves optimal breakdown point and improved scaling over uniform sampling.

Findings

01 Outperforms prior methods in noisy data scenarios

02 Achieves optimal breakdown point of 1/2 under arbitrary corruption

03 Shows significant improvements at high corruption and pruning rates

Abstract

Large-scale data collections in the wild are invariably noisy. Thus, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. In this work, we propose Geometric Median (GM) Matching, a herding-style greedy algorithm that yields a $k$-subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset. Theoretically, we show that GM Matching enjoys an improved $\mathcal{O}(1/k)$ scaling over the $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across several popular deep learning benchmarks indicate that GM Matching consistently improves over prior state-of-the-art; the gains become more profound at high rates of corruption and aggressive pruning rates, making GM Matching a…
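The selection rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes raw feature vectors with a linear inner product, a Weiszfeld iteration as the geometric-median estimator, and selection without replacement. The names `geometric_median` and `gm_matching`, and the synthetic data in the usage lines, are hypothetical.

```python
import numpy as np


def geometric_median(X, n_iter=200, eps=1e-8):
    """Approximate argmin_z sum_i ||z - x_i|| with Weiszfeld's iteration."""
    z = X.mean(axis=0)  # start from the (non-robust) mean
    for _ in range(n_iter):
        # Distances to the current estimate, floored to avoid division by zero.
        d = np.maximum(np.linalg.norm(X - z, axis=1), eps)
        w = 1.0 / d
        z_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            return z_new
        z = z_new
    return z


def gm_matching(X, k):
    """Herding-style greedy pick of k rows whose mean tracks the geometric median.

    Assumption: a standard herding update with a linear kernel, where the
    target moment is the geometric median instead of the empirical mean.
    """
    mu = geometric_median(X)
    theta = mu.copy()                      # herding direction
    available = np.ones(len(X), dtype=bool)
    chosen = []
    for _ in range(k):
        scores = X @ theta
        scores[~available] = -np.inf       # sample without replacement
        i = int(np.argmax(scores))
        chosen.append(i)
        available[i] = False
        theta = theta + mu - X[i]          # push toward the unmatched residual
    return np.array(chosen)


# Usage on synthetic data: 60% clean points near the origin, 40% gross outliers.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(600, 5))
corrupt = rng.normal(50.0, 1.0, size=(400, 5))
X = np.vstack([clean, corrupt])
subset = gm_matching(X, k=50)
print("subset mean:", X[subset].mean(axis=0))
print("geometric median:", geometric_median(X))
```

With 40% of the points corrupted, the empirical mean is dragged far toward the outliers, while the geometric median stays near the clean cluster, so a subset matched to the latter is much less affected by the corruption.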

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

See the summarization part.

Weaknesses

1. It is rather difficult for me to identify what is innovative in the proposed scheme compared to previous methods. For instance, the geometric-median-based Moderate is also available in other approaches [1,2]. In research, innovation is the key driving force. Without clear differentiating factors, it is challenging to justify the novelty and significance of this new scheme within the existing body of knowledge. There should be a distinct advanta

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper is well written and easy to follow.

Weaknesses

1. Some important references and comparisons are missing, e.g. [1] Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy (NeurIPS 2023), [2] Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection (AAAI 2024). The reported performance is not SOTA.
2. The main idea of the paper is to utilize the Geometric Median instead of the empirical mean objective in moment matching. However, as mentioned, both Geometric Median and Moment Matching hav

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. Solving the problem of data pruning in the noisy-label setting is very meaningful.

Weaknesses

1. The coverage of related work is insufficient, which is inconsistent with contribution point one. Many existing works study robust data pruning in noisy scenarios (e.g. [1] Prune4ReL, [2] FDMat). The performance of these methods is much higher than that of the proposed GM Matching.
2. The motivation of the paper is unclear, and the difference from the baseline (Moderate) is not described. The contribution of the paper is unclear.
3. The paper is poorly written. The description of the formulas la

Taxonomy

Topics: Data Management and Algorithms · Data Mining Algorithms and Applications · Advanced Database Systems and Queries

Methods: Pruning