Most Influential Subset Selection: Challenges, Promises, and Beyond

Yuzheng Hu; Pingbang Hu; Han Zhao; Jiaqi W. Ma

arXiv:2409.18153·cs.LG·January 10, 2025

TL;DR

This paper investigates the challenges of selecting influential data subsets in machine learning, revealing limitations of existing influence-based heuristics and proposing adaptive methods that better capture complex sample interactions.

Contribution

It provides a comprehensive analysis of the Most Influential Subset Selection problem, identifying failure modes of current heuristics and proposing adaptive approaches to improve influence estimation.

Findings

01. Influence-based greedy heuristics can provably fail, even in linear regression.

02. Adaptive heuristics better capture interactions among samples.

03. Experiments on real-world data confirm the theoretical insights.

Abstract

How can we attribute the behaviors of machine learning models to their training data? While the classic influence function sheds light on the impact of individual samples, it often fails to capture the more complex and pronounced collective influence of a set of samples. To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a dominant class of algorithms in MISS, can provably fail even in linear regression. We delineate the failure modes, including the errors of influence function and the non-additive structure of the collective influence. Conversely, we demonstrate that an adaptive…
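To make the distinction between a one-shot influence-based greedy heuristic and an adaptive one concrete, here is a minimal sketch for ordinary least squares. It is an illustration under stated assumptions, not the authors' implementation: the function names (`influence_scores`, `static_greedy`, `adaptive_greedy`, `actual_effect`), the choice of target (the prediction at a single test point `x_test`), and the first-order influence formula for squared loss are all choices made for this example.

```python
import numpy as np

def fit_ols(X, y):
    # Least-squares fit; returns the coefficient vector.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def influence_scores(X, y, x_test):
    # First-order influence estimate (assumed squared loss): the estimated
    # change in the prediction x_test @ theta when sample i is removed,
    #   score_i = -x_test^T (X^T X)^{-1} x_i * r_i,  r_i = y_i - x_i @ theta.
    theta = fit_ols(X, y)
    residuals = y - X @ theta
    H_inv = np.linalg.pinv(X.T @ X)
    return -(X @ H_inv @ x_test) * residuals

def static_greedy(X, y, x_test, k):
    # One-shot heuristic: score every sample once, keep the top k.
    scores = influence_scores(X, y, x_test)
    return [int(i) for i in np.argsort(-scores)[:k]]

def adaptive_greedy(X, y, x_test, k):
    # Adaptive heuristic: pick one sample, drop it, re-score the rest, repeat.
    remaining = list(range(len(y)))
    chosen = []
    for _ in range(k):
        scores = influence_scores(X[remaining], y[remaining], x_test)
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

def actual_effect(X, y, x_test, subset):
    # Ground truth: refit without `subset` and measure the prediction change.
    mask = np.ones(len(y), dtype=bool)
    mask[np.asarray(subset, dtype=int)] = False
    return float(x_test @ fit_ols(X[mask], y[mask]) - x_test @ fit_ols(X, y))

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=40)
x_test = rng.normal(size=3)

for name, pick in [("static", static_greedy), ("adaptive", adaptive_greedy)]:
    subset = pick(X, y, x_test, 5)
    print(name, sorted(subset), actual_effect(X, y, x_test, subset))
```

The two selectors agree on the first pick and can diverge afterwards, because the adaptive variant re-scores the remaining samples after each removal. Neither is guaranteed to find the truly most influential subset; comparing each against `actual_effect` shows how far the estimates are from the refit ground truth on a given dataset.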

Code & Models

Repositories

sleepymalc/miss (PyTorch, official)

Taxonomy

Topics: Evolutionary Algorithms and Applications · Machine Learning and Data Classification

Methods: Sparse Evolutionary Training