TL;DR
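This paper develops a formal framework for deciding when the influence of a small data subset on a model's conclusions is excessive rather than expected under random sampling variation, with rigorous hypothesis tests for linear least-squares and applications in economics, biology, and machine learning benchmarks.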
Contribution
It introduces a principled influence testing framework for linear models, deriving exact formulas and extreme value distributions to detect excessive influence.
Findings
Derived exact influence formulas for linear least-squares.
Identified extreme value distributions for maximal influence.
Applied framework to real-world datasets, resolving contested findings.
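The first finding can be illustrated in the simplest special case. The sketch below is not the paper's formula, which this summary does not reproduce. It assumes a single regressor with no intercept, where the change in the slope after deleting a subset S has the exact closed form sum_{i in S} x_i e_i / (sum_i x_i^2 - sum_{i in S} x_i^2), with e_i the full-sample residuals. All function names are hypothetical, and the data are simulated.

```python
import random
from itertools import combinations

def slope(x, y):
    """Least-squares slope for y = beta * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def influence_refit(x, y, subset):
    """Influence by brute force: full-sample slope minus slope after deleting `subset`."""
    keep = [i for i in range(len(x)) if i not in subset]
    return slope(x, y) - slope([x[i] for i in keep], [y[i] for i in keep])

def influence_closed_form(x, y, subset):
    """Same quantity without refitting, using only full-sample residuals."""
    beta = slope(x, y)
    sxx = sum(a * a for a in x)
    num = sum(x[i] * (y[i] - beta * x[i]) for i in subset)
    den = sxx - sum(x[i] ** 2 for i in subset)
    return num / den

def most_influential_set(x, y, k):
    """Exhaustive search over size-k subsets; only feasible for small n and k."""
    return max(combinations(range(len(x)), k),
               key=lambda s: influence_closed_form(x, y, s))

# Simulated data (assumption: Gaussian regressor and noise, true slope 0.5).
random.seed(0)
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + random.gauss(0, 1) for a in x]

# The pair of points whose removal lowers the estimated slope the most.
best = most_influential_set(x, y, 2)
print(best, influence_closed_form(x, y, best))
```

The closed form agrees with refitting up to floating-point error, which is what makes searching over many candidate subsets cheap. Whether the resulting maximum is excessive is the separate question the paper's tests address.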
Abstract
Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence: the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc…
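For reference, the two limit laws named in the abstract have the following standard forms. This is textbook extreme value theory, not a result taken from the paper. The normalising constants $a_n$, $b_n$, the tail index $\alpha$, and the symbol $M_n$ for the observed maximal influence are notation assumed here for illustration; the summary does not state how the paper defines or estimates them.

```latex
% Standard Gumbel distribution (light tails): upper tail decays exponentially.
\Lambda(x) = \exp\!\left(-e^{-x}\right), \qquad x \in \mathbb{R},
\qquad 1 - \Lambda(x) \sim e^{-x} \ \text{as } x \to \infty .

% Standard Fr\'echet distribution with tail index \alpha > 0 (heavy tails):
% upper tail decays polynomially.
\Phi_\alpha(x) =
\begin{cases}
  \exp\!\left(-x^{-\alpha}\right), & x > 0, \\
  0, & x \le 0,
\end{cases}
\qquad 1 - \Phi_\alpha(x) \sim x^{-\alpha} \ \text{as } x \to \infty .

% Generic one-sided test: if (M_n - b_n)/a_n converges in distribution to G,
% with G either \Lambda or \Phi_\alpha, an approximate p-value is
p \approx 1 - G\!\left(\frac{M_n - b_n}{a_n}\right).
```

The contrast in tail decay is why the choice of limit law matters: under a Fréchet limit, a large maximal influence is far less surprising than it would be under a Gumbel limit.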
Peer Reviews
Decision: ICLR 2026 Poster
- The problem of testing most influential subsets is novel and well-motivated.
- The proposed framework is theoretically sound.
- The proposed approach may find broad applications in scientific domains that rely on OLS for data analysis.
- The proposed framework is limited to OLS.
- The block-maxima MLE used in the proposed approach suffers from a bias.
- The theory is only applicable to the low-dimensional regime.
Minor: this work may be a better fit for statistics or economics venues than machine learning ones.
1. Novel theoretical results: I'm not aware of related literature that addresses the limiting distribution of the maximum influence for linear regression models.
2. Conceptual clarity and motivation: The identified gap is indeed an important research question to address.
1. Practical guidance: Since the theory focuses on asymptotic behavior, it is unclear how many samples are needed for the theory to apply.
2. Presentation: The clarity of the paper can be improved by carefully restructuring the sections. For instance, the presentation before Section 3.2 seems to largely follow [1], yet some parts, such as the influence function, are not necessarily used. While I won't say this rises to the level of "plagiarism", I do think some car
1. This paper tackles an important question: understanding the effect of influential data subsets on a statistical estimator.
2. The presentation is easy to understand, with theoretical contributions.
1. The motivation for p-values in influential subset testing is not clearly justified. The main distinction from Broderick et al. (2021) is that this paper adds a statistical significance test, but the paper does not convincingly demonstrate why this hypothesis-testing perspective materially improves decision-making in practice. For example, all the real data applications in this paper can be similarly done with Broderick et al. (2020). I would suggest that the authors provide why the proposed
