Testing Most Influential Sets

Lucas Darius Konrad; Nikolas Kuschnig

arXiv:2510.20372·stat.ML·March 6, 2026

3 Reviews

TL;DR

This paper develops a formal framework for deciding when the most influential data subsets affect model conclusions more than random sampling variation would explain, with hypothesis tests and applications in economics, biology, and machine learning benchmarks.

Contribution

It introduces a principled testing framework for most influential sets in linear least-squares, deriving an exact influence formula and the extreme value distributions of maximal influence to detect excessive influence.

Findings

01. Derived exact influence formulas for linear least-squares.

02. Identified extreme value distributions for maximal influence.

03. Applied framework to real-world datasets, resolving contested findings.

Abstract

Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence: the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc…
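
The abstract mentions an exact influence formula for linear least-squares but this page does not reproduce it. As a rough illustration only, the sketch below uses the standard leave-set-out identity for OLS, which gives the coefficient change from deleting a set S of rows without refitting: beta_full - beta_without_S = (X'X)^{-1} X_S' (I - H_SS)^{-1} e_S. This may differ in form from the authors' formula; the function name `influence_of_set`, the simulated data, and the chosen index set are assumptions made for the example.

```python
import numpy as np

def influence_of_set(X, y, S):
    """Return the full-sample OLS fit and the exact coefficient change
    (beta_full - beta_without_S) caused by deleting the rows in S.
    Hypothetical helper; uses the standard leave-set-out identity."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    X_S = X[S]
    e_S = y[S] - X_S @ beta            # full-fit residuals of the deleted rows
    H_SS = X_S @ XtX_inv @ X_S.T       # leverage (hat-matrix) block for S
    delta = XtX_inv @ X_S.T @ np.linalg.solve(np.eye(len(S)) - H_SS, e_S)
    return beta, delta

# Simulated data (assumption: not from the paper).
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

S = [3, 17, 42]                        # arbitrary candidate set
beta, delta = influence_of_set(X, y, S)

# Cross-check against an explicit refit without the rows in S.
X_rest, y_rest = np.delete(X, S, axis=0), np.delete(y, S)
beta_refit = np.linalg.lstsq(X_rest, y_rest, rcond=None)[0]
matches_refit = np.allclose(beta - delta, beta_refit)
```

Because the identity needs only the full-sample fit and a small |S| x |S| solve, it makes scoring many candidate sets cheap; finding the set that maximizes influence, and judging whether that maximum is excessive, are the separate problems the paper addresses.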

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The problem of testing most influential subsets is novel and well-motivated.
- The proposed framework is theoretically sound.
- The proposed approach may find broad applications in scientific domains that rely on OLS for data analysis.

Weaknesses

- The proposed framework is limited to OLS.
- The block-maxima MLE used in the proposed approach suffers from a bias.
- The theory is only applicable to the low-dimensional regime.

Minor: this work may be a better fit for statistics or economics venues than machine learning ones.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Novel theoretical results: I'm not aware of related literature that addresses the limiting distribution of the maximum influence for linear regression models.
2. Conceptual clarity and motivation: The identified gap is indeed an important research question to address.

Weaknesses

1. Practical guidance: Since the theory focuses on asymptotic behavior, it is unclear how many samples are needed to render the theory applicable.
2. Presentation: The clarity of the paper can be improved by carefully restructuring the sections. For instance, it seems like the presentation before Section 3.2 largely follows [1]. However, some parts, such as the influence function, are not necessarily used. While I won't say this is to the extent of "plagiarism", I do think some car

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. This paper tackles an important question: understanding how influential data subsets affect a statistical estimator.
2. The presentation is easy to understand, with theoretical contributions.

Weaknesses

1. The motivation for p-values in influential subset testing is not clearly justified. The main distinction from Broderick et al. (2021) is that this paper adds a statistical significance test, but the paper does not convincingly demonstrate why this hypothesis-testing perspective materially improves decision-making in practice. For example, all the real data applications in this paper could similarly be done with Broderick et al. (2020). I would suggest that the authors provide why the proposed
