Robustness Auditing for Linear Regression: To Singularity and Beyond
Ittai Rubinstein, Samuel B. Hopkins

TL;DR
This paper introduces an efficient algorithm to certify the robustness of linear regression models against sample removal, addressing limitations of previous methods and providing the first non-trivial certificates for high-dimensional econometrics datasets.
Contribution
We develop a novel, computationally efficient algorithm for certifying linear regression robustness to sample removal, applicable to large, high-dimensional datasets, with tight bounds under certain assumptions.
Findings
Algorithm successfully certifies robustness on datasets with hundreds of dimensions.
First non-trivial robustness certificates achieved for datasets of dimension 4 or higher.
Bounds are tight up to a 1 + o(1) factor under distributional assumptions.
Abstract
It has recently been discovered that the conclusions of many highly influential econometrics studies can be overturned by removing a very small fraction of their samples (often less than ). These conclusions are typically based on the results of one or more Ordinary Least Squares (OLS) regressions, raising the question: given a dataset, can we certify the robustness of an OLS fit on this dataset to the removal of a given number of samples? Brute-force techniques quickly break down even on small datasets. Existing approaches which go beyond brute force either can only find candidate small subsets to remove (but cannot certify their non-existence) [BGM20, KZC21], are computationally intractable beyond low dimensional settings [MR22], or require very strong assumptions on the data distribution and too many samples to give reasonable bounds in practice [BP21, FH23]. We present an…
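To make the certification question concrete, the brute-force baseline mentioned above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function names (`ols_coef`, `n_refits`, `brute_force_sign_flip`) are hypothetical, and "overturning a conclusion" is assumed here to mean flipping the sign of one chosen OLS coefficient.

```python
from itertools import combinations
from math import comb

import numpy as np


def ols_coef(X, y):
    """Least-squares coefficients for y ~ X (X must already hold any intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def n_refits(n, k):
    """Number of OLS refits brute force needs to check all removals of 1..k rows."""
    return sum(comb(n, size) for size in range(1, k + 1))


def brute_force_sign_flip(X, y, k, coord=0):
    """Search every subset of at most k rows (assumed criterion: sign flip of beta[coord]).

    Returns the first subset whose removal flips the sign, or None if no such
    subset exists, in which case the exhaustive search itself is the certificate.
    """
    n = X.shape[0]
    base_sign = np.sign(ols_coef(X, y)[coord])
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            keep = np.ones(n, dtype=bool)
            keep[list(subset)] = False  # drop the candidate rows
            if np.sign(ols_coef(X[keep], y[keep])[coord]) != base_sign:
                return subset
    return None


if __name__ == "__main__":
    # Toy data: four points with slope -1 and one high-leverage point that
    # pulls the no-intercept fit to a positive slope.
    X = np.array([[1.0], [1.0], [1.0], [1.0], [10.0]])
    y = np.array([-1.0, -1.0, -1.0, -1.0, 10.0])
    print(brute_force_sign_flip(X, y, k=1))
    # Refit count for a dataset of 10^4 samples and only 5 removals.
    print(n_refits(10_000, 5))
```

The refit count grows as a sum of binomial coefficients in the number of samples and the removal budget, which is why exhaustive search stops being practical well before the dataset sizes discussed in the reviews below.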
Peer Reviews
Decision: ICLR 2025 Poster
This paper is well written and proposes a nice solution to a simple fundamental problem. Due to the challenging nature of this problem, their results still hinge on a reduction to an NP-complete problem. However, they are able to give performance bounds on their algorithm's tightness (both theoretically and experimentally), and crucially their algorithm improves over the naive approach (trying all subsets of size $k$) enough to provide meaningful audits of how robust the results from real-life l
There is, of course, a gap between the theoretical bounds on $\frac{U_k}{L_k}$ and its behavior in practice (probably due to the nature of the assumptions needed to prove bounds on this quantity). Additionally, this algorithm is probably still not feasible for very large values of $k$. However, I do view this last weakness to be quite thoroughly addressed with the comprehensive empirical evaluation provided on real regressions.
1) The paper addresses an interesting problem of auditing in linear regression.
2) Rigorous theoretical analysis using interesting proof techniques like bounds via maximal subset sum norm (MSN) or knapsack-style dynamic programming.
3) The authors have tried experiments on real datasets.
1) Line 22: Instead of saying "hundreds of dimensions", the authors should specify the exact values of $d$ and $n$. Later in the paper, the authors specify that they assume $n \gg d$. Hence, saying "hundreds of dimensions" sounds slightly misleading.
2) Line 135: The authors claim no assumptions whatsoever on X and Y. Aren't you assuming well-behaved data in Definition 1, and also i.i.d. samples? Also, see line 159 about assumptions.
3) Line 253: Authors claim distributions with heavier tails than sub-exponential can
* The authors provide an implementation in their anonymous repository. Their algorithm runs on consumer-grade machines for datasets of size on the order of $n = 10^{4}$.
* The main text provides an excellent introduction to ACRE, explaining the idea behind its definition. The connection to MSN (Maximal Subset-sum Norm) is clearly explained.
* The clarity of the paper is a central issue. In Section 3, the intuition behind the algorithm's definition, and the setting and assumptions under which the result is obtained, are not stated precisely. The connection between theoretical guarantees and practical performance needs to be better explained. For example, Theorems 1.2 and 1.3 could be tested by varying the values of $n, d$ used for Table 1.
* The process of obtaining the values in Table 1 ne
Taxonomy
Topics: Advanced Statistical Methods and Models
