Approximations to worst-case data dropping: unmasking failure modes
Jenny Y. Huang, David R. Burt, Yunyi Shen, Tin D. Nguyen, and Tamara Broderick

TL;DR
This paper investigates how well various approximation methods detect worst-case data dropping, that is, whether removing a small fraction of data points could alter a study's conclusions. It shows that many of these methods fail and highlights a simple recursive greedy algorithm as a reliable and efficient alternative.
Contribution
The paper demonstrates the limitations of existing approximation methods for detecting non-robustness to data dropping, and identifies a simple recursive greedy algorithm as a more reliable and faster alternative.
Findings
Many approximation methods fail to detect true non-robustness, even for OLS linear regression.
A simple recursive greedy algorithm is the only method that does not fail any of the paper's tests.
The greedy algorithm is significantly faster than competing methods.
Abstract
A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Checking this non-robustness directly poses a combinatorial optimization problem and is intractable even for simple models and moderate data sizes. Recently, various authors have proposed a diverse set of approximations to detect this non-robustness. In the present work, we show that, even in a setting as simple as ordinary least squares (OLS) linear regression, many of these approximations can fail to detect (true) non-robustness in realistic data arrangements. We focus on OLS in the present work due to its widespread use and since some approximations work only for OLS. Across our synthetic and real-world data sets, we find that a simple recursive greedy algorithm is the sole algorithm that does not fail any of our tests and also that it…
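To make the idea of a recursive greedy search concrete, here is a minimal Python sketch. It is not the authors' implementation: this page does not give the algorithm's details, so the sketch assumes the "conclusion" is the sign of a single OLS coefficient, and that each greedy step drops the one point whose removal moves that coefficient furthest toward a sign flip, refitting exactly after every drop. The names `ols_coef` and `greedy_drop` and the toy data are hypothetical.

```python
import numpy as np


def ols_coef(X, y):
    """Return the OLS coefficient vector for design matrix X and response y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def greedy_drop(X, y, coef_index, max_drop):
    """Illustrative greedy search (an assumption, not the paper's code).

    Drops one point at a time, refitting after each drop, and tries to
    flip the sign of coefficient `coef_index`. Stops early once the sign
    flips. Returns (dropped_indices, final_coefficient).
    """
    keep = np.arange(len(y))
    dropped = []
    sign = np.sign(ols_coef(X, y)[coef_index])
    for _ in range(max_drop):
        best_val, best_pos = None, None
        # Exact leave-one-out refit for every remaining point.
        for pos in range(len(keep)):
            idx = np.delete(keep, pos)
            val = sign * ols_coef(X[idx], y[idx])[coef_index]
            if best_val is None or val < best_val:
                best_val, best_pos = val, pos
        dropped.append(int(keep[best_pos]))
        keep = np.delete(keep, best_pos)
        if best_val < 0:  # the coefficient has changed sign
            break
    return dropped, float(ols_coef(X[keep], y[keep])[coef_index])


# Toy data, made up for illustration: 20 points on the line y = -x,
# plus one high-leverage point at (10, 10) that drives the slope positive.
x = np.append(np.linspace(-1.0, 1.0, 20), 10.0)
y = np.append(-np.linspace(-1.0, 1.0, 20), 10.0)
X = np.column_stack([np.ones_like(x), x])

dropped, slope = greedy_drop(X, y, coef_index=1, max_drop=3)
print(dropped, slope)
```

In this sketch each greedy step costs one refit per remaining point, so dropping k of n points takes on the order of k·n refits, whereas an exhaustive check would have to consider every one of the C(n, k) subsets.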
Taxonomy
Topics: Software Reliability and Analysis Research · Software System Performance and Reliability · Simulation Techniques and Applications
