"Influence Sketching": Finding Influential Samples In Large-Scale Regressions
Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, and Caleb Crable

TL;DR
This paper introduces influence sketching, a scalable algorithm that identifies influential samples in large-scale regressions, improving data inspection for model robustness and vulnerability detection.
Contribution
The paper develops influence sketching, a novel scalable method embedding random projections into influence computation for large high-dimensional datasets.
Findings
Influence sketching accurately identifies influential samples in large datasets.
Removing influential samples significantly decreases model accuracy.
Influence sketching uncovers previously unidentified malware samples.
Abstract
There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized…
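As a rough illustration of the idea, the sketch below computes classical Cook's distance for an ordinary least-squares fit, then repeats the computation after a Gaussian random projection of the feature matrix. It is a simplification under stated assumptions, not the paper's algorithm: the abstract describes scoring a randomly projected pseudo-dataset taken from a post-convergence fit, whereas this sketch simply refits a plain linear model on the projected features. The function names, the sketch size `k`, and the `seed` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def cooks_distance(X, y):
    """Classical Cook's distance for ordinary least squares.

    Assumes X (n x p) has full column rank and n > p.
    """
    n, p = X.shape
    # Thin QR: the hat-matrix diagonal (leverage) is the squared row norm of Q.
    Q, R = np.linalg.qr(X)
    leverage = np.sum(Q ** 2, axis=1)
    beta = np.linalg.solve(R, Q.T @ y)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    # D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
    return resid ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2


def sketched_cooks_distance(X, y, k, seed=0):
    """Cook's distance computed on a k-dimensional random projection of X.

    Illustrative assumption: a dense Gaussian projection followed by a refit
    on the projected features (not necessarily what the paper does).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    omega = rng.standard_normal((p, k)) / np.sqrt(k)  # p x k projection matrix
    return cooks_distance(X @ omega, y)
```

With `k` much smaller than the number of features, the leverage computation works on an n x k matrix rather than an n x p one, which is where the savings would come from in this simplified setting. Samples can then be ranked for inspection by sorting the returned scores in descending order.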
