"Influence Sketching": Finding Influential Samples In Large-Scale Regressions

Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, and Caleb Crable

arXiv:1611.05923·stat.ML·May 11, 2017

TL;DR

This paper introduces influence sketching, a scalable algorithm for identifying influential samples in large-scale regressions, so that manual inspection can be prioritized toward samples such as important mislabeled data or vulnerabilities exploitable by an adversarial attack.

Contribution

The paper develops influence sketching, a scalable version of Cook's distance that embeds random projections within the influence computation so that it can be applied to very large, high-dimensional datasets.

Findings

01. Influence sketching accurately identifies influential samples in large datasets.

02. Removing influential samples significantly decreases model accuracy.

03. Influence sketching uncovers previously unidentified malware samples.

Abstract

There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized…
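
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is cut off before the details, so this example assumes ordinary least squares rather than a generalized linear model, a dense Gaussian random projection, and synthetic data. The function names `cooks_distance` and `sketched_cooks_distance` are hypothetical. The sketch computes Cook's distance once with leverage taken from the full design matrix and once with leverage taken from a randomly projected copy of it, reusing the residuals of the fitted model in both cases.

```python
import numpy as np


def cooks_distance(design, residuals):
    """Cook's distance per sample for a least-squares fit.

    design    : (n, d) matrix used to compute leverage
    residuals : (n,) residuals y - y_hat from the fitted model
    """
    n, d = design.shape
    # Leverage h_i is the i-th diagonal entry of the hat matrix
    # design @ inv(design.T @ design) @ design.T. With a thin QR
    # factorization design = Q R, it equals the squared norm of row i of Q.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q**2, axis=1)
    sigma2 = residuals @ residuals / (n - d)  # residual variance estimate
    return residuals**2 * leverage / (d * sigma2 * (1.0 - leverage) ** 2)


def sketched_cooks_distance(X, residuals, k, rng):
    """Same score, with leverage taken from a k-dimensional random projection of X."""
    p = X.shape[1]
    # Dense Gaussian projection (an assumption; other projections are possible).
    omega = rng.normal(size=(p, k)) / np.sqrt(k)
    return cooks_distance(X @ omega, residuals)


# Synthetic data (made up for illustration): 200 samples, 50 features.
rng = np.random.default_rng(0)
n, p, k = 200, 50, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)

# Plant one influential sample: extreme features and a shifted response.
X[0] *= 10.0
y[0] = X[0] @ beta_true + 100.0

# Fit once on the full data; both scores reuse these residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

exact = cooks_distance(X, residuals)
sketched = sketched_cooks_distance(X, residuals, k, rng)

print("top sample (exact):   ", int(np.argmax(exact)))
print("top sample (sketched):", int(np.argmax(sketched)))
```

Because the projected design has `k` columns instead of `p`, the normalizing constants of the two scores differ, so the scores are best compared by ranking rather than by magnitude. The potential saving is that leverage is computed from an n-by-k matrix instead of an n-by-p one.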
