Contamination Estimation via Convex Relaxations

Matthew L. Malloy; Scott Alfeld; Paul Barford

arXiv:1506.04257·cs.IT·June 16, 2015

TL;DR

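This paper introduces a method based on convex relaxations for estimating the level of contamination in large, discrete-valued datasets. It finds the minimum number of data points that must be discarded for the remaining data to match a specified model to within a goodness-of-fit controlled by a p-value, and it gives a theoretical guarantee on how fast the resulting lower bound converges.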

Contribution

The authors show that a lower bound on the level of contamination can be obtained by solving a series of convex programs, and they prove a convergence rate for that bound.

Findings

01

Solving a series of convex programs yields a lower bound on the level of contamination.

02

The lower bound converges at a rate of O(√(log(p)/p)).

03

The method targets large, discrete-valued datasets.

Abstract

Identifying anomalies and contamination in datasets is important in a wide variety of settings. In this paper, we describe a new technique for estimating contamination in large, discrete valued datasets. Our approach considers the normal condition of the data to be specified by a model consisting of a set of distributions. Our key contribution is in our approach to contamination estimation. Specifically, we develop a technique that identifies the minimum number of data points that must be discarded (i.e., the level of contamination) from an empirical data set in order to match the model to within a specified goodness-of-fit, controlled by a p-value. Appealing to results from large deviations theory, we show a lower bound on the level of contamination is obtained by solving a series of convex programs. Theoretical results guarantee the bound converges at a rate of $O(\sqrt{\log(p)/p})$ ,…
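
The idea of discarding as little data as possible until the remainder fits a model can be illustrated with a small sketch. It is not the authors' algorithm, and it rests on several simplifying assumptions: the model is a single known distribution `q` rather than a set of distributions, goodness-of-fit is a plain KL-divergence threshold `eps` standing in for the p-value-controlled criterion, and retained counts may be fractional (a continuous relaxation). All function names are hypothetical. For a fixed retained mass, the KL-minimizing retained counts have the capped form `min(n_i, c * q_i)`, and the fit can only improve as `c` shrinks, so a bisection on `c` finds the smallest removal that meets the threshold.

```python
import math


def kl_divergence(counts, q):
    """KL divergence between the normalized counts and the distribution q."""
    total = sum(counts)
    return sum((m / total) * math.log((m / total) / qi)
               for m, qi in zip(counts, q) if m > 0)


def retained_counts(n, q, c):
    """Best-fitting retained counts at scale c: follow c*q, capped by what is available."""
    return [min(ni, c * qi) for ni, qi in zip(n, q)]


def contamination_estimate(n, q, eps, iters=100):
    """Smallest fraction of data to discard so that KL(retained || q) <= eps.

    Assumptions (illustrative only): every q_i > 0, sum(n) > 0, and eps is a
    user-chosen KL threshold rather than one derived from a p-value.
    """
    p = sum(n)  # size of the empirical data set
    if kl_divergence(n, q) <= eps:
        return 0.0  # the data already fit; nothing needs to be discarded
    lo, hi = 0.0, max(ni / qi for ni, qi in zip(n, q))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_divergence(retained_counts(n, q, mid), q) <= eps:
            lo = mid  # fit is acceptable, try retaining more data
        else:
            hi = mid  # fit is too poor, retain less data
    kept = sum(retained_counts(n, q, lo))
    return 1.0 - kept / p


if __name__ == "__main__":
    # 100 observations over three symbols, compared against a uniform model.
    print(contamination_estimate([90, 5, 5], [1 / 3, 1 / 3, 1 / 3], eps=1e-9))
```

In the usage example the first symbol is heavily over-represented relative to the uniform model, so almost all of its excess must be discarded before the remaining data match the model closely. Loosening `eps` lowers the estimate, mirroring how a more permissive goodness-of-fit requires discarding fewer points.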
