Partial identification of kernel based two sample tests with mismeasured data

Ron Nafshi; Maggie Makar

arXiv:2308.03570 · stat.ML · August 8, 2023

TL;DR

This paper addresses the challenge of estimating the Maximum Mean Discrepancy (MMD) between two distributions when data is contaminated, proposing a partial identification approach with bounds that outperform existing methods.

Contribution

It introduces a novel partial identification method for MMD under contamination, providing sharp bounds and a consistent estimation procedure with faster convergence.

Findings

01. The proposed bounds accurately contain the true MMD in contaminated data scenarios.

02. The estimation method converges faster than alternative approaches as sample size increases.

03. Empirical results show the method produces tight bounds with low false coverage rate.

Abstract

Nonparametric two-sample tests such as the Maximum Mean Discrepancy (MMD) are often used to detect differences between two distributions in machine learning applications. However, the majority of the existing literature assumes that error-free samples from the two distributions of interest are available. We relax this assumption and study the estimation of the MMD under $ϵ$-contamination, where a possibly non-random $ϵ$ proportion of one distribution is erroneously grouped with the other. We show that under $ϵ$-contamination, the typical estimate of the MMD is unreliable. Instead, we study partial identification of the MMD, and characterize sharp upper and lower bounds that contain the true, unknown MMD. We propose a method to estimate these bounds, and show that it gives estimates that converge to the sharpest possible bounds on the MMD as sample size increases, with a…
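To make the unreliability claim concrete, the following is a minimal sketch, not the authors' procedure, of how a standard plug-in MMD estimate can shift when part of one sample is mislabeled. Everything specific here is an assumption chosen for illustration: the Gaussian (RBF) kernel with bandwidth 1, the two normal distributions, the sample size, the helper names `rbf_kernel` and `mmd2_biased`, and the contamination mechanism, which moves a random fraction `eps` of the second sample into the first. The abstract also allows non-random contamination, which this sketch does not model, and it does not compute the paper's bounds.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # Gaussian kernel matrix between two 1-D samples (bandwidth is an assumed value).
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    # Biased (V-statistic) plug-in estimate of the squared MMD.
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
n, eps = 2000, 0.2              # illustrative sample size and contamination level
x = rng.normal(0.0, 1.0, n)     # clean sample from the first distribution
y = rng.normal(1.0, 1.0, n)     # clean sample from the second distribution

# Assumed contamination: an eps fraction of the second sample is
# erroneously grouped with the first sample.
k = int(eps * n)
x_obs = np.concatenate([x, y[:k]])
y_obs = y[k:]

mmd2_clean = mmd2_biased(x, y)
mmd2_contaminated = mmd2_biased(x_obs, y_obs)

print(f"squared MMD, clean samples:        {mmd2_clean:.4f}")
print(f"squared MMD, contaminated samples: {mmd2_contaminated:.4f}")
```

In this particular setup the contaminated estimate comes out smaller than the clean one, because the observed first sample is a mixture that already contains points from the second distribution. Under non-random contamination the direction of the error need not be the same, which is the motivation for bounding the MMD rather than reporting a single point estimate.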

Taxonomy

Topics: Machine Learning and Data Classification · Statistical Methods and Inference · Machine Learning and Algorithms