Similarity of samples and trimming

Pedro C. Álvarez-Esteban; Eustasio del Barrio; Juan A. Cuesta-Albertos; Carlos Matrán

arXiv:1205.1950·math.ST·May 10, 2012

TL;DR

This paper treats two probability distributions as similar at level $α$ when both are contaminated versions, up to an $α$ fraction, of a common distribution. It studies how trimming affects the empirical versions of this model and proposes a bootstrap method for assessing similarity from two data samples.

Contribution

It establishes a connection between similarity of probabilities and minimal distances between trimmed probability sets, and develops a bootstrap approach for empirical similarity assessment.
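The connection can be written out as a short sketch. The notation below ($P_0$, $P_i'$, $\mathcal{R}_\alpha$, $d_{TV}$) is assumed for illustration and is not quoted from the paper; the equivalences follow from the contamination definition by an elementary argument, for $0 \le \alpha < 1$.

```latex
% Contamination model: both probabilities share a common component P_0
\[
P_i = (1-\alpha)\,P_0 + \alpha\,P_i', \qquad i = 1, 2 .
\]
% Set of alpha-trimmed versions of a probability P
\[
\mathcal{R}_\alpha(P) = \Bigl\{ R : R \ll P,\ \frac{dR}{dP} \le \frac{1}{1-\alpha}\ \ P\text{-a.s.} \Bigr\} .
\]
% P_0 is a common trimming of P_1 and P_2, and conversely
\[
P_1, P_2 \text{ similar at level } \alpha
\iff \mathcal{R}_\alpha(P_1) \cap \mathcal{R}_\alpha(P_2) \neq \emptyset
\iff d_{TV}(P_1, P_2) \le \alpha ,
\qquad d_{TV}(P_1, P_2) = \sup_A \lvert P_1(A) - P_2(A) \rvert .
\]
```

In words: the common component $P_0$ is itself a trimmed version of both $P_1$ and $P_2$, so similarity at level $α$ amounts to the two trimming sets sharing an element, that is, to a minimal distance of zero between them.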

Findings

01. Empirical versions show an overfitting effect when trimming goes beyond the similarity level.

02. In that case the trimmed samples are closer to each other than expected.

03. Combined with a bootstrap approach, this effect can be used to assess similarity from two data samples.

Abstract

We say that two probabilities are similar at level $α$ if they are contaminated versions (up to an $α$ fraction) of the same common probability. We show how this model is related to minimal distances between sets of trimmed probabilities. Empirical versions turn out to present an overfitting effect in the sense that trimming beyond the similarity level results in trimmed samples that are closer than expected to each other. We show how this can be combined with a bootstrap approach to assess similarity from two data samples.
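For distributions on a finite set, the definition of similarity at level $α$ can be checked directly. The following is a minimal sketch, not the paper's procedure: the function names `overlap_mass`, `similar_at_level`, and `common_component` are hypothetical, and the code relies on the elementary fact that a common component of mass $1-α$ exists exactly when the pointwise minimum of the two probability vectors has total mass at least $1-α$.

```python
def overlap_mass(p, q):
    """Total mass of the pointwise minimum of two discrete distributions.

    p and q are dicts mapping outcomes to probabilities (assumed to sum to 1).
    """
    support = set(p) | set(q)
    return sum(min(p.get(x, 0.0), q.get(x, 0.0)) for x in support)


def similar_at_level(p, q, alpha, tol=1e-12):
    """True if p and q are contaminated versions (up to an alpha fraction)
    of one common distribution, for 0 <= alpha < 1."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return overlap_mass(p, q) >= 1.0 - alpha - tol


def common_component(p, q, alpha, tol=1e-12):
    """Return one common distribution r with (1 - alpha) * r <= min(p, q)
    pointwise, or None if p and q are not similar at level alpha."""
    if not similar_at_level(p, q, alpha, tol):
        return None
    m = overlap_mass(p, q)  # m >= 1 - alpha > 0 here
    support = set(p) | set(q)
    # Normalising min(p, q) gives a probability; scaling by (1 - alpha) <= m
    # keeps it below min(p, q) at every point.
    return {
        x: min(p.get(x, 0.0), q.get(x, 0.0)) / m
        for x in support
        if min(p.get(x, 0.0), q.get(x, 0.0)) > 0.0
    }


# Illustrative (made-up) distributions sharing half of their mass.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.5, "c": 0.5}
print(similar_at_level(p, q, 0.5))   # shared mass 0.5 meets 1 - 0.5
print(similar_at_level(p, q, 0.4))   # shared mass 0.5 is below 1 - 0.4
print(common_component(p, q, 0.5))
```

This only covers known distributions with finite support. Working from samples, as the abstract describes, requires the trimming and bootstrap machinery developed in the paper, which this sketch does not attempt to reproduce.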
