Understanding collections of related datasets using dependent MMD coresets

Sinead A. Williamson; Jette Henderson

arXiv:2006.14621·stat.ME·August 6, 2021

TL;DR

This paper introduces dependent MMD coresets, a novel data summarization method that enables effective comparison of multiple related datasets and provides insights into dataset differences and model generalization.

Contribution

The paper proposes dependent MMD coresets, a new approach for summarizing and comparing collections of related datasets to better understand their differences and implications for model performance.

Findings

01 Dependent MMD coresets facilitate dataset comparison.

02 They help identify under-represented sub-populations.

03 The method improves understanding of model generalization.

Abstract

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.
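To make the abstract's single-dataset starting point concrete, the sketch below scores how well a small weighted set of representative points summarizes a dataset, using the standard biased estimate of squared MMD. It is not the paper's dependent coreset algorithm. The Gaussian kernel, the lengthscale, the toy data, and the names `rbf_kernel` and `mmd_squared` are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats.

    The kernel family and lengthscale are illustrative assumptions.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def mmd_squared(data, points, weights=None, lengthscale=1.0):
    """Biased (V-statistic) estimate of squared MMD between the empirical
    distribution of `data` and the weighted distribution over `points`.

    `weights` should be non-negative and sum to one; uniform if omitted.
    """
    n, m = len(data), len(points)
    if weights is None:
        weights = [1.0 / m] * m

    # Mean kernel value within the dataset.
    k_dd = sum(rbf_kernel(x, y, lengthscale) for x in data for y in data) / n ** 2
    # Weighted mean kernel value within the candidate summary.
    k_pp = sum(
        wi * wj * rbf_kernel(pi, pj, lengthscale)
        for pi, wi in zip(points, weights)
        for pj, wj in zip(points, weights)
    )
    # Weighted mean kernel value between dataset and summary.
    k_dp = sum(
        wj * rbf_kernel(x, pj, lengthscale)
        for x in data
        for pj, wj in zip(points, weights)
    ) / n
    return k_dd + k_pp - 2.0 * k_dp


# Toy one-dimensional dataset (hypothetical values).
data = [[0.0], [1.0], [2.0], [3.0]]
spread_summary = [[0.5], [2.5]]     # covers both ends of the data
clustered_summary = [[0.0], [0.5]]  # covers only the left end

print(mmd_squared(data, spread_summary))
print(mmd_squared(data, clustered_summary))
```

A lower value means the weighted points are a closer match to the dataset under the chosen kernel, so on this toy data the spread-out summary scores lower than the clustered one. Comparing several datasets through a shared set of points, as the paper proposes, would need a separate weight vector per dataset, which this sketch does not attempt.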

Code & Models

Repositories

sinead/dmmd (Official)
