A Kernel Method for the Two-Sample Problem

Arthur Gretton; Karsten Borgwardt; Malte J. Rasch; Bernhard Schölkopf; Alexander J. Smola

arXiv:0805.2368 · cs.LG · May 16, 2008 · 249 cites

Open Access

TL;DR

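This paper introduces a kernel-based statistical test for deciding whether two samples are drawn from different distributions. The test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and it applies to structured data such as graphs as well as database attributes.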

Contribution

It presents a two-sample testing framework built on an RKHS test statistic, with quadratic-time and linear-time algorithms and applications to attribute matching in databases and distribution comparison over graphs.

Findings

01. Effective in attribute matching for databases

02. First tests for distribution comparison over graphs

03. Performs well with quadratic and linear time algorithms

Abstract

We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where…
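The quadratic-time statistic and its linear-time approximation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel with a hypothetical bandwidth `sigma`, the standard unbiased estimator of the squared statistic, and a linear-time variant that averages over disjoint pairs of consecutive points. The function names are illustrative and are not taken from the authors' code, and no acceptance threshold is computed here.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel on equal-length sequences of floats (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_quadratic(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate of the squared statistic; quadratic in the sample sizes."""
    m, n = len(xs), len(ys)
    # Within-sample terms skip i == j so the estimate stays unbiased.
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def mmd2_linear(xs, ys, kernel=gaussian_kernel):
    """Linear-time estimate: each point is used in exactly one pair."""
    pairs = min(len(xs), len(ys)) // 2
    total = 0.0
    for i in range(pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
    return total / pairs


def sample(mean, size, rng):
    """Draw `size` one-dimensional points from a unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0)] for _ in range(size)]


rng = random.Random(0)
# Two samples from the same distribution: the estimate should be near zero.
same = mmd2_quadratic(sample(0.0, 200, rng), sample(0.0, 200, rng))
# Two samples with shifted means: the estimate should be clearly positive.
diff = mmd2_quadratic(sample(0.0, 200, rng), sample(3.0, 200, rng))
print(same, diff)
```

The linear-time variant trades statistical efficiency for speed, since it evaluates the kernel on only a handful of point pairs per step rather than on every pair.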

Taxonomy

Topics: Data Stream Mining Techniques · Bayesian Modeling and Causal Inference · Machine Learning and Algorithms