Variable Selection for Kernel Two-Sample Tests

Jie Wang; Santanu S. Dey; Yao Xie

arXiv:2302.07415 · stat.ML · December 23, 2024

Open Access

TL;DR

This paper introduces a new kernel-based variable selection method for two-sample tests that improves detection power in high-dimensional data by selecting the most informative variables.

Contribution

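The paper proposes a framework that selects a fixed-size subset of variables maximizing the variance-regularized kernel MMD statistic. It formulates the selection as mixed-integer programs, solves them with exact and approximation algorithms that carry performance guarantees, derives the rate of testing power, and reports superior empirical performance.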

Findings

01

The required sample size depends on the selected variables, not on the data dimension.

02

Method outperforms existing variable selection techniques.

03

Effective for high-dimensional two-sample testing.

Abstract

We consider the variable selection problem for two-sample tests, aiming to select the most informative variables to determine whether two collections of samples follow the same distribution. To address this, we propose a novel framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a subset of variables with a pre-specified size that maximizes the variance-regularized kernel MMD statistic. We focus on three commonly used types of kernels: linear, quadratic, and Gaussian. From a computational perspective, we derive mixed-integer programming formulations and propose exact and approximation algorithms with performance guarantees to solve these formulations. From a statistical viewpoint, we derive the rate of testing power of our framework under appropriate conditions. These results show that the sample size requirements for the three kernels depend crucially on the…
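To make the selection objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes equal sample sizes, a Gaussian kernel whose squared bandwidth equals the subset size, and a ratio-style objective (an unbiased MMD estimate divided by the square root of a variance estimate plus a small constant `lam`); the paper's exact variance regularization may differ. The function names `gaussian_gram`, `regularized_mmd`, and `select_variables` are hypothetical, and the exhaustive search over subsets stands in for the mixed-integer programming formulations, so it is only practical for small dimensions.

```python
from itertools import combinations

import numpy as np


def gaussian_gram(a, b, bandwidth_sq):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth_sq))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth_sq))


def regularized_mmd(x, y, subset, lam=1e-8):
    """Illustrative objective: MMD^2 estimate / sqrt(variance estimate + lam).

    Assumption: this ratio form is a common choice in the MMD literature and is
    used here only as a stand-in for the variance-regularized statistic.
    """
    assert x.shape == y.shape, "this sketch assumes equal sample sizes"
    idx = list(subset)
    xs, ys = x[:, idx], y[:, idx]
    n = xs.shape[0]
    bw = float(len(idx))  # assumed bandwidth choice, not taken from the paper
    kxy = gaussian_gram(xs, ys, bw)
    # H[i, j] = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    h = gaussian_gram(xs, xs, bw) + gaussian_gram(ys, ys, bw) - kxy - kxy.T
    np.fill_diagonal(h, 0.0)  # drop i == j terms for the unbiased estimate
    mmd2 = h.sum() / (n * (n - 1))
    var = 4.0 / n**3 * np.sum(h.sum(axis=1) ** 2) - 4.0 / n**4 * h.sum() ** 2
    return mmd2 / np.sqrt(var + lam)


def select_variables(x, y, k, lam=1e-8):
    """Exhaustively search all size-k subsets and return the best one as a tuple."""
    dim = x.shape[1]
    return max(combinations(range(dim), k), key=lambda s: regularized_mmd(x, y, s, lam))
```

A typical call would pass two sample matrices of shape `(n, D)` and a subset size, for example `select_variables(x, y, 2)`, and then run the two-sample test on the returned columns only.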

Taxonomy

Topics: Statistical Methods and Inference · Machine Learning and Data Classification · Machine Learning and Algorithms