On the Use of Random Forest for Two-Sample Testing

Simon Hediger; Loris Michel; Jeffrey Näf

arXiv:1903.06287·stat.ME·May 7, 2021·Comput. Stat. Data Anal.


TL;DR

This paper introduces two-sample tests based on the Random Forest classifier. The tests are easy to implement, require almost no tuning, and use the forest's built-in variable importance measure to indicate which variables drive the difference in distribution. The paper develops an asymptotic power analysis and illustrates the method on two real-world applications.

Contribution

The paper develops Random Forest-based two-sample tests, analyzes their asymptotic power, and provides an implementation in the hypoRF R package.

Findings

01

The proposed tests are easy to use and require almost no tuning.

02

They are applicable to any distribution on $\mathbb{R}^{d}$.

03

Two real-world applications illustrate the usefulness of the methodology.

Abstract

Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable to any distribution on $\mathbb{R}^{d}$. Furthermore, the built-in variable importance measure of the Random Forest gives potential insights into which variables account for the difference in distribution. An asymptotic power analysis for the proposed tests is developed. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R package "hypoRF" is provided.
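
The classification-based idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the hypoRF implementation and makes several assumptions: to stay dependency-free it swaps the Random Forest for a simple nearest-centroid classifier, it assumes both samples have the same size, and the function names (`classifier_two_sample_test`, `gaussian_sample`) are hypothetical. The logic is the generic one: label the two samples 0 and 1, train a classifier on part of the data, and check with a one-sided binomial test whether its held-out error is below the chance level of 1/2.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def classifier_two_sample_test(sample_x, sample_y, rng):
    """Return (held-out error rate, p-value) for H0: same distribution.

    Assumption: both samples have the same size, so chance error is 1/2.
    """
    xs, ys = list(sample_x), list(sample_y)
    rng.shuffle(xs)
    rng.shuffle(ys)
    half_x, half_y = len(xs) // 2, len(ys) // 2

    # "Train" the stand-in classifier on the first half of each sample.
    cx, cy = centroid(xs[:half_x]), centroid(ys[:half_y])

    # Evaluate on the held-out halves (label 0 = from X, label 1 = from Y).
    held_out = [(p, 0) for p in xs[half_x:]] + [(p, 1) for p in ys[half_y:]]
    errors = 0
    for point, label in held_out:
        predicted = 0 if sq_dist(point, cx) <= sq_dist(point, cy) else 1
        errors += predicted != label

    # One-sided binomial test: P(Binomial(n, 1/2) <= errors).
    n = len(held_out)
    p_value = sum(math.comb(n, k) for k in range(errors + 1)) / 2 ** n
    return errors / n, p_value


def gaussian_sample(n, dim, shift, rng):
    """n points in R^dim with independent N(shift, 1) coordinates."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = gaussian_sample(200, 2, 0.0, rng)
    y_shifted = gaussian_sample(200, 2, 1.0, rng)
    y_same = gaussian_sample(200, 2, 0.0, rng)
    print("shifted:", classifier_two_sample_test(x, y_shifted, rng))
    print("same:   ", classifier_two_sample_test(x, y_same, rng))
```

A small p-value indicates that the classifier separates the two samples better than chance, which is evidence against equality of the distributions. The sketch does not cover variable importance, which in the paper comes from the Random Forest itself.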


Code & Models

Repositories

hedigers/RandomForestTest
Official
