A regression approach to the two-dataset problem

Steven N. MacEachern; Koji Miyawaki

arXiv:1911.00204·stat.ME·September 27, 2022



TL;DR

This paper introduces regression models to address the two-dataset problem, enabling analysis of data heterogeneity from different sources and improving the validity of conclusions drawn from combined datasets.

Contribution

The paper develops regression models that focus on differences in measurement between two datasets, and proposes two prediction errors for evaluating the underlying data collection process and the heterogeneity or similarity of the predictors.

Findings

01 Effective in distinguishing dataset differences

02 Applicable to real-world data from diverse sources

03 Enhances validity of combined data analysis

Abstract

This paper considers the two-dataset problem, where data are collected from two potentially different populations sharing common aspects. This problem arises when data are collected by two different types of researchers or from two different sources. We may reach invalid conclusions without using knowledge about the data collection process. To address this problem, this paper develops statistical regression models focusing on the difference in measurement and proposes two prediction errors that help to evaluate the underlying data collection process. As a consequence, it is possible to discuss the heterogeneity/similarity of the set of predictors in terms of prediction. Two real datasets are selected to illustrate our method.
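To make the idea of comparing datasets "in terms of prediction" concrete, the following is a minimal sketch and not the authors' models or their two prediction errors, which the abstract does not specify. It assumes simulated data, ordinary least squares, and mean squared error; the names `fit_ols`, `mse`, `within_error`, and `cross_error` are hypothetical. A regression is fit on dataset A, and its error on A is compared with its error on dataset B, whose response is measured with a shifted intercept.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients; an intercept column is prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def mse(beta, X, y):
    """Mean squared prediction error of coefficients beta on (X, y)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - X1 @ beta) ** 2))


# Simulated data (an assumption for illustration, not from the paper).
rng = np.random.default_rng(0)
n = 200
slopes = np.array([2.0, -1.0])
X_a = rng.normal(size=(n, 2))
X_b = rng.normal(size=(n, 2))
y_a = 1.0 + X_a @ slopes + rng.normal(scale=0.5, size=n)
# Dataset B shares the slopes but has a shifted intercept,
# mimicking a difference in how the response was measured.
y_b = 2.0 + X_b @ slopes + rng.normal(scale=0.5, size=n)

beta_a = fit_ols(X_a, y_a)
within_error = mse(beta_a, X_a, y_a)  # in-sample error on A (optimistic)
cross_error = mse(beta_a, X_b, y_b)   # error when A's fit predicts B

print(f"within-dataset MSE: {within_error:.3f}")
print(f"cross-dataset MSE:  {cross_error:.3f}")
```

A cross-dataset error well above the within-dataset error suggests that the two sources differ in a way that matters for prediction. In practice the within-dataset error would be estimated on held-out data rather than in-sample.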


Taxonomy

Topics: Data Mining Algorithms and Applications · Statistical Methods and Inference · Bayesian Methods and Mixture Models