Optimal Sample Splitting for Observational Studies

Qishuo Yin; Dylan S. Small

arXiv:2601.22782·stat.ME·February 2, 2026

Open Access

TL;DR

This paper introduces a data-driven method for choosing what fraction of the data to place in the planning sample versus the analysis sample of an observational study, with the aim of reducing sensitivity to bias from unmeasured confounders.

Contribution

It develops a data-driven approach that uses plasmode datasets to find the optimal planning-sample fraction, replacing the previously ad hoc choice of split.
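
The summary does not describe how the plasmode datasets are constructed. As a rough illustration of the general idea (keep the real covariates and treatment assignments, simulate outcomes with a known effect), a minimal Python sketch might look like the following. The function name `make_plasmode`, the linear outcome model, and the stand-in data are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def make_plasmode(covariates, treatment, effect, rng):
    """Resample observed (covariate, treatment) rows and attach simulated outcomes."""
    n = covariates.shape[0]
    idx = rng.integers(0, n, size=n)      # bootstrap rows of the observed data
    x, z = covariates[idx], treatment[idx]
    # Hypothetical outcome model: linear in the covariates plus a known
    # treatment effect, so the truth is available when evaluating a design.
    beta = np.ones(x.shape[1])
    y = x @ beta + effect * z + rng.normal(size=n)
    return x, z, y

rng = np.random.default_rng(0)
covariates = rng.normal(size=(200, 3))    # stand-in for real covariates
treatment = rng.integers(0, 2, size=200)  # stand-in for real treatment labels
x, z, y = make_plasmode(covariates, treatment, effect=0.5, rng=rng)
```

Because the covariate and treatment structure is inherited from the observed data while the effect is set by the analyst, candidate designs can be compared against a known truth.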

Findings

01. Method performs well in high-dimensional outcome spaces.

02. Application to second-hand smoke exposure demonstrates practical utility.

03. Optimal split improves bias assessment accuracy.

Abstract

In observational studies of treatment effects, estimates may be biased by unmeasured confounders, which can potentially affect the validity of the results. Understanding sensitivity to such biases helps assess how unmeasured confounding impacts credibility. The design of an observational study strongly influences its sensitivity to bias. Previous work has shown that the sensitivity to bias can be reduced by dividing a dataset into a planning sample and a larger analysis sample, where the planning sample guides design decisions. But the choice of what fraction of the data to put in the planning sample vs. the analysis sample was ad hoc. Here, we develop an approach to find the optimal fraction using plasmode datasets. We show that our method works well in high-dimensional outcome spaces. We apply our method to study the effects of exposure to second-hand smoke in children. The…
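
To make the planning/analysis trade-off concrete, here is a minimal Python sketch of searching over candidate split fractions by simulation. It rests on several assumptions not taken from the paper: the data are matched-pair outcome differences, the planning sample is used only to pick one outcome out of many, the analysis sample runs a plain one-sided test (with no sensitivity analysis), and the datasets are purely synthetic rather than plasmode. All function names and parameter values are hypothetical.

```python
import numpy as np

def split_pairs(n_pairs, fraction, rng):
    """Randomly assign matched pairs to the planning and analysis samples."""
    perm = rng.permutation(n_pairs)
    n_plan = int(round(fraction * n_pairs))
    return perm[:n_plan], perm[n_plan:]

def one_trial(diffs, fraction, rng, crit=1.645):
    """Pick an outcome on the planning sample, then test it on the analysis sample."""
    plan, analysis = split_pairs(diffs.shape[0], fraction, rng)
    # Planning step: choose the outcome with the largest standardized mean difference.
    p = diffs[plan]
    k = int(np.argmax(p.mean(axis=0) / p.std(axis=0, ddof=1)))
    # Analysis step: one-sided test on the chosen outcome only.
    a = diffs[analysis, k]
    t = a.mean() / (a.std(ddof=1) / np.sqrt(a.size))
    return t > crit

def estimate_power(fraction, n_pairs=200, n_outcomes=20, n_sims=200, seed=0):
    """Estimate how often the split design detects the effect at this fraction."""
    rng = np.random.default_rng(seed)
    effects = np.zeros(n_outcomes)
    effects[0] = 0.3  # hypothetical: only outcome 0 is truly affected
    hits = 0
    for _ in range(n_sims):
        diffs = rng.normal(loc=effects, size=(n_pairs, n_outcomes))
        hits += one_trial(diffs, fraction, rng)
    return hits / n_sims

fractions = [0.1, 0.2, 0.3, 0.5]
power = {f: estimate_power(f) for f in fractions}
best_fraction = max(power, key=power.get)
```

A larger planning sample makes it more likely that the right outcome is chosen, but it leaves fewer pairs for the confirmatory test; the loop over `fractions` is one simple way to look for the balance point under these assumptions.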

Taxonomy

Topics: Advanced Causal Inference Techniques · Data Analysis with R · Statistical Methods and Bayesian Inference