Stratified Sampling for Model-Assisted Estimation with Surrogate Outcomes

Reagan Mozer; Nicole E. Pashley; Luke Miratrix

arXiv:2602.12992·stat.ME·February 16, 2026

Open Access

TL;DR

This paper introduces a stratified sampling design for model-assisted estimation with surrogate outcomes. It improves efficiency by allocating human coding effort strategically when outcomes are costly to measure.

Contribution

The paper extends existing model-assisted estimation by incorporating stratified sampling, and derives the variance of the resulting estimator along with optimal allocation rules. The efficiency gains are most pronounced when surrogate error is structured.

Findings

01  Stratification improves efficiency with structured surrogate error.

02  Optimal allocation oversamples strata with larger residual variance.

03  Method performs well in simulations and real applications.
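
To make finding 02 concrete, here is a minimal Python sketch of the classical Neyman allocation applied to surrogate residuals, where the number of human-coded units in stratum h is proportional to N_h * S_h (stratum size times residual standard deviation). This page does not reproduce the paper's exact allocation rule, so the textbook form is an assumption, and the function name `neyman_allocation` and the example numbers are hypothetical.

```python
import numpy as np

def neyman_allocation(stratum_sizes, residual_sds, n_total):
    """Split n_total human-coded units across strata in proportion to N_h * S_h.

    Textbook Neyman allocation, used here as a stand-in for the paper's
    "Neyman-type" rule (assumption). Returns real-valued targets; rounding
    and capping at the stratum size are left to the caller.
    """
    sizes = np.asarray(stratum_sizes, dtype=float)
    sds = np.asarray(residual_sds, dtype=float)
    weights = sizes * sds
    if weights.sum() <= 0:
        # All residual SDs are zero: fall back to proportional allocation.
        weights = sizes
    return n_total * weights / weights.sum()

# Hypothetical example: three strata whose surrogate error grows across strata.
alloc = neyman_allocation([500, 300, 200], [1.0, 2.0, 4.0], n_total=190)
print(alloc)
```

In this illustration the smallest stratum receives the largest share of the coding budget because its residual standard deviation is highest, which is the oversampling behavior the finding describes.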

Abstract

In many randomized trials, outcomes such as essays or open-ended responses must be manually scored as a preliminary step to impact analysis, a process that is costly and limiting. Model-assisted estimation offers a way to combine surrogate outcomes generated by machine learning or large language models with a human-coded subset, yet typical implementations use simple random sampling and therefore overlook systematic variation in surrogate prediction error. We extend this framework by incorporating stratified sampling to more efficiently allocate human coding effort. We derive the exact variance of the stratified model-assisted estimator, characterize conditions under which stratification improves precision, and identify a Neyman-type optimal allocation rule that oversamples strata with larger residual variance. We evaluate our methods through a comprehensive simulation study to assess…
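
The combination of surrogate outcomes with a human-coded subset described in the abstract can be illustrated with a stratified difference estimator: the mean surrogate score over all units, plus a stratum-weighted mean of the residuals (human score minus surrogate) from the coded sample. This is a sketch of the standard survey-sampling form for a single group mean, not necessarily the estimator derived in the paper. The function name, argument layout, and toy data are hypothetical.

```python
import numpy as np

def stratified_difference_estimate(surrogate, strata, coded_idx, human_scores):
    """Estimate a mean outcome from surrogates plus a stratified coded sample.

    surrogate    : surrogate scores for all N units
    strata       : stratum label for each of the N units
    coded_idx    : indices of the units that were human coded
    human_scores : human scores, aligned with coded_idx
    Assumes simple random sampling within each stratum (assumption).
    """
    surrogate = np.asarray(surrogate, dtype=float)
    strata = np.asarray(strata)
    coded_idx = np.asarray(coded_idx)
    residuals = np.asarray(human_scores, dtype=float) - surrogate[coded_idx]
    coded_strata = strata[coded_idx]

    correction = 0.0
    for h in np.unique(strata):
        in_h = coded_strata == h
        if not in_h.any():
            raise ValueError(f"stratum {h!r} has no human-coded units")
        stratum_share = np.mean(strata == h)  # N_h / N
        correction += stratum_share * residuals[in_h].mean()
    return surrogate.mean() + correction

# Toy data: four units in two strata, one coded unit per stratum.
est = stratified_difference_estimate(
    surrogate=[1.0, 2.0, 3.0, 4.0],
    strata=["a", "a", "b", "b"],
    coded_idx=[0, 2],
    human_scores=[1.5, 2.0],
)
print(est)
```

For an impact analysis in a randomized trial, one plausible use is to compute this estimate separately within each treatment arm and take the difference. That usage is an assumption rather than something stated on this page.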

Taxonomy

Topics: Advanced Causal Inference Techniques · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI