Optimal Data Collection for Randomized Control Trials
Pedro Carneiro, Sokbae Lee, Daniel Wilhelm

TL;DR
This paper introduces a method to optimize data collection in randomized control trials by using pre-experimental data to select sample size and covariates, reducing costs and improving estimator precision.
Contribution
It proposes a simple, tuning-free algorithm that leverages pre-experimental data to minimize mean squared error under budget constraints in RCTs.
Findings
Up to 58% reduction in data collection costs in two empirical applications.
Significant improvements in treatment effect estimator precision.
Applicable to large sets of potential covariates.
Abstract
In a randomized control trial, the precision of an average treatment effect estimator can be improved either by collecting data on additional individuals, or by collecting additional covariates that predict the outcome variable. We propose the use of pre-experimental data such as a census, or a household survey, to inform the choice of both the sample size and the covariates to be collected. Our procedure seeks to minimize the resulting average treatment effect estimator's mean squared error, subject to the researcher's budget constraint. We rely on a modification of an orthogonal greedy algorithm that is conceptually simple and easy to implement in the presence of a large number of potential covariates, and does not require any tuning parameters. In two empirical applications, we show that our procedure can lead to substantial gains of up to 58%, measured either in terms of reductions…
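The trade-off described in the abstract (more individuals versus more covariates under a fixed budget) can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a simple linear per-individual cost (a hypothetical `base_cost` plus a hypothetical per-covariate cost) and approximates the estimator's mean squared error by the residual variance of the outcome divided by the sample size. The function name `greedy_design` and all parameters are illustrative only.

```python
import numpy as np

def greedy_design(X, y, cost_per_cov, base_cost, budget):
    """Sketch: choose covariates and sample size from pre-experimental data.

    Assumptions (not from the paper): surveying one individual costs
    base_cost + sum of the selected covariates' costs, and the MSE of the
    treatment effect estimator is proportional to residual variance / n.
    Returns (selected_indices, n, mse_proxy), or None if the budget
    cannot cover a single individual.
    """
    Xc = X - X.mean(axis=0)          # center covariates
    yc = y - y.mean()                # center outcome
    p = X.shape[1]

    def evaluate(sel, resid):
        unit_cost = base_cost + sum(cost_per_cov[j] for j in sel)
        n = int(budget // unit_cost)  # largest affordable sample size
        if n < 1:
            return None
        return list(sel), n, float(resid.var()) / n

    selected, resid = [], yc.copy()
    best = evaluate(selected, resid)  # baseline: collect no covariates
    remaining = set(range(p))
    while remaining:
        # Greedy step: covariate most correlated with the current residual.
        j_star = max(
            remaining,
            key=lambda j: abs(Xc[:, j] @ resid) / (np.linalg.norm(Xc[:, j]) + 1e-12),
        )
        selected.append(j_star)
        remaining.remove(j_star)
        # Orthogonal step: refit least squares on all selected covariates.
        beta, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        resid = yc - Xc[:, selected] @ beta
        cand = evaluate(selected, resid)
        if cand is None:              # covariate set no longer affordable
            break
        if best is None or cand[2] < best[2]:
            best = cand
    return best
```

Under these assumptions, a covariate is worth collecting only when the drop in residual variance outweighs the loss in affordable sample size, so a strongly predictive covariate is kept while uninformative ones are dropped.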
Taxonomy
Topics: Advanced Causal Inference Techniques · Statistical Methods and Inference · Statistical Methods and Bayesian Inference
