Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy

TL;DR
This paper introduces a minimax-optimal data collection strategy for estimating population and group means from multiple sources with different costs, optimizing the effective sample size within a fixed budget.
Contribution
It develops a novel sampling plan based on chi-squared divergence that maximizes effective sample size and proves its minimax optimality for population mean estimation.
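As a rough illustration of why a chi-squared divergence ties in with effective sample size, here is a minimal sketch. It is not the paper's sampling plan: it assumes the standard importance-weighting heuristic, effective sample size ≈ n / (1 + χ²(target ‖ pooled sampling distribution)), over discrete groups. The function names, costs, and group distributions are hypothetical and chosen only for the example.

```python
import numpy as np


def chi2_divergence(target, mixture):
    """Chi-squared divergence chi^2(target || mixture) over discrete groups.

    Assumes `mixture` is strictly positive wherever `target` is positive.
    """
    target = np.asarray(target, dtype=float)
    mixture = np.asarray(mixture, dtype=float)
    return float(np.sum(target ** 2 / mixture) - 1.0)


def effective_sample_size(budget, costs, source_dists, target, spend_fractions):
    """Heuristic ESS n / (1 + chi^2) for one way of splitting the budget.

    This formula is an assumption of this sketch, not taken from the paper.
    """
    costs = np.asarray(costs, dtype=float)
    source_dists = np.asarray(source_dists, dtype=float)
    spend = np.asarray(spend_fractions, dtype=float)
    counts = budget * spend / costs        # samples bought from each source
    n = counts.sum()
    mixture = counts @ source_dists / n    # pooled group distribution
    return float(n / (1.0 + chi2_divergence(target, mixture)))


# Hypothetical setup: two groups, two sources.
target = [0.5, 0.5]                        # target group proportions
source_dists = [[0.5, 0.5], [0.8, 0.2]]    # source A matches the target, B is biased
costs = [2.0, 1.0]                         # A costs twice as much per sample
budget = 100.0

# Spend everything on the matching source vs. everything on the cheap biased one.
ess_match = effective_sample_size(budget, costs, source_dists, target, [1.0, 0.0])
ess_cheap = effective_sample_size(budget, costs, source_dists, target, [0.0, 1.0])

# Coarse grid search over the fraction of budget spent on source B.
grid = np.linspace(0.0, 1.0, 101)
ess_grid = [
    effective_sample_size(budget, costs, source_dists, target, [1.0 - f, f])
    for f in grid
]
best_fraction = float(grid[int(np.argmax(ess_grid))])

print(ess_match, ess_cheap, best_fraction, max(ess_grid))
```

In this toy setup, spending the whole budget on the source that matches the target yields a lower heuristic effective sample size than buying only from the cheaper, biased source, and a mixed allocation does slightly better still. This is consistent with the abstract's point that matching the target distribution can be a poor strategy, though the paper's actual objective and estimator may differ from this sketch.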
Findings
The proposed sampling plan outperforms naive strategies, such as attempting to match the target distribution.
The method achieves minimax optimal risk bounds.
Extensions to prediction problems are provided.
Abstract
Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations) and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample…
Taxonomy
Topics: Advanced Causal Inference Techniques · Statistical Methods and Inference · Advanced Bandit Algorithms Research
