Worst-Case Analysis for Randomly Collected Data
Justin Y. Chen, Gregory Valiant, Paul Valiant

TL;DR
This paper presents a new framework for statistical estimation that accounts for how data samples are collected without assuming any distribution on data values, providing an efficient algorithm with bounded worst-case expected error.
Contribution
The paper introduces a worst-case analysis framework built on the data collection process, gives an efficient estimation algorithm with provable error bounds, and draws a connection to the Grothendieck problem.
Findings
The algorithm's worst-case expected error is at most a multiplicative factor of π/2 worse than optimal
Framework applies to importance sampling, snowball sampling, and selective prediction
Provides a uniform analysis for data collection-aware estimation methods
Abstract
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n] = {1, ..., n} with corresponding data values x_1, ..., x_n. We observe the values for a "sample" set A ⊂ [n] and wish to estimate some statistic of the values for a "target" set B ⊂ [n], where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution P from which the sample and target sets are drawn, and the worst case is with respect to the data values x_1, ..., x_n. Within this framework, we give…
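The "worst-case, expected error" criterion can be made concrete on a toy instance. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes data values bounded in [-1, 1], squared error, and the plain sample mean as the estimator, and it finds the worst case by brute force over sign vectors (exponential in n, so only usable for tiny populations). The function name and the example distribution are hypothetical. Maximizing a quadratic form over sign vectors is presumably where the connection to the Grothendieck problem mentioned above comes from.

```python
import itertools
import numpy as np

def worst_case_expected_sq_error(pairs, probs, n):
    """Worst case over x in [-1, 1]^n of the expected squared error of the
    sample-mean estimator of the target mean (assumed setup, brute force)."""
    # For fixed (A, B) the error is c . x, where c holds the sample-mean
    # coefficients minus the target-mean coefficients. The expected squared
    # error is therefore x^T M x with M = sum_k p_k * c_k c_k^T.
    M = np.zeros((n, n))
    for (A, B), p in zip(pairs, probs):
        c = np.zeros(n)
        c[list(A)] += 1.0 / len(A)
        c[list(B)] -= 1.0 / len(B)
        M += p * np.outer(c, c)
    # M is positive semidefinite, so x^T M x is convex and its maximum over
    # the cube is attained at a vertex: enumerate all 2^n sign vectors.
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(signs)
        best = max(best, float(x @ M @ x))
    return best

# Hypothetical example: the target is the whole population {0, 1, 2} and the
# sample is a single element chosen uniformly at random.
n = 3
target = (0, 1, 2)
pairs = [((i,), target) for i in range(n)]
probs = [1.0 / n] * n
print(worst_case_expected_sq_error(pairs, probs, n))
```

In this example the worst case is reached when one value has the opposite sign of the other two, which gives an expected squared error of 8/9. A data-collection-aware estimator would replace the fixed sample-mean coefficients with weights chosen as a function of P, A, and B.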
Taxonomy
Topics: Markov Chains and Monte Carlo Methods · Machine Learning and Algorithms · Statistical Methods and Inference
