Worst-Case Analysis for Randomly Collected Data

Justin Y. Chen; Gregory Valiant; Paul Valiant

arXiv:1911.03605 · cs.DS · October 27, 2020

TL;DR

This paper presents a new framework for statistical estimation that accounts for how data samples are collected without assuming any distribution on data values, providing an efficient algorithm with bounded worst-case expected error.

Contribution

It introduces a novel worst-case analysis framework based on the data collection process and offers an efficient estimation algorithm with provable error bounds, connecting to the Grothendieck problem.

Findings

1. The algorithm's error is at most a factor of π/2 worse than optimal.

2. The framework applies to importance sampling, snowball sampling, and selective prediction.

3. It provides a uniform analysis for estimation methods that are aware of the data collection process.

Abstract

We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements $[n] = \{1, \dots, n\}$ with corresponding data values $x_{1}, \dots, x_{n}$. We observe the values for a "sample" set $A \subset [n]$ and wish to estimate some statistic of the values for a "target" set $B \subset [n]$, where $B$ could be the entire set. Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution $P$ from which the sample set $A$ and target set $B$ are drawn, and the worst case is with respect to the data values $x_{1}, \dots, x_{n}$. Within this framework, we give…
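
To make the "worst-case, expected error" criterion concrete, here is a minimal Python sketch on a toy population. It is an illustration only and not the paper's algorithm. Several choices are assumptions made here: the error is squared error, the statistic is the mean over $B$, the estimator is the plain sample mean, $P$ picks $A$ uniformly among size-$k$ subsets with $B = [n]$, and the data values are restricted to $\pm 1$ so the worst case can be found by brute force. All function names are hypothetical.

```python
import itertools


def uniform_subsets(n, k):
    """Toy distribution P: A is a uniform size-k subset, B is the whole population.

    Returned as a list of (probability, A, B) triples.
    """
    subsets = list(itertools.combinations(range(n), k))
    B = tuple(range(n))
    return [(1.0 / len(subsets), A, B) for A in subsets]


def sample_mean(observed, A, B):
    """Estimator that sees only the values indexed by A (plus the sets A and B)."""
    return sum(observed[i] for i in A) / len(A)


def target_mean(x, B):
    """Statistic to estimate: the mean of the values indexed by B."""
    return sum(x[i] for i in B) / len(B)


def expected_sq_error(x, pairs, estimator, statistic):
    """Expectation over (A, B) ~ P of the squared error, for fixed data values x."""
    total = 0.0
    for prob, A, B in pairs:
        observed = {i: x[i] for i in A}  # the estimator never sees values outside A
        total += prob * (estimator(observed, A, B) - statistic(x, B)) ** 2
    return total


def worst_case_expected_error(n, pairs, estimator, statistic):
    """Maximum of the expected squared error over all x in {-1, +1}^n.

    Brute force, exponential in n; only meant for tiny examples.
    """
    return max(
        expected_sq_error(x, pairs, estimator, statistic)
        for x in itertools.product((-1.0, 1.0), repeat=n)
    )


if __name__ == "__main__":
    P = uniform_subsets(4, 2)
    print(worst_case_expected_error(4, P, sample_mean, target_mean))
```

For this toy setup the worst case is approximately 1/3, attained for example at $x = (1, 1, -1, -1)$, while observing the whole population (`uniform_subsets(4, 4)`) gives zero error. Because the sample mean is linear in the observed values, the expected squared error is a convex quadratic in $x$, so searching the $\pm 1$ corners also covers all values in $[-1, 1]$ for this particular estimator.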

Code & Models

Repositories

justc2/worst-case-randomly-collected
Official

Videos

Worst-Case Analysis for Randomly Collected Data · SlidesLive

Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Machine Learning and Algorithms · Statistical Methods and Inference