Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models
Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

TL;DR
This paper introduces a partial identification method that uses pretrained model predictions as weak shadow variables to obtain sharp bounds on population quantities when data are missing not at random (MNAR), tightening those bounds relative to using the observed data alone.
Contribution
The paper develops a framework that incorporates outcome predictions from pretrained models into partial identification as additional linear constraints, relaxing the classical shadow-variable assumptions and tightening bounds in missing data problems.
Findings
LLM predictions narrow identification intervals by 75-83%.
The proposed method maintains valid coverage under realistic MNAR mechanisms.
Predictions remain effective even when classical shadow-variable conditions are not met.
Abstract
Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak…
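To make the "pair of linear programs" idea concrete, here is a minimal sketch for a binary outcome Y, a binary prediction Z, and a response indicator R. It is not the paper's formulation: the unknowns are the unobserved cell probabilities P(Y=y, Z=z, R=0), the equality constraints encode the observed distribution of Z among non-respondents, and the optional `alpha` constraint (a lower bound on prediction accuracy among non-respondents) is a hypothetical stand-in for the paper's weak-shadow-variable constraints. The function name `mean_bounds` and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def mean_bounds(p_y1_r1, b, alpha=None):
    """Lower and upper bounds on E[Y] for binary Y under missing data.

    p_y1_r1 : observed P(Y=1, R=1)
    b       : observed [P(Z=0, R=0), P(Z=1, R=0)]
    alpha   : optional, HYPOTHETICAL assumption that the prediction Z
              matches Y with probability >= alpha among non-respondents
    """
    b = np.asarray(b, dtype=float)
    # Unknowns w = [w00, w01, w10, w11], where w[y, z] = P(Y=y, Z=z, R=0).
    # Objective coefficient picks out P(Y=1, R=0) = w10 + w11.
    c = np.array([0.0, 0.0, 1.0, 1.0])
    # Observed-data constraints: w[0, z] + w[1, z] = P(Z=z, R=0).
    A_eq = np.array([[1.0, 0.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0, 1.0]])
    A_ub = b_ub = None
    if alpha is not None:
        # Accuracy constraint: w00 + w11 >= alpha * P(R=0), written as <=.
        A_ub = np.array([[-1.0, 0.0, 0.0, -1.0]])
        b_ub = np.array([-alpha * b.sum()])
    kwargs = dict(A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=(0, None))
    lo = linprog(c, **kwargs)    # minimize P(Y=1, R=0)
    hi = linprog(-c, **kwargs)   # maximize P(Y=1, R=0)
    if not (lo.success and hi.success):
        raise ValueError("LP infeasible: constraints contradict the data")
    return p_y1_r1 + lo.fun, p_y1_r1 - hi.fun


# Illustrative numbers only: 60% respond, 70% of respondents have Y=1.
no_pred = mean_bounds(0.42, [0.1, 0.3])
with_pred = mean_bounds(0.42, [0.1, 0.3], alpha=0.8)
print("worst-case bounds:", no_pred)
print("with prediction constraint:", with_pred)
```

With these made-up inputs, the worst-case interval is roughly [0.42, 0.82], and adding the hypothetical accuracy constraint shrinks it to roughly [0.64, 0.80]. The point of the sketch is only that each extra piece of prediction information enters as one more linear constraint, so the feasible set, and hence the interval, can only get smaller.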
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Bandit Algorithms Research · Sentiment Analysis and Opinion Mining
