Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
Zikun Ye, Hema Yoganarasimhan

TL;DR
This paper proposes a framework for allocating a fixed budget of human survey responses when LLM-generated synthetic responses are available, improving efficiency by directing human effort to the tasks where the LLM is least reliable.
Contribution
It combines three components: a question-specific rectification difficulty, a closed-form optimal sample allocation rule, and a meta-learning approach that predicts rectification difficulty for new surveys without pilot data.
Findings
Achieves MSE reductions of 11.4% and 10.5% on real datasets.
Captures 61-79% of the theoretical efficiency gains.
Validates the approach across multiple domains and LLMs.
Abstract
Large Language Models can generate synthetic survey responses at low cost, but their accuracy varies unpredictably across questions. We study the design problem of allocating a fixed budget of human respondents across estimation tasks when cheap LLM predictions are available for every task. Our framework combines three components. First, building on Prediction-Powered Inference, we characterize a question-specific rectification difficulty that governs how quickly the estimator's variance decreases with human sample size. Second, we derive a closed-form optimal allocation rule that directs more human labels to tasks where the LLM is least reliable. Third, since rectification difficulty depends on unobserved human responses for new surveys, we propose a meta-learning approach, trained on historical data, that predicts it for entirely new tasks without pilot data. The framework extends to…
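The allocation idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's closed-form rule, which the truncated abstract does not state. It assumes, for illustration only, that each task's estimator variance is approximately d_k / n_k, where d_k is a rectification difficulty and n_k is the number of human responses assigned to task k, and that the goal is to minimize the sum of these variances under a fixed budget. Under those assumptions the minimizer is the classical Neyman-style allocation, with n_k proportional to sqrt(d_k). The function names and the difficulty values below are hypothetical.

```python
import math


def proportional_sqrt_allocation(difficulty, budget):
    """Split `budget` human responses across tasks in proportion to sqrt(d_k).

    Assumption (not from the paper): task k has variance d_k / n_k and the
    objective is the sum of variances subject to sum(n_k) == budget.
    Returns fractional allocations; a real design would round them.
    """
    roots = [math.sqrt(d) for d in difficulty]
    total = sum(roots)
    return [budget * r / total for r in roots]


def total_variance(difficulty, allocation):
    """Sum of the assumed per-task variances d_k / n_k."""
    return sum(d / n for d, n in zip(difficulty, allocation))


# Hypothetical rectification difficulties for three survey questions.
difficulty = [0.04, 0.25, 1.0]
budget = 300

optimal = proportional_sqrt_allocation(difficulty, budget)
uniform = [budget / len(difficulty)] * len(difficulty)

print("optimal allocation:", [round(n, 1) for n in optimal])
print("variance (optimal):", total_variance(difficulty, optimal))
print("variance (uniform):", total_variance(difficulty, uniform))
```

In this toy setting the question with the largest difficulty receives the most human responses, and the summed variance is lower than under an even split. In practice the difficulties are unknown for a new survey, which is the gap the abstract says the meta-learning component addresses.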
