Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

Zikun Ye; Hema Yoganarasimhan

arXiv:2604.17267·cs.AI·April 21, 2026

TL;DR

This paper proposes a framework for allocating a fixed budget of human survey respondents across questions when LLM-generated synthetic responses are available for every question, directing human effort to the questions where the LLM is least reliable.

Contribution

It combines three components: a question-specific measure of rectification difficulty, a closed-form optimal sample allocation rule, and a meta-learning predictor that estimates rectification difficulty for new surveys without pilot data.

Findings

01 Achieves 11.4% and 10.5% MSE reductions in real datasets.

02 Captures 61-79% of the theoretical efficiency gains.

03 Validates the approach across multiple domains and LLMs.

Abstract

Large Language Models can generate synthetic survey responses at low cost, but their accuracy varies unpredictably across questions. We study the design problem of allocating a fixed budget of human respondents across estimation tasks when cheap LLM predictions are available for every task. Our framework combines three components. First, building on Prediction-Powered Inference, we characterize a question-specific rectification difficulty that governs how quickly the estimator's variance decreases with human sample size. Second, we derive a closed-form optimal allocation rule that directs more human labels to tasks where the LLM is least reliable. Third, since rectification difficulty depends on unobserved human responses for new surveys, we propose a meta-learning approach, trained on historical data, that predicts it for entirely new tasks without pilot data. The framework extends to…
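
The first two components can be illustrated with a minimal sketch. The abstract does not state the closed-form rule, so the code below rests on assumptions: that the estimator's variance for task k scales as d_k / n_k, where d_k is the variance of the human-minus-LLM gap and n_k is the human sample size, and that the goal is to minimize the summed variance under a fixed budget. Under those assumptions the minimizer is the classical square-root (Neyman-type) allocation. The function names `ppi_mean`, `allocate`, and `total_variance` are hypothetical and are not taken from the paper.

```python
import numpy as np

def ppi_mean(llm_all, llm_labeled, human_labeled):
    """Prediction-powered mean: LLM average plus the average human-minus-LLM gap."""
    rectifier = np.asarray(human_labeled, dtype=float) - np.asarray(llm_labeled, dtype=float)
    return float(np.mean(llm_all) + np.mean(rectifier))

def allocate(difficulty, budget):
    """Split `budget` across tasks in proportion to sqrt(difficulty).

    Assumption: variance for task k is difficulty[k] / n[k]; this rule minimizes
    the sum of those variances subject to sum(n) == budget.
    Returns fractional sample sizes; round them before use.
    """
    root = np.sqrt(np.asarray(difficulty, dtype=float))
    return budget * root / root.sum()

def total_variance(difficulty, n):
    """Summed variance across tasks under the assumed d_k / n_k model."""
    return float(np.sum(np.asarray(difficulty, dtype=float) / np.asarray(n, dtype=float)))

# Three hypothetical tasks; larger difficulty means the LLM is less reliable.
difficulty = [1.0, 4.0, 9.0]
optimal = allocate(difficulty, budget=60)
equal = np.full(3, 20.0)
```

In this toy setup the hardest task receives the most human responses, and the summed variance under the square-root split is lower than under an equal split. In the paper's setting the difficulty values for a new survey would come from the meta-learning predictor rather than from observed human responses.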
