The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Andrew Hong, Jason Potteiger, Luis E. Zapata

TL;DR
This study evaluates how prompt design, model choice, and input text influence the accuracy of GPT models in predicting sports fans' experience ratings from open-ended survey responses, revealing fundamental measurement limits.
Contribution
It demonstrates that input text variability exceeds prompt and model effects, highlighting intrinsic limits in predicting subjective ratings from open-ended survey data.
Findings
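Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%).
Model swaps can cost more than prompt changes gain: from the best configuration, GPT 5.2 returned to the baseline and GPT 4.1-mini fell six percentage points below it.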
Text content variability impacts accuracy more than prompt or model choice.
Abstract
An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order…
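The within +/-1 agreement figure quoted in the abstract can be read as the share of surveys where the predicted rating lands no more than one point from the fan-reported rating. The sketch below is a minimal illustration of that reading, not the authors' code. The function name, the toy ratings, and the assumption of integer ratings are all hypothetical, and the rating scale is not stated in this summary.

```python
from typing import Sequence


def within_tolerance_agreement(
    predicted: Sequence[int],
    reported: Sequence[int],
    tolerance: int = 1,
) -> float:
    """Fraction of items where |predicted - reported| <= tolerance.

    Hypothetical helper: assumes integer ratings on a shared scale.
    """
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one rating pair is required")
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reported))
    return hits / len(predicted)


# Toy data invented for illustration only (not from the study).
predicted = [9, 7, 4, 10, 6]
reported = [10, 7, 7, 9, 8]

# Absolute differences are 1, 0, 3, 1, 2, so three of five pairs agree within one point.
agreement = within_tolerance_agreement(predicted, reported)
print(f"within +/-1 agreement: {agreement:.0%}")
```

Computing the same quantity separately for each prompt and model configuration, and for subsets of the input text, would give the kind of comparison the abstract describes.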
