Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
Tianyu Wang, Jiajun Li, Jianghao Lin

TL;DR
This paper introduces ConsumerSimBench, a benchmark for evaluating LLMs' ability to reconstruct real consumer reactions from Chinese social media, revealing significant gaps between model performance and actual consumer intuition.
Contribution
The paper presents a new benchmark built from real social media data, with a decomposed, auditable evaluation method, highlighting the limitations of current LLMs in consumer reaction prediction.
Findings
Strongest model covers only 47.8% of real reaction criteria.
GPT-5.2 and Claude-4.6 perform poorly despite their strength on other benchmarks.
Structured reasoning prompts decrease coverage, whereas multi-agent pipelines improve performance.
Abstract
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while…
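The decomposed scoring described in the abstract can be illustrated with a small sketch. It assumes that each criterion receives one yes-no vote from each of three judges, that a criterion counts as covered when a majority votes yes, and that "agreement" means all three judges gave the same answer. These aggregation rules and the function names (`majority_vote`, `coverage`, `unanimous_agreement`) are illustrative assumptions, not taken from the paper, and the votes are toy data.

```python
from typing import List


def majority_vote(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the judges voted yes."""
    return sum(votes) * 2 > len(votes)


def coverage(judgments: List[List[bool]]) -> float:
    """Fraction of criteria whose majority decision is yes (assumed metric)."""
    if not judgments:
        return 0.0
    return sum(majority_vote(v) for v in judgments) / len(judgments)


def unanimous_agreement(judgments: List[List[bool]]) -> float:
    """Fraction of criteria on which every judge gave the same answer (assumed definition)."""
    if not judgments:
        return 0.0
    return sum(len(set(v)) == 1 for v in judgments) / len(judgments)


# Toy data: four criteria, three hypothetical judge votes each.
judgments = [
    [True, True, True],     # covered, unanimous
    [True, False, True],    # covered, split
    [False, False, False],  # not covered, unanimous
    [False, True, False],   # not covered, split
]

print(f"coverage:  {coverage(judgments):.2f}")
print(f"agreement: {unanimous_agreement(judgments):.2f}")
```

Because each decision is a single yes-no vote on one concrete reaction point, any disagreement can be traced to a specific criterion rather than to an overall preference score.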
