Validated Hypotheses as a Lens for Human-Likeness Evaluation in AI Agents
Xuan Liu, HaoYang Shang, Zizhang Liu, Yuanjun Feng, Guankai Zhai, Yunze Xiao, Yiwen Tu, Haojian Jin

TL;DR
This paper proposes evaluating how human-like LLM-based agents are by testing whether they reproduce validated social science hypotheses, which yields an evaluation that is objective, decomposable, and scalable.
Contribution
The authors present HumanStudy-Bench, an open platform that converts social science studies into simulation environments for evaluating AI agents' human-likeness.
Findings
Agent behavior ranges from fully replicating the human findings to failing to replicate them.
Agent design impacts alignment more than model size.
Alignment effects are non-monotonic.
Abstract
We propose using validated behavioral hypotheses as a lens for evaluating human-likeness in LLM-based agents. Our key idea is simple: If an agent is human-like, a population of such agents should reach the same inferential conclusion as the human population when run through the same experiment. Decades of social science have produced many such validated findings, each anchored to concrete experimental protocols and robustly established through independent replication. This yields an evaluation that is objective, decomposable, and scalable. We operationalize this lens through HumanStudy-Bench, an open platform that turns published human-subject studies into reusable simulation environments and administers the evaluation to configurable agents. It scores agent-human alignment on two metrics: the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score…
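The key idea in the abstract, that an agent population should reach the same inferential conclusion as the human population, can be illustrated with a minimal sketch. This is not the paper's PAS or ECS, whose definitions are not given in the text above. It assumes a hypothetical study with a binary outcome and two conditions, and uses an ordinary two-proportion z-test. All function names, the significance level, and the counts are illustrative assumptions, not values from the paper.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions. Returns (z, p_value)."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All outcomes identical in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def conclusion(successes_treat, n_treat, successes_ctrl, n_ctrl, alpha=0.05):
    """Reduce a treatment-vs-control comparison to 'positive', 'negative', or 'null'."""
    z, p = two_proportion_z_test(successes_treat, n_treat, successes_ctrl, n_ctrl)
    if p >= alpha:
        return "null"
    return "positive" if z > 0 else "negative"

def same_conclusion(human, agent, alpha=0.05):
    """True if both populations support the same inferential conclusion."""
    return conclusion(*human, alpha=alpha) == conclusion(*agent, alpha=alpha)

# Hypothetical counts: (successes_treat, n_treat, successes_ctrl, n_ctrl).
human = (70, 100, 50, 100)    # humans show a significant positive effect
agent_a = (90, 100, 60, 100)  # larger effect, same direction and significance
agent_b = (55, 100, 52, 100)  # no detectable effect

print(same_conclusion(human, agent_a))  # True
print(same_conclusion(human, agent_b))  # False
```

Note that `agent_a` agrees with the human conclusion even though its effect is larger. Agreement on the conclusion and agreement on effect size are separate questions, which is presumably why the platform reports two metrics rather than one.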
