HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Xuan Liu; Haoyang Shang; Zizhang Liu; Xinyan Liu; Yunze Xiao; Yiwen Tu; Haojian Jin

arXiv:2602.00685 · cs.AI · February 3, 2026

TL;DR

This paper introduces HUMANSTUDY-BENCH, a benchmark and framework for designing and evaluating AI agents that simulate human participants in social science experiments, aiming to improve fidelity and reproducibility.

Contribution

The paper presents an agent-design framework and a benchmark for reconstructing and evaluating human-subject experiments using LLM-based agents.

Findings

01. Successfully instantiated 12 foundational studies with over 6,000 trials.

02. Developed new metrics to quantify agreement between human and agent behaviors.

03. Reproduced original statistical procedures end-to-end in a shared runtime.

Abstract

Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter-Extract-Execute-Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the…
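The framing in the abstract (an agent as a base model plus a specification, run through Filter, Extract, Execute, and Evaluate stages) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class and function names (`AgentSpec`, `Agent`, `Study`, `run_pipeline`), the toy base model, the toy analysis, and the `abs_gap` comparison are hypothetical and are not taken from the HUMANSTUDY-BENCH code or its metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# NOTE: all names below are hypothetical, chosen only to illustrate the idea.

@dataclass
class AgentSpec:
    """Specification encoding behavioral assumptions (e.g. participant attributes)."""
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Agent:
    """An agent pairs a base model with a specification."""
    base_model: Callable[[str], str]  # stand-in for an LLM call: prompt -> response
    spec: AgentSpec

    def respond(self, trial_prompt: str) -> str:
        persona = ", ".join(f"{k}={v}" for k, v in sorted(self.spec.attributes.items()))
        return self.base_model(f"[participant: {persona}] {trial_prompt}")

@dataclass
class Study:
    name: str
    has_full_protocol: bool                 # toy inclusion criterion (assumed)
    trials: List[str]                       # trial sequence to replay
    analysis: Callable[[List[str]], float]  # stand-in for the original analysis
    human_result: float                     # toy value, not real data

def filter_studies(studies: List[Study]) -> List[Study]:
    return [s for s in studies if s.has_full_protocol]

def extract_trials(study: Study) -> List[str]:
    return list(study.trials)

def execute(agent: Agent, trials: List[str]) -> List[str]:
    return [agent.respond(t) for t in trials]

def evaluate(study: Study, responses: List[str]) -> Dict[str, object]:
    agent_result = study.analysis(responses)
    return {
        "study": study.name,
        "human": study.human_result,
        "agent": agent_result,
        # Placeholder comparison only; the paper's agreement metrics are not shown here.
        "abs_gap": abs(agent_result - study.human_result),
    }

def run_pipeline(agent: Agent, studies: List[Study]) -> List[Dict[str, object]]:
    return [evaluate(s, execute(agent, extract_trials(s))) for s in filter_studies(studies)]

# Toy usage: a deterministic "model" and a fraction-of-yes analysis.
def toy_model(prompt: str) -> str:
    return "yes" if "gain" in prompt else "no"

def fraction_yes(responses: List[str]) -> float:
    return sum(r == "yes" for r in responses) / len(responses)

toy_studies = [
    Study("framing-toy", True, ["gain frame: accept?", "loss frame: accept?"], fraction_yes, 0.5),
    Study("incomplete-toy", False, ["gain frame: accept?"], fraction_yes, 1.0),
]
toy_agent = Agent(base_model=toy_model, spec=AgentSpec({"age": "30"}))
results = run_pipeline(toy_agent, toy_studies)
```

Keeping the specification separate from the base model means the same `toy_model` can be paired with a different `AgentSpec` and rerun, which is one way to separate model effects from agent-setup effects in the sense the abstract describes.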

Code & Models

Datasets

fuyyckwhy/HS-Bench-results
dataset · 3.0k downloads

Taxonomy

Topics: Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing