Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen; Ruoxi Xu; Boxi Cao; Ruotong Pan; Yunfei Zhang; Yifei Hu; Yong Du; Tingting Gao; Yaojie Lu; Yingfei Sun; Xianpei Han; Le Sun; Xiangyu Wu; Hongyu Lin

arXiv:2604.08362·cs.CL·May 22, 2026

TL;DR

This paper introduces OmniBehavior, a benchmark built from real-world data for evaluating how well large language models simulate complex, long-horizon human behavior across diverse scenarios.

Contribution

The paper presents the first user simulation benchmark constructed entirely from real-world data and uses it to evaluate LLMs, revealing their limitations and biases in modeling authentic behavior.

Findings

01. LLMs struggle with long-horizon, cross-scenario behaviors.

02. Current models tend to homogenize personalities and exhibit hyper-activity.

03. Performance plateaus even with larger context windows.

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance…

Code & Models

Repositories

icip-cas/OmniBehavior (GitHub)
