How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation

Yang Xiao; Yi Cheng; Jinlan Fu; Jiashuo Wang; Wenjie Li; Pengfei Liu

arXiv:2312.17115 · cs.CL · June 18, 2024 · 1 citation

TL;DR

This paper introduces SimulateBench, a systematic benchmark for evaluating the believability of LLMs in simulating human behaviors, focusing on consistency and robustness across diverse character profiles.

Contribution

The work presents a novel benchmark, SimulateBench, for assessing LLMs' ability to simulate human behaviors convincingly and reliably, addressing a gap in systematic evaluation methods.

Findings

01. Current LLMs often fail to maintain character consistency.
02. LLMs show vulnerability to behavioral perturbations.
03. Performance varies significantly across different models.

Abstract

In recent years, AI has demonstrated remarkable capabilities in simulating human behaviors, particularly AI systems built on large language models (LLMs). However, because LLMs' simulated behaviors have not been evaluated systematically, the believability of LLMs among humans remains ambiguous, i.e., it is unclear which behaviors of LLMs are convincingly human-like and which need further improvement. In this work, we design SimulateBench to evaluate the believability of LLMs when simulating human behaviors. Specifically, we evaluate the believability of LLMs along two critical dimensions: 1) consistency: the extent to which LLMs behave consistently with the information given about the human being simulated; and 2) robustness: the ability of LLMs' simulated behaviors to remain stable when faced with perturbations. SimulateBench includes 65 character profiles and a total of 8,400 questions…
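
The two dimensions named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or the SimulateBench repository: the multiple-choice question format, the names `consistency_score`, `robustness_gap`, and `lookup_model`, and the choice to measure robustness as the drop in consistency after a profile is perturbed. The toy `lookup_model` stands in for an LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical data shapes, assumed only for this sketch.
Profile = Dict[str, str]
Question = Tuple[str, List[str], int]  # (profile field asked about, options, gold index)
AnswerFn = Callable[[Profile, str, List[str]], int]


def consistency_score(answer_fn: AnswerFn, profile: Profile,
                      questions: List[Question]) -> float:
    """Fraction of questions whose chosen option agrees with the profile."""
    if not questions:
        raise ValueError("questions must be non-empty")
    correct = sum(
        1 for field, options, gold in questions
        if answer_fn(profile, field, options) == gold
    )
    return correct / len(questions)


def robustness_gap(answer_fn: AnswerFn,
                   original: Tuple[Profile, List[Question]],
                   perturbed: Tuple[Profile, List[Question]]) -> float:
    """Drop in consistency when moving from the original to the perturbed profile."""
    base = consistency_score(answer_fn, *original)
    shifted = consistency_score(answer_fn, *perturbed)
    return base - shifted


def lookup_model(profile: Profile, field: str, options: List[str]) -> int:
    """Toy stand-in for an LLM: picks the option equal to the profile value."""
    value = profile.get(field)
    return options.index(value) if value in options else 0


# Original profile and a perturbed copy in which only the age changes.
profile = {"age": "30", "city": "Paris"}
questions = [("age", ["20", "30", "40"], 1), ("city", ["Paris", "Rome"], 0)]
perturbed_profile = {"age": "40", "city": "Paris"}
perturbed_questions = [("age", ["20", "30", "40"], 2), ("city", ["Paris", "Rome"], 0)]

score = consistency_score(lookup_model, profile, questions)
gap = robustness_gap(lookup_model,
                     (profile, questions),
                     (perturbed_profile, perturbed_questions))
print(score, gap)
```

A gap near zero means the simulated behavior tracks the perturbed profile as well as the original one; a large positive gap means the model kept answering as if the profile had not changed.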

Code & Models

Repositories

llmconference/emnlp_conference_2024
Official

Datasets

YangXiao-nlp/SimulateBench
dataset · 32 downloads

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)

Methods: ALIGN