MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments

Yin Cai; Zhouhong Gu; Zhaohan Du; Zheyu Ye; Shaosheng Cao; Yiqian Xu; Hongwei Feng; Ping Chen

arXiv:2501.01652 · cs.CL · January 21, 2026

TL;DR

This paper presents MIRAGE, a comprehensive framework for evaluating large language models' ability to perform complex social interactions through murder mystery games, revealing current models' limitations in understanding and role-playing complex human behaviors.

Contribution

Introduces MIRAGE, a novel evaluation framework with diverse scripts and metrics to assess LLMs' social interactive capabilities in complex role-playing scenarios.

Findings

01. GPT-4 struggles with complex social interactions in MIRAGE.

02. MIRAGE's metrics reveal limitations in LLMs' trust and role-playing abilities.

03. The framework provides a new benchmark for social intelligence in LLMs.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs' proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs' performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure the dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs' capability of gathering information, the Interactivity Capability Index (ICI) to assess role-playing capabilities and the Script…

Code & Models

Repositories

lime728/mirage (Official)

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Absolute Position Encodings · Softmax · Linear Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing