M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games
Sixiong Xie, Zhuofan Shi, Haiyang Shen, Yun Ma, Xiang Jing

TL;DR
M3-BENCH introduces a process-aware evaluation framework for LLM agents in mixed-motive games, revealing nuanced social behaviors and safety risks overlooked by outcome-only metrics.
Contribution
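M3-BENCH is a benchmark of 24 mixed-motive games evaluated through three complementary analysis views: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). It shows that process signals from reasoning and communication matter when assessing LLM social competence.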
Findings
Process-aware evaluation of 11 frontier LLMs and a human baseline uncovers substantial differences in social competence that outcome-only evaluation misses.
Reasoning models often "overthink but undercommunicate": they score well on internal deliberation but fail to translate it into effective social communication.
Outcome-only metrics miss safety risks such as opportunistic reasoning paired with cooperative behavior.
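The last finding can be illustrated with a minimal sketch. This is not the paper's scoring method: the `Turn` record, the `reasoning_label` field, and the label values "prosocial" and "opportunistic" are hypothetical names chosen for illustration. The sketch only shows why an outcome-only metric can look healthy while a cross-view check flags a concern.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str           # "cooperate" or "defect" (behavioral outcome)
    reasoning_label: str  # hypothetical annotation of the reasoning trace


def cooperation_rate(turns):
    """Outcome-only view: fraction of turns with a cooperative action."""
    if not turns:
        return 0.0
    return sum(t.action == "cooperate" for t in turns) / len(turns)


def masked_opportunism(turns):
    """Cross-view check: indices of turns that cooperate in behavior
    while the reasoning is labelled opportunistic."""
    return [
        i for i, t in enumerate(turns)
        if t.action == "cooperate" and t.reasoning_label == "opportunistic"
    ]


# Made-up example data, not results from the benchmark.
turns = [
    Turn("cooperate", "prosocial"),
    Turn("cooperate", "opportunistic"),
    Turn("cooperate", "opportunistic"),
    Turn("defect", "opportunistic"),
]

print(cooperation_rate(turns))    # 0.75
print(masked_opportunism(turns))  # [1, 2]
```

In this toy data the agent cooperates on three of four turns, yet two of those cooperative turns carry opportunistic reasoning, which the behavioral metric alone cannot reveal.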
Abstract
Existing benchmarks for LLM agents' social behavior typically focus on a single capability dimension and evaluate only behavioral outcomes, overlooking process signals from reasoning and communication. We present M3-BENCH, a benchmark of 24 mixed-motive games with a process-aware evaluation framework spanning three complementary views: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). Evaluating 11 frontier LLMs and a human baseline, M3-BENCH reveals substantial differences in social competence that outcome-only evaluation misses. In particular, we identify an "overthink-undercommunicate" pattern: reasoning models achieve strong internal deliberation scores but often fail to translate them into effective social communication. Although top models can surpass humans on task outcomes, humans exhibit markedly higher cross-view…
