M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games

Sixiong Xie; Zhuofan Shi; Haiyang Shen; Yun Ma; Xiang Jing

arXiv:2601.08462 · cs.AI · April 3, 2026


TL;DR

M3-BENCH introduces a process-aware evaluation framework for LLM agents in mixed-motive games, revealing nuanced social behaviors and safety risks overlooked by outcome-only metrics.

Contribution

M3-BENCH is a benchmark of 24 mixed-motive games evaluated through three analysis views (behavioral trajectories, reasoning processes, and communication content), showing that process signals are needed to assess LLM social competence.

Findings

01. Process-aware evaluation uncovers substantial differences in social competence that outcome-only evaluation misses.

02. Reasoning models often overthink but undercommunicate: strong internal deliberation fails to translate into effective social communication.

03. Outcome-only metrics miss safety risks such as opportunistic reasoning paired with cooperative behavior.

Abstract

Existing benchmarks for LLM agents' social behavior typically focus on a single capability dimension and evaluate only behavioral outcomes, overlooking process signals from reasoning and communication. We present M3-BENCH, a benchmark of 24 mixed-motive games with a process-aware evaluation framework spanning three complementary views: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). Evaluating 11 frontier LLMs and a human baseline, M3-BENCH reveals substantial differences in social competence that outcome-only evaluation misses. In particular, we identify an "overthink-undercommunicate" pattern: reasoning models achieve strong internal deliberation scores but often fail to translate them into effective social communication. Although top models can surpass humans on task outcomes, humans exhibit markedly higher cross-view…
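
As a rough illustration of why the three views are reported side by side, the sketch below stores one score per view and flags the "overthink-undercommunicate" pattern as a large gap between reasoning and communication scores. The `ViewScores` type, the [0, 1] scale, the `view_gap` helper, and the 0.3 margin are all hypothetical choices made for this example; they are not the paper's metrics or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewScores:
    """Hypothetical per-agent scores, assumed to lie in [0, 1]."""
    bta: float  # Behavioral Trajectory Analysis
    rpa: float  # Reasoning Process Analysis
    cca: float  # Communication Content Analysis


def view_gap(scores: ViewScores) -> float:
    """Largest pairwise gap between views; 0.0 means the views agree."""
    values = (scores.bta, scores.rpa, scores.cca)
    return max(values) - min(values)


def overthinks_undercommunicates(scores: ViewScores, margin: float = 0.3) -> bool:
    """Flag agents whose reasoning score exceeds their communication
    score by at least `margin` (an illustrative threshold)."""
    return scores.rpa - scores.cca >= margin


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    agent = ViewScores(bta=0.8, rpa=0.9, cca=0.4)
    print(view_gap(agent), overthinks_undercommunicates(agent))
```

An outcome-only evaluation would see just the first field, so two agents with the same behavioral score but very different reasoning or communication scores would look identical.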


Code & Models

Datasets

molmohsen/awesome-ai-agent-papers
dataset · 39 dl
