SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, and Xiuying Chen

TL;DR
SocialMaze is a comprehensive benchmark designed to evaluate large language models' social reasoning abilities across diverse, challenging scenarios involving deep reasoning, dynamic interactions, and information uncertainty.
Contribution
The paper introduces SocialMaze, a novel benchmark with diverse tasks and validation methods to systematically assess social reasoning in large language models.
Findings
Models show varied ability in handling dynamic interactions.
Chain-of-thought reasoning improves performance on complex tasks.
Model reasoning degrades under uncertainty.
Abstract
Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability: the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed time-aware, graph-based formalization of social interactions is both compelling and meaningful. I believe it could be essential for enabling dynamic social reasoning, although this work does not make full use of it.
2. The empirical evaluation is detailed and shows clear differentiation between model families and reasoning strategies (e.g., Long vs. Short CoT). Detailed experimental settings are also provided.
1. The definition of social reasoning abilities in this work is somewhat unclear. The authors appear to classify commonsense reasoning (Onoe et al., 2021; Lin et al., 2020), peer review (Tran et al., 2020; Szumega et al., 2023), and debating tasks (Tiwari et al., 2025) as instances of social reasoning. However, I am not fully convinced that these tasks necessarily involve the capacity to understand social context, infer others’ mental states, and make contextually appropriate judgments. Conseque
- The six tasks cover meaningfully different aspects of social reasoning, from role deduction to sentiment analysis to graph reasoning.
- I like the empirical analysis across different variations: long CoT, dynamic interaction, and degradation under uncertainty.
- Unlike benchmarks on Werewolf and Avalon that only measure win rate, this paper adds further metrics, which I appreciate.
- The paper structure is quite odd.
- Section 3 describes three tasks in detail (3.1-3.3), while the three others are relegated to a brief "parallel task set" (3.4).
- Tasks 4-6 feel like afterthoughts.
- Consider restructuring to give equal treatment to all tasks, or clearly justify why some deserve more detailed exposition.
- Also, the authors should consider naming Section 4 "Results" or "Experiments".
- **Missing human baselines**: The paper lacks human performance benchmarks, making it diffi
- Coverage of social phenomena. The suite probes deception, self-role identification, multi-hop relational reasoning, and staged decision-making beyond static ToM/commonsense probes.
- Empirical takeaways. i) Long-chain prompting helps on deeper tasks. ii) Agent/workflow variants offer, at best, marginal gains on hidden-role deduction. iii) Light SFT/DPO produces sizable improvements.
- Beyond accuracy. The inclusion of round-by-round trajectories, uncertainty stress tests, and process metrics
My main concern is the ceiling effect on some tasks. Strong models achieve near-perfect scores on Social Graph Analysis, and the Review-Decision task shows ~90% for top systems in the final stage. This reduces discriminative power among frontier models.
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Multimodal Machine Learning Applications
