SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski, Kristian Kersting

TL;DR
SocialGrid is a new benchmark environment inspired by Among Us, designed to evaluate social reasoning, planning, and task execution in embodied multi-agent systems using Large Language Models.
Contribution
We introduce SocialGrid, a comprehensive environment with metrics and a leaderboard to assess and improve social reasoning in embodied multi-agent LLMs.
Findings
Even the strongest open model evaluated, GPT-OSS-120B, achieves below 60% accuracy in task completion and planning.
Agents struggle to detect deception, performing near random chance regardless of model scale.
Planning assistance improves task completion, but social reasoning remains a bottleneck.
Abstract
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents detect deception at near-random accuracy regardless of scale, relying on…
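To make the idea of a Planning Oracle concrete, here is a minimal sketch of what such a helper could look like. It is an illustration only, not SocialGrid's implementation. It assumes a 2D grid where "#" marks an obstacle, and it uses breadth-first search to return the first move on a shortest path. The function name oracle_next_move, the grid encoding, and the move names are all hypothetical.

```python
from collections import deque

# Hypothetical move set; SocialGrid's real action space may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def oracle_next_move(grid, start, goal):
    """Return the first move on a shortest path from start to goal.

    grid  -- list of equal-length strings, "#" is an obstacle (assumed encoding)
    start -- (row, col) of the agent
    goal  -- (row, col) of the target cell
    Returns "stay" if already at the goal, or None if the goal is unreachable.
    """
    if start == goal:
        return "stay"
    rows, cols = len(grid), len(grid[0])
    # first_move[cell] is the initial move of the path that first reached cell.
    first_move = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for name, (dr, dc) in MOVES.items():
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue  # off the grid
            if grid[nxt[0]][nxt[1]] == "#" or nxt in first_move:
                continue  # blocked or already visited
            # Neighbors of start record their own move; others inherit it.
            first_move[nxt] = first_move[(r, c)] or name
            if nxt == goal:
                return first_move[nxt]
            queue.append(nxt)
    return None
```

With a helper like this handling navigation, an agent's remaining errors can be attributed to its social reasoning rather than to pathfinding, which is the separation the abstract describes.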
