EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, and Xin Eric Wang

TL;DR
EnactToM is an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household. It evaluates AI agents' functional Theory of Mind, meaning the ability to act on others' implicit beliefs rather than answer direct belief questions.
Contribution
The paper presents EnactToM, an evolving benchmark for testing functional Theory of Mind in embodied agents. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve.
Findings
On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion
The same models average 45.0% on literal belief probes
Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns
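The abstract reports functional task completion as Pass^3 but the text available here does not define the metric. The sketch below shows one common reading, in which a task counts as solved only if all k independent trials succeed. That reading is an assumption, and the function name `pass_hat_k`, the result layout, and the example data are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeeded.

    Assumption: Pass^k is the strict "every trial passes" metric.
    The paper may define or estimate it differently.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial among the k means the task is not solved.
        if all(trials[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical per-task outcomes over three trials each (illustrative only).
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
score = pass_hat_k(example, k=3)
print(f"{score:.1%}")  # 50.0%
```

Under this reading, a model that succeeds on some trials of a task but not all of them still receives no credit for it, so a score of 0.0% does not imply that no individual trial ever succeeded.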
Abstract
Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to…
