EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Gurusha Juneja; Dylan Lu; Saaket Agashe; Parth Diwane; Edward Gunn; Jayanth Srinivasa; Gaowen Liu; William Yang Wang; Yali Du; and Xin Eric Wang

arXiv:2605.09826·cs.AI·May 19, 2026

TL;DR

EnactToM is an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household. It evaluates AI agents' functional Theory of Mind: acting on others' implicit beliefs rather than answering direct belief questions.

Contribution

The paper presents EnactToM, an evolving benchmark for testing functional Theory of Mind in embodied agents. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve.

Findings

01

On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion

02

The same models average 45.0% on literal belief probes

03

Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns

Abstract

Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to…
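The abstract reports Pass^3 without defining it in this excerpt. The sketch below assumes the convention used by some agent benchmarks, in which Pass^k is the probability that all k independent attempts at a task succeed, estimated per task as C(c, k) / C(n, k) for c successes in n attempts and then averaged over tasks. The function names and the sample results are hypothetical and do not come from the paper.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k attempts drawn from `trials` all succeed.

    Assumed estimator: C(successes, k) / C(trials, k).
    With trials == k this is 1.0 only if every attempt succeeded.
    """
    if k < 1 or k > trials or not 0 <= successes <= trials:
        raise ValueError("need 1 <= k <= trials and 0 <= successes <= trials")
    # math.comb returns 0 when successes < k, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results[i]` holds the success flags of every attempt at task i.
    """
    per_task = [pass_hat_k(sum(flags), len(flags), k) for flags in results]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: four tasks, three attempts each.
sample_results = [
    [True, True, True],     # solved on every attempt
    [True, False, True],    # one failed attempt, so it does not count
    [False, False, False],  # never solved
    [True, True, True],     # solved on every attempt
]
score = benchmark_pass_hat_k(sample_results, k=3)
```

Under this assumed definition, a single failed attempt zeroes out a task, so a 0.0% score means no task was solved on all three attempts; it does not by itself imply that no individual attempt ever succeeded.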
