Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind

Lance Ying; Xinyi Li; Shivam Aarya; Yizirui Fang; Yifan Yin; Jason Xinyu Liu; Stefanie Tellex; Joshua B. Tenenbaum; Tianmin Shu

arXiv:2409.10849 · cs.RO · October 7, 2025

TL;DR

This paper introduces SIFToM, a neurosymbolic model inspired by human cognition, enabling robots to pragmatically interpret and follow spoken instructions in noisy, real-world environments by leveraging a Theory of Mind approach.

Contribution

The paper presents a novel cognitively inspired neurosymbolic model that improves spoken instruction following in robots by integrating mental inference with vision-language understanding.

Findings

01 SIFToM outperforms existing VLMs in noisy conditions

02 Achieves near human-level accuracy in real-world tasks

03 Significantly enhances robot instruction-following performance

Abstract

Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results…
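The idea of using collaborative context to resolve a noisy instruction can be illustrated with a toy Bayesian sketch. This is not SIFToM's implementation, which the abstract describes only at a high level. The goal names, the example utterances, the `sharpness` parameter, and the use of string similarity as a stand-in for a speech likelihood are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def infer_goal(transcript, goal_utterances, prior, sharpness=5.0):
    """Toy posterior over goals given a noisy transcript.

    goal_utterances maps each hypothetical goal to the instruction a
    speaker would likely give for it. prior encodes how plausible each
    goal is given the observed context. The likelihood is a crude proxy:
    it grows with the string similarity between the transcript and the
    expected instruction.
    """
    scores = {}
    for goal, utterance in goal_utterances.items():
        similarity = SequenceMatcher(
            None, transcript.lower(), utterance.lower()
        ).ratio()
        scores[goal] = prior[goal] * math.exp(sharpness * similarity)
    total = sum(scores.values())
    return {goal: score / total for goal, score in scores.items()}


# Hypothetical scenario: the human is setting a table, and the
# recognizer returns a transcript that matches neither instruction.
goal_utterances = {
    "get_plate": "bring me a plate",
    "get_grape": "bring me a grape",
}
noisy_transcript = "bring me a great"

# Without context, the transcript alone decides the interpretation.
uniform_prior = {"get_plate": 0.5, "get_grape": 0.5}
# With context, table setting makes a plate far more plausible.
context_prior = {"get_plate": 0.8, "get_grape": 0.2}

without_context = infer_goal(noisy_transcript, goal_utterances, uniform_prior)
with_context = infer_goal(noisy_transcript, goal_utterances, context_prior)

print(max(without_context, key=without_context.get))
print(max(with_context, key=with_context.get))
```

In this sketch the transcript is slightly closer to the grape instruction, so a uniform prior favors that reading, while the context-informed prior shifts the most probable goal to the plate. The point is only to show how a prior over the speaker's goal can outweigh a small difference in acoustic evidence.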

Taxonomy

Topics: Speech and Audio Processing · Neural Networks and Applications · Speech and dialogue systems