Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind
Lance Ying, Xinyi Li, Shivam Aarya, Yizirui Fang, Yifan Yin, Jason Xinyu Liu, Stefanie Tellex, Joshua B. Tenenbaum, Tianmin Shu

TL;DR
This paper introduces SIFToM, a neurosymbolic model inspired by human cognition, enabling robots to pragmatically interpret and follow spoken instructions in noisy, real-world environments by leveraging a Theory of Mind approach.
Contribution
The paper presents a novel cognitively inspired neurosymbolic model that improves spoken instruction following in robots by integrating mental inference with vision-language understanding.
Findings
SIFToM outperforms existing VLMs in noisy conditions
Achieves near human-level accuracy in real-world tasks
Significantly enhances robot instruction following performance
Abstract
Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Neural Networks and Applications · Speech and dialogue systems
