From Instruction to Event: Sound-Triggered Mobile Manipulation
Hao Ju, Shaofei Huang, Hongyu Li, Zihan Ding, Si Liu, Meng Wang, Zhedong Zheng

TL;DR
This paper introduces sound-triggered mobile manipulation, enabling agents to perceive and interact with sound-emitting objects without explicit commands, supported by a new data platform and a baseline system.
Contribution
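The paper presents three things: a sound-triggered mobile manipulation paradigm; Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction; and a baseline system, built from a high-level task planner and low-level policy models, for active auditory perception and interaction in mobile agents.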
Findings
Agents can detect and respond to auditory events without explicit instructions.
When multiple sounds overlap, the baseline system isolates the primary sound source.
Agents can manipulate secondary objects after identifying primary sound sources.
Abstract
Current mobile manipulation research predominantly follows an instruction-driven paradigm, where agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, limiting their autonomy and ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support these tasks, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models to complete these tasks. Extensive experiments show that the proposed baseline empowers agents to actively detect and respond to auditory events, eliminating the need for case-by-case…
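The planner-plus-policies split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `SoundEvent` type, the `SKILL_PLANS` table, the skill names, the loudness threshold, and the "loudest audible source is primary" heuristic are all hypothetical, chosen only to show how a sound event, rather than a text instruction, could start a task.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SoundEvent:
    label: str       # hypothetical sound class, e.g. "alarm_clock"
    loudness: float  # hypothetical received energy, normalized to [0, 1]


# Hypothetical high-level plans: sound class -> ordered low-level skills.
SKILL_PLANS: Dict[str, List[str]] = {
    "alarm_clock": ["navigate_to_source", "press_button"],
    "running_faucet": ["navigate_to_source", "turn_off"],
}


def pick_primary(events: List[SoundEvent],
                 threshold: float = 0.2) -> Optional[SoundEvent]:
    """Assumed heuristic: the loudest event above a threshold is primary."""
    audible = [e for e in events if e.loudness >= threshold]
    return max(audible, key=lambda e: e.loudness) if audible else None


def plan(events: List[SoundEvent]) -> List[str]:
    """High-level planner: map the primary sound event to a skill sequence."""
    primary = pick_primary(events)
    if primary is None:
        return []  # nothing audible, so the agent stays idle
    # Unknown sound classes fall back to approaching the source.
    return SKILL_PLANS.get(primary.label, ["navigate_to_source"])


def execute(skills: List[str]) -> List[str]:
    """Stand-in for low-level policies: each skill just logs that it ran."""
    return [f"ran {name}" for name in skills]
```

For example, `execute(plan([SoundEvent("alarm_clock", 0.9), SoundEvent("running_faucet", 0.4)]))` dispatches the alarm-clock plan, because that event is the louder of the two overlapping sounds. In a real system the stub `execute` would be replaced by learned navigation and manipulation policies, and the primary source would come from an audio model rather than a loudness comparison.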
