Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang

TL;DR
This paper introduces the HEAR framework for continuous, sound-centric manipulation in embodied agents, integrating vision, audio, language, and proprioception to improve real-time environmental interaction.
Contribution
It formalizes the Vision-Sound-Language-Action paradigm and presents HEAR, a novel multi-sensory framework with components for causal audio context, reasoning, temporal prediction, and action generation.
Findings
Causal audio context maintenance improves sound event detection.
Temporal dynamics learning enhances manipulation accuracy.
Benchmark results show robustness of the HEAR framework.
Abstract
While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we…
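The Blind Execution Interval described in the abstract can be illustrated with a toy timeline. The sketch below is not the HEAR implementation; it assumes a policy that is queried once every `chunk_len` control steps and, at each query, only hears the most recent `window` audio frames. All names (`observed_steps`, `blind_steps`, `chunk_len`, `window`) are hypothetical and chosen for illustration only.

```python
def observed_steps(total_steps: int, chunk_len: int, window: int) -> set[int]:
    """Time steps whose audio reaches the policy.

    Assumption (illustrative only): the policy is queried at steps
    0, chunk_len, 2*chunk_len, ... and each query looks back over the
    last `window` audio frames, including the current one.
    """
    seen: set[int] = set()
    for query_t in range(0, total_steps, chunk_len):
        start = max(0, query_t - window + 1)
        seen.update(range(start, query_t + 1))
    return seen


def blind_steps(total_steps: int, chunk_len: int, window: int) -> list[int]:
    """Time steps that fall between observation windows and are never heard."""
    return sorted(set(range(total_steps)) - observed_steps(total_steps, chunk_len, window))


# A short audio window leaves gaps while each action chunk runs open-loop.
short_window_gaps = blind_steps(total_steps=17, chunk_len=8, window=3)

# A causal context spanning the whole chunk leaves no gap in this toy setting.
full_context_gaps = blind_steps(total_steps=17, chunk_len=8, window=8)

# A brief sound at step 11 is missed in the first case and heard in the second.
click_step = 11
missed_with_short_window = click_step in short_window_gaps
missed_with_full_context = click_step in full_context_gaps
```

In this toy setting, any sound that starts and ends inside a gap never reaches the policy, which is the failure mode the abstract attributes to action chunking with open-loop execution. Keeping an audio context at least as long as the chunk closes the gap here, though a real system would also have to account for inference latency.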
Taxonomy
Topics: Music Technology and Sound Studies · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
