Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Chang Nie; Tianchen Deng; Guangming Wang; Zhe Liu; Hesheng Wang

arXiv:2603.16086 · cs.RO · March 18, 2026

Open Access

TL;DR

This paper introduces the HEAR framework for continuous, sound-centric manipulation in embodied agents, integrating vision, audio, language, and proprioception to improve real-time environmental interaction.

Contribution

It formalizes the Vision-Sound-Language-Action paradigm and presents HEAR, a novel multi-sensory framework with components for causal audio context, reasoning, temporal prediction, and action generation.

Findings

01. Causal audio context maintenance improves sound event detection.

02. Temporal dynamics learning enhances manipulation accuracy.

03. Benchmark results show robustness of the HEAR framework.

Abstract

While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we…
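
The Blind Execution Interval described in the abstract can be illustrated with a toy simulation. The sketch below is not the HEAR implementation; the function names, step counts, and window sizes are all hypothetical and chosen only for illustration. It compares a policy that listens to a short audio window right before each decision (then executes an action chunk open-loop) against one that ingests every audio frame into a causal buffer between decisions.

```python
from collections import deque


def windowed_observations(event_steps, total_steps, chunk_len, window_len):
    """Events heard by a policy that only observes the last `window_len`
    steps before each decision, with one decision every `chunk_len` steps.
    (Hypothetical toy model of discrete audio observation windows.)"""
    heard = set()
    for decision_step in range(0, total_steps + 1, chunk_len):
        window = range(max(0, decision_step - window_len), decision_step)
        heard.update(s for s in event_steps if s in window)
    return heard


def streaming_observations(event_steps, total_steps, chunk_len, memory_len):
    """Events heard by a policy that appends every frame to a causal buffer
    of length `memory_len` and reads the buffer at each decision.
    (Hypothetical toy model of continuous auditory awareness.)"""
    events = set(event_steps)
    buffer = deque(maxlen=memory_len)
    heard = set()
    for step in range(total_steps):
        buffer.append(step)  # every frame is ingested, even mid-chunk
        if (step + 1) % chunk_len == 0:  # decision point at end of a chunk
            heard.update(s for s in buffer if s in events)
    return heard


# Assumed setup: three brief acoustic events, 10-step action chunks,
# and a 4-step audio window for the windowed policy.
events = [3, 12, 27]
missed = set(events) - windowed_observations(events, 40, 10, 4)
kept = streaming_observations(events, 40, 10, 10)
print("missed by windowed policy:", sorted(missed))
print("heard by streaming policy:", sorted(kept))
```

In this toy setup the windowed policy covers only 4 of every 10 steps, so any event falling in the remaining 6 steps is never observed. The streaming buffer covers the full chunk as long as its memory is at least as long as the chunk.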

Taxonomy

Topics: Music Technology and Sound Studies · Social Robot Interaction and HRI · Multimodal Machine Learning Applications