Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation

Siyuan Li; Jiani Lu; Yu Song; Xianren Li; Bo An; Peng Liu

arXiv:2602.13640 · cs.RO · February 17, 2026

Open Access

TL;DR

This paper introduces a hierarchical multimodal fusion framework that integrates audio, vision, and proprioception for precise robotic manipulation, and reports improved performance, especially when acoustic cues are informative.

Contribution

The paper proposes a novel hierarchical fusion approach that explicitly models cross-modal interactions and conditions visual and proprioceptive data on acoustic cues for robotic tasks.
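To make the idea of conditioning visual and proprioceptive data on acoustic cues concrete, here is a minimal sketch in Python with NumPy. It is not the authors' architecture: the source does not specify the conditioning mechanism, so this sketch assumes FiLM-style feature-wise scaling and shifting as a stand-in, followed by a simple concatenate-and-project fusion step. The embedding sizes, the `film` and `hierarchical_fuse` helpers, and the random weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Return randomly initialised (weights, bias) for a toy linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def film(features, audio, gamma_layer, beta_layer):
    """Scale and shift `features` with parameters predicted from `audio`.

    FiLM-style modulation is an assumption made for illustration only.
    """
    gamma = audio @ gamma_layer[0] + gamma_layer[1]
    beta = audio @ beta_layer[0] + beta_layer[1]
    return (1.0 + gamma) * features + beta

# Hypothetical embedding sizes; not taken from the paper.
D_AUDIO, D_VISION, D_PROPRIO, D_FUSED = 16, 32, 8, 24

params = {
    "vis_gamma": linear(D_AUDIO, D_VISION),
    "vis_beta": linear(D_AUDIO, D_VISION),
    "pro_gamma": linear(D_AUDIO, D_PROPRIO),
    "pro_beta": linear(D_AUDIO, D_PROPRIO),
    "fuse": linear(D_VISION + D_PROPRIO, D_FUSED),
}

def hierarchical_fuse(audio, vision, proprio, p=params):
    # Stage 1: condition each non-audio stream on the audio embedding.
    vision_c = film(vision, audio, p["vis_gamma"], p["vis_beta"])
    proprio_c = film(proprio, audio, p["pro_gamma"], p["pro_beta"])
    # Stage 2: fuse the conditioned streams into one joint representation.
    joint = np.concatenate([vision_c, proprio_c], axis=-1)
    weights, bias = p["fuse"]
    return np.tanh(joint @ weights + bias)

# Random stand-ins for per-modality encoder outputs (batch of 4).
batch = 4
audio = rng.normal(size=(batch, D_AUDIO))
vision = rng.normal(size=(batch, D_VISION))
proprio = rng.normal(size=(batch, D_PROPRIO))

fused = hierarchical_fuse(audio, vision, proprio)
print(fused.shape)  # (4, 24)
```

In this sketch the audio embedding never enters the final concatenation directly; it acts only by modulating the other two streams. That asymmetry is one way to contrast a hierarchical design with a flat, symmetric concatenation of all three modalities.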

Findings

01. Outperforms state-of-the-art multimodal fusion methods in real-world tasks.

02. Effectively leverages acoustic cues to improve manipulation accuracy.

03. Mutual information analysis clarifies the role of audio in multimodal perception.
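For readers unfamiliar with mutual information, the sketch below shows the quantity on toy data. It is a plug-in estimate for paired discrete samples, chosen for brevity; the source does not say which estimator the paper uses, and learned continuous representations generally call for a different one. The function name and the data are invented for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Made-up toy data: a binary "contact sound heard" flag per time step.
audio_event = [0, 0, 1, 1, 0, 1, 0, 1]
# A contact state that matches the audio flag exactly.
contact = [0, 0, 1, 1, 0, 1, 0, 1]
# A variable that is statistically independent of the audio flag in this sample.
unrelated = [0, 1, 0, 1, 1, 0, 0, 1]

print(mutual_information(audio_event, contact))    # 1.0
print(mutual_information(audio_event, unrelated))  # 0.0
```

A value of 1 bit means the audio flag fully determines the binary contact state in this sample, while 0 bits means it carries no information about the other variable.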

Abstract

Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic…


Taxonomy

Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Speech and Audio Processing