Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation
Siyuan Li, Jiani Lu, Yu Song, Xianren Li, Bo An, Peng Liu

TL;DR
This paper introduces a hierarchical multimodal fusion framework that integrates audio, vision, and proprioception for precise robotic manipulation, with the largest gains reported when acoustic cues are informative.
Contribution
The paper proposes a hierarchical fusion approach for robotic tasks that conditions visual and proprioceptive representations on acoustic cues and explicitly models cross-modal interactions.
Findings
Outperforms state-of-the-art multimodal fusion methods in real-world tasks.
Effectively leverages acoustic cues to improve manipulation accuracy.
Mutual information analysis clarifies the role of audio in multimodal perception.
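This summary does not say how the mutual information analysis was carried out. As a reminder of the quantity involved, the sketch below computes a plug-in estimate of I(X; Y) for paired discrete samples. The function name `mutual_information` and the idea of discretizing audio features and contact states into labels are assumptions made for illustration, not the paper's estimator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    Illustrative only: the paper's actual estimator is not described
    in this summary.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

If a hypothetical discretized audio label perfectly tracks a binary contact state, the estimate is 1 bit; if the two are independent, it is 0. Plug-in estimates are biased upward on small samples, so they are best read comparatively.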
Abstract
Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic…
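The abstract is cut off before it specifies the conditioning mechanism, so the following is only a minimal sketch of one common way to condition one representation on another: feature-wise scale and shift (FiLM-style modulation), followed by plain concatenation. The use of FiLM, the function names `film`, `fuse`, and `make_params`, and all dimensions are assumptions for illustration and should not be read as the authors' architecture.

```python
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, audio, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift `features` with parameters predicted from `audio`.

    Uses (1 + gamma) so that zero predicted parameters leave the
    features unchanged. FiLM is an assumed stand-in mechanism.
    """
    gamma = linear(audio, w_gamma, b_gamma)
    beta = linear(audio, w_beta, b_beta)
    return [(1.0 + g) * f + b for g, f, b in zip(gamma, features, beta)]


def make_params(rng, audio_dim, vision_dim, proprio_dim, scale=0.1):
    """Random weights and zero biases for both conditioned streams."""
    def stream(out_dim):
        def matrix():
            return [[rng.uniform(-scale, scale) for _ in range(audio_dim)]
                    for _ in range(out_dim)]
        zeros = [0.0] * out_dim
        # Order matches film(): w_gamma, b_gamma, w_beta, b_beta
        return (matrix(), list(zeros), matrix(), list(zeros))
    return {"vision": stream(vision_dim), "proprio": stream(proprio_dim)}


def fuse(audio, vision, proprio, params):
    """Two-stage fusion: audio-conditioned streams, then concatenation."""
    # Stage 1: condition each non-audio stream on the audio embedding.
    vision_cond = film(vision, audio, *params["vision"])
    proprio_cond = film(proprio, audio, *params["proprio"])
    # Stage 2: combine the conditioned streams (concatenation here;
    # a learned cross-modal layer would replace this in practice).
    return vision_cond + proprio_cond
```

With zero biases, an all-zero audio embedding leaves the visual and proprioceptive features untouched. This reflects the intuition that sparse, contact-driven audio should modulate the other streams only when it carries signal.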
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Speech and Audio Processing
