Residual Cross-Modal Fusion Networks for Audio-Visual Navigation
Yi Wang, Yinfeng Yu, Bin Ren

TL;DR
This paper introduces a Cross-Modal Residual Fusion Network (CRFN) that models interactions between the audio and visual modalities, improving audio-visual navigation performance and generalization in unseen environments.
Contribution
The paper proposes a Cross-Modal Residual Fusion Network (CRFN) with bidirectional residual interactions between the audio and visual streams, advancing multimodal fusion for embodied navigation tasks.
Findings
CRFN outperforms existing fusion methods on benchmark datasets.
Agents show different modality reliance across datasets.
The approach improves cross-domain generalization and robustness.
Abstract
Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve…
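The bidirectional residual interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the projection matrices `w_v2a` and `w_a2v`, and the scale factor `alpha` are all hypothetical, and the stabilization techniques mentioned in the abstract are not modeled because the excerpt does not specify them. The sketch only shows the general idea that each stream keeps its own features and adds a scaled, projected copy of the other stream as a residual.

```python
def linear(x, w):
    """Project vector x with weight matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def cross_residual_fuse(audio, visual, w_v2a, w_a2v, alpha=0.1):
    """Hypothetical bidirectional residual fusion of two feature vectors.

    audio, visual : per-modality feature vectors (lists of floats)
    w_v2a, w_a2v  : assumed projection matrices (visual->audio, audio->visual)
    alpha         : assumed residual scale; alpha = 0 leaves both streams unchanged
    """
    # Each stream keeps its identity path and adds a scaled cross-modal term,
    # so the two representations stay separate rather than being concatenated.
    audio_out = [a + alpha * d for a, d in zip(audio, linear(visual, w_v2a))]
    visual_out = [v + alpha * d for v, d in zip(visual, linear(audio, w_a2v))]
    return audio_out, visual_out


# Usage with toy 2-dimensional features and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
fused_audio, fused_visual = cross_residual_fuse(
    [1.0, 2.0], [3.0, 4.0], identity, identity, alpha=0.5
)
```

Because the cross-modal term enters only as an additive residual, setting `alpha` to zero recovers the unfused streams, which is one simple way to read the abstract's claim that the representations remain independent.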
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Music and Audio Processing
