Residual Cross-Modal Fusion Networks for Audio-Visual Navigation

Yi Wang; Yinfeng Yu; Bin Ren

arXiv:2601.08868 · cs.CV · January 15, 2026

Open Access

TL;DR

This paper introduces a residual cross-modal fusion network that models interactions between the audio and visual modalities, improving audio-visual navigation performance and generalization to unseen environments.

Contribution

The paper proposes a Cross-Modal Residual Fusion Network (CRFN) that uses bidirectional residual interactions between the audio and visual streams, advancing multimodal fusion for embodied navigation tasks.

Findings

1. CRFN outperforms existing fusion methods on benchmark datasets.
2. Agents show different modality reliance across datasets.
3. The approach improves cross-domain generalization and robustness.

Abstract

Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network (CRFN), which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve…
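To make the idea of bidirectional residual interaction concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `ResidualCrossModalFusion`, the linear projections, the `tanh` nonlinearity, the layer normalization, and the `scale` factor are all assumptions chosen for illustration, and the abstract does not say which stabilization techniques the paper actually uses. The sketch only shows the general pattern in which each stream keeps its own representation and receives a small correction computed from the other stream.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class ResidualCrossModalFusion:
    """Hypothetical bidirectional residual fusion block (illustrative only)."""

    def __init__(self, dim, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed: one linear projection per direction of interaction.
        self.w_a2v = rng.normal(0.0, dim ** -0.5, size=(dim, dim))
        self.w_v2a = rng.normal(0.0, dim ** -0.5, size=(dim, dim))
        # Assumed stabilizer: a small factor that bounds each residual update.
        self.scale = scale

    def __call__(self, audio, visual):
        # Each stream keeps its identity path and adds a bounded correction
        # computed from the other (normalized) stream.
        audio_out = audio + self.scale * np.tanh(layer_norm(visual) @ self.w_v2a)
        visual_out = visual + self.scale * np.tanh(layer_norm(audio) @ self.w_a2v)
        return audio_out, visual_out


# Toy batch of 4 observations with 8-dimensional audio and visual features.
rng = np.random.default_rng(1)
audio = rng.normal(size=(4, 8))
visual = rng.normal(size=(4, 8))

fusion = ResidualCrossModalFusion(dim=8)
audio_out, visual_out = fusion(audio, visual)

# The two streams stay separate until a downstream policy consumes them;
# concatenating them here is only one possible choice.
fused = np.concatenate([audio_out, visual_out], axis=-1)
```

Because the correction passes through `tanh` and is multiplied by `scale`, neither stream can move far from its original features in a single block, which is one simple way to keep either modality from dominating. With `scale=0` the block reduces to the identity on both streams.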

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Music and Audio Processing