Audio Spatially-Guided Fusion for Audio-Visual Navigation

Xinyu Zhou; Yinfeng Yu

arXiv:2604.02389·cs.SD·April 6, 2026

TL;DR

This paper introduces an audio-visual navigation method that uses spatial audio cues to guide the adaptive fusion of visual and auditory features, with the aim of improving generalization to unseen environments and sound sources.

Contribution

The proposed approach employs an audio spatial feature encoder and a spatial state guided fusion mechanism to dynamically align and fuse multimodal information, reducing noise and perceptual uncertainty.

Findings

01 Effective on unseen tasks with unknown sound sources

02 Improved generalization in complex 3D environments

03 Outperforms existing methods on Replica and Matterport3D datasets

Abstract

Audio-visual navigation is the task in which an agent uses visual and auditory information in complex 3D environments to localize a target and plan a path to it, thereby navigating autonomously. The core challenge is generalization: how the agent can reduce its dependence on training data and still navigate well when environments and sound sources change. To address this challenge, we propose Audio Spatially-Guided Fusion for Audio-Visual Navigation. First, we design an audio spatial feature encoder that adaptively extracts target-related spatial state information through an audio intensity attention mechanism. Building on this, we introduce Audio Spatial State Guided Fusion (ASGF) to achieve dynamic alignment and adaptive fusion of multimodal features, effectively alleviating…
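The abstract names two components, an intensity-based attention over audio and a fusion step guided by the resulting audio spatial state. The listing does not give their equations, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. Every name here (`intensity_attention`, `guided_fusion`, `w_gate`, `b_gate`), the use of per-frame energy as the attention score, and the sigmoid gate are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)


def intensity_attention(spec):
    """Pool a magnitude spectrogram over time, weighting louder frames more.

    spec: array of shape (channels, freq, time), e.g. a binaural spectrogram.
    Returns (state, weights): a flat (channels * freq,) state vector and the
    (time,) attention weights. ASSUMPTION: mean energy per frame is the score.
    """
    intensity = np.mean(spec ** 2, axis=(0, 1))            # (time,)
    weights = softmax(intensity)                           # sums to 1
    pooled = np.tensordot(spec, weights, axes=([2], [0]))  # (channels, freq)
    return pooled.reshape(-1), weights


def guided_fusion(visual, audio, state, w_gate, b_gate):
    """Blend visual and audio features with a gate computed from the state.

    visual, audio: (d,) feature vectors; state: (s,) audio spatial state.
    w_gate: (d, s) and b_gate: (d,) are hypothetical learned parameters.
    ASSUMPTION: a per-dimension sigmoid gate forms a convex combination.
    """
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ state + b_gate)))  # values in [0, 1]
    fused = gate * audio + (1.0 - gate) * visual
    return fused, gate


# Usage with random stand-in data (no real audio or trained weights).
rng = np.random.default_rng(0)
spec = np.abs(rng.normal(size=(2, 8, 5)))    # 2 channels, 8 bins, 5 frames
state, weights = intensity_attention(spec)   # state has 16 entries
d = 4
visual = rng.normal(size=d)
audio = rng.normal(size=(d, state.size)) @ state  # hypothetical projection
w_gate = 0.1 * rng.normal(size=(d, state.size))
fused, gate = guided_fusion(visual, audio, state, w_gate, np.zeros(d))
```

In this sketch the gate depends only on the audio state, so the audio signal decides, per feature dimension, how much weight the audio and visual streams receive. The method described in the paper may differ in both the scoring function and the fusion operator.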
