Audio Spatially-Guided Fusion for Audio-Visual Navigation
Xinyu Zhou, Yinfeng Yu

TL;DR
This paper introduces an audio-visual navigation method that adaptively fuses multimodal features using spatial audio cues, aiming to improve autonomous path planning and generalization in complex environments.
Contribution
The proposed approach employs an audio spatial feature encoder and a spatial state guided fusion mechanism to dynamically align and fuse multimodal information, reducing noise and perceptual uncertainty.
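As a rough illustration of what "spatial state guided fusion" could look like, the sketch below blends a visual and an audio feature vector with a scalar gate computed from a spatial state vector. This is not the paper's ASGF module: the function name `guided_fusion`, the single sigmoid gate, and the parameters `gate_weights` and `gate_bias` are all assumptions made for illustration, and in practice the gate would be learned rather than set by hand.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def guided_fusion(visual, audio, spatial_state, gate_weights, gate_bias=0.0):
    """Blend two feature vectors using a gate derived from a spatial state.

    Hypothetical sketch: a linear projection of the spatial state is squashed
    to a gate g in (0, 1); the fused feature is g * audio + (1 - g) * visual.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio features must have the same length")
    if len(spatial_state) != len(gate_weights):
        raise ValueError("spatial_state and gate_weights must have the same length")

    # Linear projection of the spatial state to a single gate score.
    score = sum(w * s for w, s in zip(gate_weights, spatial_state)) + gate_bias
    gate = sigmoid(score)

    # Convex combination: a high gate trusts audio more, a low gate trusts vision more.
    fused = [gate * a + (1.0 - gate) * v for v, a in zip(visual, audio)]
    return fused, gate


if __name__ == "__main__":
    fused, gate = guided_fusion(
        visual=[1.0, 0.0],
        audio=[0.0, 1.0],
        spatial_state=[0.0, 0.0],
        gate_weights=[1.0, 1.0],
    )
    print(gate, fused)
```

With a zero spatial state the gate sits at 0.5 and the two modalities are averaged; a strongly positive projection pushes the result toward the audio features. A single scalar gate is the simplest choice here, and a per-dimension gate or cross-attention would be natural alternatives.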
Findings
Effective on unseen tasks with unknown sound sources
Improved generalization in complex 3D environments
Outperforms existing methods on Replica and Matterport3D datasets
Abstract
Audio-visual navigation refers to an agent using visual and auditory information in complex 3D environments to localize a target and plan a path to it, thereby navigating autonomously. The core challenge of this task is how the agent can reduce its dependence on training data and still generalize well when environments and sound sources change. To address this challenge, we propose Audio Spatially-Guided Fusion for Audio-Visual Navigation. First, we design an audio spatial feature encoder, which adaptively extracts target-related spatial state information through an audio intensity attention mechanism; based on this, we introduce Audio Spatial State Guided Fusion (ASGF) to achieve dynamic alignment and adaptive fusion of multimodal features, effectively alleviating…
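The abstract does not spell out how the audio intensity attention mechanism works. One plausible reading, sketched below under stated assumptions, is that per-bin audio features are pooled with weights derived from per-bin intensity, so louder regions contribute more to the spatial state. The function name `intensity_attention`, the softmax weighting, and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def intensity_attention(features, intensities, temperature=1.0):
    """Pool per-bin feature vectors with softmax weights over bin intensities.

    Hypothetical sketch: `features` is a list of equal-length vectors (one per
    time-frequency bin or patch) and `intensities` holds one scalar per bin.
    """
    if not features or len(features) != len(intensities):
        raise ValueError("features and intensities must be non-empty and aligned")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Numerically stable softmax over intensities.
    peak = max(intensities)
    exps = [math.exp((x - peak) / temperature) for x in intensities]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the feature vectors, dimension by dimension.
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    pooled, weights = intensity_attention(
        features=[[1.0, 0.0], [0.0, 1.0]],
        intensities=[0.2, 0.2],
    )
    print(weights, pooled)
```

Equal intensities give uniform weights, so the pooled vector is the plain mean of the bin features; lowering `temperature` concentrates the weight on the loudest bin.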
