Spatial-Aware Conditioned Fusion for Audio-Visual Navigation

Shaohang Wu; Yinfeng Yu

arXiv:2604.02390 · cs.SD · April 6, 2026

TL;DR

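This paper introduces SACF, a method for audio-visual navigation that discretizes the target's relative direction and distance, encodes the predicted distributions as a compact spatial descriptor, and uses that descriptor together with audio embeddings to modulate visual features, with the aim of improving navigation efficiency and generalization.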

Contribution

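The paper proposes a spatial-aware fusion technique that explicitly models the target's relative position as discrete direction and distance distributions, then conditions visual features on audio embeddings and the resulting spatial descriptor through channel-wise scaling and bias.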

Findings

01 SACF achieves higher navigation efficiency.

02 SACF generalizes well to unseen sounds.

03 SACF reduces computational overhead.

Abstract

Audio-visual navigation tasks require agents to locate and navigate toward continuously vocalizing targets using only visual observations and acoustic cues. However, existing methods mainly rely on simple feature concatenation or late fusion, and lack an explicit discrete representation of the target's relative position, which limits learning efficiency and generalization. We propose Spatial-Aware Conditioned Fusion (SACF). SACF first discretizes the target's relative direction and distance from audio-visual cues, predicts their distributions, and encodes them as a compact descriptor for policy conditioning and state modeling. Then, SACF uses audio embeddings and spatial descriptors to generate channel-wise scaling and bias to modulate visual features via conditional linear transformation, producing target-oriented fused representations. SACF improves navigation efficiency with lower…
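
The two steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bin counts, embedding sizes, random linear maps, and the function names `spatial_descriptor` and `film_modulate` are all hypothetical, and centering the scale on 1 is an assumption made here so that zero weights leave the visual features unchanged. The sketch only shows how predicted direction and distance distributions could be concatenated into a compact descriptor, and how that descriptor plus an audio embedding could produce a per-channel scale and bias for a visual feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not specify any of them.
N_DIR, N_DIST = 8, 4        # number of direction / distance bins
AUDIO_DIM = 16              # audio embedding size
C, H, W = 32, 7, 7          # visual feature map: channels, height, width


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def spatial_descriptor(dir_logits, dist_logits):
    """Concatenate the predicted direction and distance distributions."""
    return np.concatenate([softmax(dir_logits), softmax(dist_logits)])


def film_modulate(visual, cond, w_gamma, w_beta):
    """Channel-wise conditional linear transform of a (C, H, W) map.

    out[c] = gamma[c] * visual[c] + beta[c], with gamma and beta computed
    from the conditioning vector. The "1 +" offset is an assumption.
    """
    gamma = 1.0 + cond @ w_gamma   # shape (C,)
    beta = cond @ w_beta           # shape (C,)
    return gamma[:, None, None] * visual + beta[:, None, None]


# Stand-ins for network outputs (random here, learned in practice).
audio_emb = rng.standard_normal(AUDIO_DIM)
descriptor = spatial_descriptor(
    rng.standard_normal(N_DIR), rng.standard_normal(N_DIST)
)

# Conditioning vector: audio embedding plus spatial descriptor.
cond = np.concatenate([audio_emb, descriptor])

# Hypothetical linear maps that generate the scale and bias.
w_gamma = 0.01 * rng.standard_normal((cond.size, C))
w_beta = 0.01 * rng.standard_normal((cond.size, C))

visual = rng.standard_normal((C, H, W))
fused = film_modulate(visual, cond, w_gamma, w_beta)
print(fused.shape)
```

In this sketch the descriptor has `N_DIR + N_DIST` entries, and each half sums to 1 because it is a probability distribution over bins. The modulation adds no spatial parameters: one scale and one bias per channel are broadcast over the whole feature map.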
