AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment

Yu Chen; Hongxu Zhu; Jiadong Wang; Kainan Chen; Xinyuan Qian

arXiv:2507.07384 · cs.SD · August 7, 2025

TL;DR

This paper introduces AV-SSAN, a framework for audio-visual sound source localization that selectively localizes a target source using visual prompts of the same semantic class, without requiring spatially paired data, and demonstrates its effectiveness on the new large-scale VGGSound-SSL dataset.

Contribution

The paper proposes the AV-SSAN framework and its Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net), addressing limitations of existing audio-visual sound source localization (AV-SSL) methods by enabling target-specific localization without spatially paired data.

Findings

01 Achieves 71.29% accuracy in target localization
02 Constructs the large-scale VGGSound-SSL dataset
03 Significantly outperforms existing AV-SSL methods

Abstract

Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methods typically require spatially paired audio-visual data and cannot selectively localize specific target sources. To address these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that localizes target sound sources using visual prompts from different instances of the same semantic class. CI-AVL enables selective localization without spatially paired data. To solve this task, we propose AV-SSAN, a semantic-spatial alignment framework centered on a Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net). MB-SSA Net decomposes the audio spectrogram into multiple frequency bands, aligns each band with semantic visual prompts, and refines spatial cues to estimate the direction-of-arrival (DoA). To…
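
The band-split, align, and fuse idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' MB-SSA Net: the function names, the cosine-similarity weighting, the per-band DoA scores, and the 1-degree azimuth grid are all assumptions made for illustration, and the learned encoders are replaced by hand-written toy vectors.

```python
import numpy as np

def split_into_bands(spec, num_bands):
    """Split a (freq, time) spectrogram into contiguous frequency bands."""
    return np.array_split(spec, num_bands, axis=0)

def band_alignment_weights(band_embeddings, prompt_embedding):
    """Softmax over the cosine similarity between each band embedding and
    a visual prompt embedding (a hypothetical alignment rule)."""
    eps = 1e-8
    b = band_embeddings / (np.linalg.norm(band_embeddings, axis=1, keepdims=True) + eps)
    p = prompt_embedding / (np.linalg.norm(prompt_embedding) + eps)
    sims = b @ p                      # one similarity per band
    exp = np.exp(sims - sims.max())   # numerically stable softmax
    return exp / exp.sum()

def estimate_doa(band_doa_scores, weights):
    """Fuse per-band DoA scores with the alignment weights and return the
    index of the highest-scoring angle plus the fused score vector."""
    fused = weights @ band_doa_scores  # shape: (num_angles,)
    return int(np.argmax(fused)), fused

# Toy inputs (all assumed): 64 frequency bins, 10 frames, 4 bands.
spec = np.random.default_rng(0).random((64, 10))
bands = split_into_bands(spec, 4)

# Stand-ins for learned band and prompt embeddings; band 2 matches the prompt.
band_embeddings = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0],
                            [1.0, 1.0, 0.0]])
prompt_embedding = np.array([0.0, 0.0, 1.0])

# Each band votes for a different azimuth on an assumed 360-point grid.
band_doa_scores = np.zeros((4, 360))
for band_idx, angle in enumerate([10, 200, 90, 300]):
    band_doa_scores[band_idx, angle] = 1.0

weights = band_alignment_weights(band_embeddings, prompt_embedding)
doa_index, fused = estimate_doa(band_doa_scores, weights)
print(doa_index)
```

In this toy setup the band whose embedding matches the prompt receives the largest weight, so its angle dominates the fused estimate; a real system would learn the embeddings and the spatial features rather than fixing them by hand.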
