DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

Jingqi Tian; Yiheng Du; Haoji Zhang; Yuji Wang; Isaac Ning Lee; Xulong Bai; Tianrui Zhu; Jingxuan Niu; Yansong Tang

arXiv:2512.20117 · cs.CV · December 24, 2025

Open Access

TL;DR

DDAVS is an audio-visual segmentation framework that disentangles audio semantics and uses delayed bidirectional alignment between audio and visual features, targeting multi-source entanglement and audio-visual misalignment to improve segmentation accuracy.

Contribution

The paper proposes DDAVS, a new method that uses learnable queries and delayed bidirectional attention to disentangle audio semantics and improve multimodal alignment in AVS.

Findings

1. Outperforms existing methods on AVS-Objects and VPO benchmarks.
2. Demonstrates robustness in single-source, multi-source, and multi-instance scenarios.
3. Achieves superior segmentation accuracy and generalization.

Abstract

Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction,…
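
The abstract describes anchoring query-derived audio semantics to an audio prototype memory bank and refining them with contrastive learning. A minimal sketch of that general idea follows, assuming an InfoNCE-style objective over cosine similarities. This is an illustration only, not the authors' implementation: the function name `prototype_contrastive_loss`, the `temperature` value, the array shapes, and the assumption that each query comes with a known prototype index are all hypothetical.

```python
import numpy as np


def prototype_contrastive_loss(queries, prototypes, labels, temperature=0.1):
    """InfoNCE-style loss pulling each query toward its assigned prototype.

    Hypothetical shapes (not taken from the paper):
      queries:    (N, D) audio query embeddings
      prototypes: (K, D) prototype vectors, e.g. from a memory bank
      labels:     (N,)   index of the prototype each query should match
    """
    queries = np.asarray(queries, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    labels = np.asarray(labels)

    # L2-normalise so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    logits = (q @ p.T) / temperature                     # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Log-softmax over prototypes, then pick the assigned prototype per query.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Toy example: three orthogonal prototypes, queries identical to them.
prototypes = np.eye(3)
queries = np.eye(3)
aligned = prototype_contrastive_loss(queries, prototypes, [0, 1, 2])
misaligned = prototype_contrastive_loss(queries, prototypes, [1, 2, 0])
```

In this toy setup the loss is small when each query is assigned to the prototype it coincides with and large when the assignments are shuffled, which is the behaviour a contrastive anchoring objective is meant to reward.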

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration