QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj

TL;DR
QDFormer introduces a quantization-based semantic decomposition approach to improve audiovisual segmentation robustness in complex environments by disentangling multi-source audio semantics and distilling stable global features.
Contribution
The paper proposes a novel semantic decomposition method using product quantization and a global-to-local mechanism to enhance AVS performance in challenging scenarios.
Findings
Achieved a +21.2% mIoU gain on the AVS-Semantic benchmark.
Significantly improved robustness in complex audiovisual environments.
Demonstrated effectiveness of semantic decomposition in AVS tasks.
Abstract
Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence of distinct sound events. Assuming sound events occur independently, the multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces. We are motivated to decompose the multi-source audio semantics into single-source semantics for more effective interactions with visual content. We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and…
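The product-quantization idea mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the function name `product_quantize`, the shapes, and the random (rather than learned) codebooks are all assumptions made for illustration. A feature vector is split into M sub-vectors, and each sub-vector is snapped to its nearest codeword in its own codebook, so the full code is a tuple of M indices drawn from a Cartesian product of M small sub-spaces.

```python
import numpy as np

def product_quantize(x, codebooks):
    """Quantize x (shape [D]) with M codebooks stacked as [M, K, D // M].

    Hypothetical helper for illustration only. Returns the per-sub-space
    code indices (shape [M]) and the quantized vector (shape [D]).
    """
    num_subspaces, num_codes, sub_dim = codebooks.shape
    assert x.shape == (num_subspaces * sub_dim,)
    # Split the feature into M contiguous sub-vectors, one per sub-space.
    sub_vectors = x.reshape(num_subspaces, sub_dim)
    # Squared Euclidean distance from each sub-vector to every codeword
    # in the codebook of its own sub-space: shape [M, K].
    dists = ((sub_vectors[:, None, :] - codebooks) ** 2).sum(axis=-1)
    # Nearest codeword index per sub-space.
    codes = dists.argmin(axis=1)
    # Look up the chosen codewords and concatenate them back to length D.
    quantized = codebooks[np.arange(num_subspaces), codes].reshape(-1)
    return codes, quantized

# Assumed toy sizes: M sub-spaces, K codewords each, feature dimension D.
rng = np.random.default_rng(0)
M, K, D = 4, 8, 32
codebooks = rng.normal(size=(M, K, D // M))  # random stand-in for learned codebooks
audio_feature = rng.normal(size=D)           # stand-in for a multi-source audio embedding
codes, quantized = product_quantize(audio_feature, codebooks)
```

With these toy sizes, M = 4 codebooks of K = 8 codewords can represent 8^4 = 4096 joint combinations while storing only 32 codewords, which is the practical appeal of factorizing the space as a Cartesian product.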
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Image and Signal Denoising Methods
