QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Xiang Li; Jinglu Wang; Xiaohao Xu; Xiulian Peng; Rita Singh; Yan Lu; Bhiksha Raj

arXiv:2310.00132 · cs.CV · April 22, 2024 · 1 citation

TL;DR

QDFormer introduces a quantization-based semantic decomposition approach to improve audiovisual segmentation robustness in complex environments by disentangling multi-source audio semantics and distilling stable global features.

Contribution

The paper proposes a novel semantic decomposition method using product quantization and a global-to-local mechanism to enhance AVS performance in challenging scenarios.

Findings

1. Reported a +21.2% mIoU gain on the AVS-Semantic benchmark.

2. Improved robustness in complex audiovisual environments with multiple sound sources and background disturbances.

3. Demonstrated that semantic decomposition of audio is effective for AVS tasks.
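
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of mean intersection-over-union (mIoU) on a single label map. The function name `mean_iou`, the toy arrays, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions; the benchmark's exact evaluation protocol may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration; not the benchmark's official scorer.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x3 label maps with three classes (made-up values).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/2, 3/4, and 1, so the mean is 0.75.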

Abstract

Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence of distinct sound events. Assuming sound events occur independently, the multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces. We are motivated to decompose the multi-source audio semantics into single-source semantics for more effective interactions with visual content. We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and…
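
The Cartesian-product idea in the abstract can be illustrated with generic product quantization. The sketch below is not the authors' implementation: the codebooks are random rather than learned, and the names `product_quantize`, `codebooks`, and the sizes `M`, `K`, `D` are assumptions chosen for illustration. It splits one feature vector into `M` sub-vectors and snaps each to the nearest codeword in its own codebook, so `M` small codebooks jointly index `K ** M` combinations.

```python
import numpy as np

def product_quantize(feature, codebooks):
    """Quantize one feature vector with one codebook per sub-space.

    feature:   array of shape (D,)
    codebooks: array of shape (M, K, D // M)
    Returns (indices of shape (M,), reconstruction of shape (D,)).
    Hypothetical helper for illustration only.
    """
    num_subspaces, _, sub_dim = codebooks.shape
    assert feature.shape == (num_subspaces * sub_dim,)
    sub_vectors = feature.reshape(num_subspaces, sub_dim)
    # Squared distance from each sub-vector to every codeword of its own codebook.
    dists = ((codebooks - sub_vectors[:, None, :]) ** 2).sum(axis=-1)  # (M, K)
    indices = dists.argmin(axis=1)
    quantized = codebooks[np.arange(num_subspaces), indices]  # (M, sub_dim)
    return indices, quantized.reshape(-1)

rng = np.random.default_rng(0)
M, K, D = 4, 8, 32                            # assumed sizes, not from the paper
codebooks = rng.normal(size=(M, K, D // M))   # random stand-in for learned codebooks
audio_feature = rng.normal(size=D)            # stand-in for a multi-source audio feature

indices, reconstruction = product_quantize(audio_feature, codebooks)
num_joint_codes = K ** M                      # size of the Cartesian product space
```

Each entry of `indices` selects one codeword per sub-space, which loosely mirrors representing a multi-source signal as a combination of single-source components. How the paper learns the codebooks and ties them to sound sources is not covered by this sketch.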

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Image and Signal Denoising Methods