CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao

TL;DR
This paper introduces CATR, a novel audio-visual transformer that captures combined spatial-temporal dependencies and uses audio-constrained queries to improve pixel-level segmentation of sound-producing objects, achieving state-of-the-art results.
Contribution
The paper proposes a decoupled audio-video transformer with a memory-efficient block and audio-constrained queries, enhancing audio-visual dependence modeling and segmentation accuracy.
Findings
Achieves new SOTA performance on three datasets.
Effectively models combined audio-visual dependencies.
Improves segmentation accuracy with audio-constrained queries.
Abstract
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation
