CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Kexin Li; Zongxin Yang; Lei Chen; Yi Yang; Jun Xiao

arXiv:2309.09709·cs.CV·September 21, 2023

TL;DR

This paper introduces CATR, a novel audio-visual transformer that captures combined spatial-temporal dependencies and uses audio-constrained queries to improve pixel-level segmentation of sound-producing objects, achieving state-of-the-art results.

Contribution

The paper proposes a decoupled audio-video transformer with a memory-efficient block and audio-constrained queries, enhancing audio-visual dependence modeling and segmentation accuracy.

Findings

01. Achieves new state-of-the-art (SOTA) performance on three datasets.

02. Effectively models combined audio-visual dependencies.

03. Improves segmentation accuracy with audio-constrained queries.

Abstract

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a…

Code & Models

Repositories

aspirinone/catr.github.io (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation