TransAVS: End-to-End Audio-Visual Segmentation with Transformer

Yuhang Ling; Yuxi Li; Zhenye Gan; Jiangning Zhang; Mingmin Chi; Yabiao Wang

arXiv:2305.07223·cs.SD·December 27, 2023·1 cites

Open Access

TL;DR

TransAVS is a Transformer-based end-to-end framework for audio-visual segmentation. It disentangles the audio stream into audio queries and uses self-supervised losses to improve segmentation accuracy.

Contribution

TransAVS is the first Transformer-based end-to-end framework for AVS. It explicitly disentangles the audio stream into audio queries and employs self-supervised losses to better distinguish between objects.

Findings

01 Achieves state-of-the-art results on the AVSBench dataset.

02 Effectively disentangles audio signals for clearer segmentation.

03 Improves distinction between similar-sounding objects.

Abstract

Audio-Visual Segmentation (AVS) is a challenging task that aims to segment sounding objects in video frames by exploring audio signals. Generally, AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio signals, making it difficult to distinguish between them and thus leading to unclear segmentation results. To this end, we propose TransAVS, the first Transformer-based end-to-end framework for the AVS task. Specifically, TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks with a full Transformer architecture. This scheme not only promotes comprehensive audio-image communication but also explicitly excavates instance cues…
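
The query-based decoding described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify layers, dimensions, or losses, so the single-head attention without learned projections, the sigmoid mask readout, the function names (`cross_attend`, `decode_masks`), and all numbers below are assumptions chosen only to show how a set of audio queries could attend to image features and each yield its own mask.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, pixels):
    """Each query attends over all pixel features (single head, no learned
    projections; an illustrative simplification, not the paper's decoder)."""
    scale = math.sqrt(len(pixels[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in pixels])
        context = [sum(w * p[d] for w, p in zip(weights, pixels))
                   for d in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])  # residual add
    return updated

def decode_masks(queries, pixels, threshold=0.5):
    """One binary mask per query: sigmoid(query . pixel) > threshold."""
    return [[1 if 1.0 / (1.0 + math.exp(-dot(q, p))) > threshold else 0
             for p in pixels]
            for q in queries]

# Hypothetical data: four pixels with 2-D features. Pixels 0 and 1 belong to
# one sounding object, pixel 2 to another, pixel 3 to silent background.
pixels = [[3.0, -3.0], [3.0, -3.0], [-3.0, 3.0], [-3.0, -3.0]]
# Two hypothetical audio queries, standing in for two disentangled sources.
audio_queries = [[1.0, 0.0], [0.0, 1.0]]

masks = decode_masks(cross_attend(audio_queries, pixels), pixels)
print(masks)
```

In this toy setup each query ends up selecting a different group of pixels and neither selects the background pixel, which is the intuition behind treating disentangled audio components as separate queries.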

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation