TransAVS: End-to-End Audio-Visual Segmentation with Transformer
Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, Yabiao Wang

TL;DR
TransAVS introduces a transformer-based end-to-end framework for audio-visual segmentation, effectively disentangling audio signals and improving segmentation accuracy by leveraging self-supervised learning.
Contribution
TransAVS is the first Transformer-based end-to-end framework for AVS. It explicitly disentangles the audio stream into audio queries and employs self-supervised losses to better distinguish between objects.
Findings
Achieves state-of-the-art results on the AVSBench dataset.
Effectively disentangles audio signals for clearer segmentation.
Improves distinction between similar-sounding objects.
Abstract
Audio-Visual Segmentation (AVS) is a challenging task that aims to segment sounding objects in video frames by exploring audio signals. Generally, AVS faces two key challenges: (1) audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) objects of the same category tend to produce similar audio signals, making it difficult to distinguish between them and thus leading to unclear segmentation results. Toward this end, we propose TransAVS, the first Transformer-based end-to-end framework for the AVS task. Specifically, TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks with full transformer architectures. This scheme not only promotes comprehensive audio-image communication but also explicitly excavates instance cues…
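The query-based decoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, no learned projection weights, and random inputs, and every name in it (`cross_attention`, `decode_masks`, the tensor shapes) is hypothetical. It only shows the general pattern in which a small set of audio queries attends over per-pixel image features and each updated query is then compared with every pixel to produce one mask.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, pixels):
    """Single-head cross-attention (illustrative, no learned weights).

    queries: (Q, d) audio queries; pixels: (N, d) flattened image features.
    Each query attends over all pixels and is updated with a residual sum.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ pixels.T / np.sqrt(d), axis=-1)  # (Q, N)
    return queries + weights @ pixels  # (Q, d)

def decode_masks(audio_queries, image_features):
    """Turn each audio query into one mask of per-pixel probabilities.

    audio_queries: (Q, d); image_features: (H, W, d). Returns (Q, H, W).
    """
    h, w, d = image_features.shape
    pixels = image_features.reshape(h * w, d)
    updated = cross_attention(audio_queries, pixels)
    logits = updated @ pixels.T  # query-pixel similarity, (Q, H*W)
    return 1.0 / (1.0 + np.exp(-logits.reshape(-1, h, w)))  # sigmoid

# Assumed toy sizes: 3 audio queries, an 8x8 feature map, 16 channels.
rng = np.random.default_rng(0)
audio_queries = rng.normal(size=(3, 16))
image_features = rng.normal(size=(8, 8, 16))
masks = decode_masks(audio_queries, image_features)
print(masks.shape)
```

Under these assumptions, each query yields its own mask, which is the sense in which separate queries could carry separate sounding objects. How TransAVS actually derives the queries from audio, and how its self-supervised losses are defined, is not specified in the text above and is not modeled here.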
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
