Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Shaofei Huang; Rui Ling; Tianrui Hui; Hongyu Li; Xu Zhou; Shifeng Zhang; Si Liu; Richang Hong; Meng Wang

arXiv:2506.23623 · cs.CV · July 1, 2025

Open Access

TL;DR

This paper introduces a vision-centric Transformer framework for audio-visual segmentation that improves object distinction and contour accuracy by leveraging vision-derived queries and prototype prompting, achieving state-of-the-art results.

Contribution

The paper proposes a novel vision-centric Transformer with a prototype prompted query generation module for enhanced audio-visual segmentation.

Findings

01. Achieves new state-of-the-art performance on AVSBench datasets.

02. Effectively distinguishes sound-producing objects from mixed audio.

03. Improves contour delineation accuracy in video segmentation.

Abstract

Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. We also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate…
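The idea of vision-derived queries that iteratively fetch audio and visual information can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`cross_attend`, `vision_centric_decode`), the pooling-based query initialisation, the single-head attention without learned projections, and the dot-product mask logits are all assumptions made for illustration. The PPQG module is not modelled, since the abstract is cut off before describing it.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Single-head scaled dot-product cross-attention with a residual update.

    Assumption: no learned projections, purely to show the data flow.
    """
    d = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in memory]
        weights = softmax(scores)
        context = [sum(w * m[j] for w, m in zip(weights, memory)) for j in range(d)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def vision_centric_decode(visual_tokens, audio_tokens, num_queries=2, num_layers=3):
    """Toy vision-centric decoding loop (hypothetical, not the authors' code)."""
    assert len(visual_tokens) >= num_queries
    # Assumed query initialisation: average-pool contiguous groups of visual tokens,
    # so queries start from visual content rather than from audio features.
    group = len(visual_tokens) // num_queries
    queries = []
    for k in range(num_queries):
        chunk = visual_tokens[k * group:(k + 1) * group]
        queries.append([sum(col) / len(chunk) for col in zip(*chunk)])
    # Each layer lets the queries fetch audio information, then visual information.
    for _ in range(num_layers):
        queries = cross_attend(queries, audio_tokens)
        queries = cross_attend(queries, visual_tokens)
    # Assumed mask head: one logit per (query, visual token) pair via dot product.
    mask_logits = [
        [sum(a * b for a, b in zip(q, v)) for v in visual_tokens] for q in queries
    ]
    return queries, mask_logits
```

Here `visual_tokens` and `audio_tokens` are lists of equal-length feature vectors, for example flattened frame patches and audio frames after some encoder. An audio-centric variant would instead initialise `queries` from `audio_tokens`, which is the design the abstract argues against.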

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization