How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Yujian Lee; Peng Gao; Yongqi Xu; Wentao Fan

arXiv:2601.08133 · cs.CV · March 3, 2026

Open Access

TL;DR

This paper introduces SSP, a novel framework that combines optical flow and textual prompts to improve audio-visual semantic segmentation by capturing motion and scene context, outperforming existing methods.

Contribution

The paper proposes a new collaborative framework, SSP, integrating optical flow and textual prompts with a visual-textual alignment module for enhanced AVSS performance.

Findings

1. SSP outperforms existing AVS methods in segmentation accuracy.
2. Optical flow captures motion dynamics for moving objects.
3. Textual prompts help identify stationary sound sources.
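
To make the second finding concrete, here is a minimal sketch of how a motion-based pre-mask could be derived from an optical flow field by thresholding the flow magnitude. This is an illustration under stated assumptions, not the SSP method: the abstract does not say how the pre-mask is computed, and the function name `motion_pre_mask`, the nested-list input format, and the threshold value are all hypothetical.

```python
import math


def motion_pre_mask(flow, threshold=1.0):
    """Mark pixels whose optical-flow magnitude exceeds a threshold.

    flow: H x W grid of (dx, dy) displacement pairs (hypothetical format).
    threshold: magnitude cutoff in pixels (illustrative, not from the paper).
    Returns an H x W grid of 0/1 values.
    """
    return [
        [1 if math.hypot(dx, dy) > threshold else 0 for (dx, dy) in row]
        for row in flow
    ]


# Toy 2x3 flow field in which only the middle column moves noticeably.
flow = [
    [(0.0, 0.0), (3.0, 4.0), (0.1, 0.0)],
    [(0.0, 0.2), (1.5, 2.0), (0.0, 0.0)],
]
mask = motion_pre_mask(flow, threshold=1.0)
print(mask)
```

A mask like this only flags moving regions, which is consistent with the third finding that a different cue (textual prompts) is needed for stationary sound sources.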

Abstract

Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task: beyond identifying sound-emitting objects at the pixel level, it requires a semantic understanding of audio-visual scenes. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach builds on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address…

Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing