How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
Yujian Lee, Peng Gao, Yongqi Xu, Wentao Fan

TL;DR
This paper introduces SSP, a novel framework that combines optical flow and textual prompts to improve audio-visual semantic segmentation by capturing motion and scene context, outperforming existing methods.
Contribution
The paper proposes a new collaborative framework, SSP, integrating optical flow and textual prompts with a visual-textual alignment module for enhanced AVSS performance.
Findings
SSP outperforms existing AVS methods in segmentation accuracy.
Optical flow captures motion dynamics for moving objects.
Textual prompts help identify stationary sound sources.
Abstract
Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task, requiring a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach builds on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address…
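The idea of turning optical flow into a motion-based pre-mask can be illustrated with a minimal sketch. This is not the SSP implementation: the function name `motion_pre_mask`, the `rel_threshold` parameter, and the simple magnitude thresholding are all assumptions made for illustration, and the flow field is synthetic rather than estimated from real video frames.

```python
import numpy as np

def motion_pre_mask(flow, rel_threshold=0.5):
    """Return a boolean (H, W) mask of pixels with strong motion.

    flow: (H, W, 2) array of per-pixel displacements (dx, dy).
    rel_threshold: hypothetical parameter; fraction of the peak flow
    magnitude a pixel must reach to count as "moving".
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # per-pixel speed
    peak = magnitude.max()
    if peak == 0:
        # No motion anywhere: return an empty mask.
        return np.zeros(magnitude.shape, dtype=bool)
    return magnitude >= rel_threshold * peak

# Synthetic flow: a 2x3 patch moving with displacement (3, 4), static elsewhere.
flow = np.zeros((6, 8, 2), dtype=np.float32)
flow[2:4, 3:6] = [3.0, 4.0]

mask = motion_pre_mask(flow)
print(mask.astype(int))
```

In practice the flow field would come from an optical flow estimator, and a mask like this would only be a coarse prior for a downstream segmentation model. It also shows why motion alone is insufficient: a stationary sound source produces no flow and would be missed.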
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
