SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
Yuji Wang, Haoran Xu, Yong Liu, Jiaze Li, Yansong Tang

TL;DR
SAM2-LOVE is a multimodal framework that integrates text, audio, and visual representations for pixel-wise scene understanding in language-aided audio-visual scenes. It targets two weaknesses of earlier work: dual-modality methods that lack a third modality, and a triple-modality method with weak spatio-temporal consistency.
Contribution
The paper presents SAM2-LOVE, a framework that fuses textual, audio, and visual representations into a learnable token used to prompt SAM2. It improves spatio-temporal consistency and outperforms state-of-the-art methods on Ref-AVS.
Findings
Outperforms SOTA by 8.5% in J&F on the Ref-AVS benchmark
Enhances spatio-temporal consistency without losing historical information
Demonstrates effectiveness of multimodal fusion and token strategies
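To make the J&F figure above concrete, here is a minimal sketch of how such a score is typically assembled. It assumes J&F denotes the mean of the Jaccard index J (intersection over union of predicted and ground-truth masks) and an F-score F, as the notation is commonly used in video segmentation benchmarks; the listing itself does not define it. The function names `jaccard` and `j_and_f` are hypothetical, and F is taken as an already-computed input because its exact definition is not given here.

```python
def jaccard(pred, gt):
    """Jaccard index J: intersection over union of two binary masks.

    Masks are nested lists of 0/1 values with identical shapes.
    """
    pairs = [(p, g) for pred_row, gt_row in zip(pred, gt)
             for p, g in zip(pred_row, gt_row)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """Assumed definition: J&F is the arithmetic mean of J and F."""
    return (j + f) / 2.0


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = jaccard(pred, gt)      # 1 overlapping pixel, 2 pixels in the union
score = j_and_f(j, 0.7)    # 0.7 is a placeholder F value, not a reported result
```

Under this reading, a gain "in J&F" is a gain in the averaged score, so it can come from better region overlap, a better F-score, or both.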
Abstract
Reference Audio-Visual Segmentation (Ref-AVS) aims to provide pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality, and the existing triple-modality method struggles with spatio-temporal consistency, leading to target shift across frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without…
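The abstract's three ideas (fusing modalities into one token, propagating that token across frames, and accumulating past tokens) can be illustrated with a toy sketch. This is not the authors' implementation: the single-query attention used for fusion, the residual update used for propagation, the running mean used for accumulation, and the names `attend` and `run_video` are all assumptions chosen for illustration. The step where the token prompts SAM2 is omitted entirely.

```python
import math


def attend(token, features):
    """Hypothetical fusion step: one query token attends over modality features."""
    scale = math.sqrt(len(token))
    scores = [sum(t * f for t, f in zip(token, feat)) / scale for feat in features]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the feature vectors, one output dimension at a time.
    return [sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(len(token))]


def run_video(init_token, text_feat, audio_feats, visual_feats):
    """Return one prompt vector per frame (all design choices are assumptions)."""
    token = list(init_token)
    accumulated = [0.0] * len(token)
    prompts = []
    for t, (audio, visual) in enumerate(zip(audio_feats, visual_feats), start=1):
        fused = attend(token, [text_feat, audio, visual])
        # Propagation (assumed): the updated token carries over to the next frame.
        token = [x + y for x, y in zip(token, fused)]
        # Accumulation (assumed): running mean over every token seen so far,
        # so earlier frames keep contributing to later prompts.
        accumulated = [acc + (x - acc) / t for acc, x in zip(accumulated, token)]
        prompts.append(list(accumulated))
    return prompts


prompts = run_video(
    init_token=[0.0, 0.0],
    text_feat=[1.0, 0.0],
    audio_feats=[[0.0, 1.0], [0.0, 1.0]],
    visual_feats=[[1.0, 1.0], [1.0, 0.0]],
)
```

The point of the sketch is the data flow: each frame's prompt depends on the current audio and visual features, on the text, and on everything the token has absorbed from earlier frames, which is the property the abstract associates with spatio-temporal consistency.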
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: ALIGN
