SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes

Yuji Wang; Haoran Xu; Yong Liu; Jiaze Li; Yansong Tang

arXiv:2506.01558·cs.CV·June 3, 2025

TL;DR

SAM2-LOVE is a multimodal framework that integrates text, audio, and visual data to improve pixel-wise scene understanding in language-aided audio-visual segmentation. It addresses two limitations of earlier work: dual-modality methods lack a third modality, and the existing triple-modality method struggles with spatio-temporal consistency.

Contribution

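The paper presents SAM2-LOVE, a framework that fuses textual, audio, and visual representations into a learnable token used to prompt and align SAM2. Token propagation and accumulation strategies improve spatio-temporal consistency, and the method outperforms state-of-the-art methods on the Ref-AVS task.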

Findings

01. Outperforms SOTA by 8.5% in J&F on the Ref-AVS benchmark

02. Enhances spatio-temporal consistency without losing historical information

03. Demonstrates the effectiveness of multimodal fusion and token strategies
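For readers unfamiliar with the J part of the J&F score cited above, here is a minimal sketch of region similarity J, the Jaccard index (intersection over union) between a predicted and a ground-truth binary mask, averaged over frames. J&F is usually reported as the mean of J and an F-score; the exact F definition belongs to the benchmark and is not reproduced here. The function names and the empty-mask convention are assumptions made for illustration, not taken from the paper.

```python
def jaccard(pred, target):
    """Region similarity J: intersection over union of two binary masks.

    Masks are nested lists of 0/1 values with the same shape.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(pred_frames, target_frames):
    """Average J over the frames of one video."""
    scores = [jaccard(p, t) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(jaccard(pred, target))  # 0.5
```

A per-frame average like this penalises a model whose mask drifts to a different object partway through a video, which is the target-shift failure the paper sets out to reduce.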

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) aims to provide a pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality, and the existing triple-modality method struggles with spatio-temporal consistency, leading to target shift across frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving the multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without…
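The general idea of fusing three modalities into one token and carrying it across frames can be illustrated with a toy sketch. This is not the SAM2-LOVE architecture: the abstract gives no implementation details, so the single-head dot-product attention, the moving-average accumulation, and every name below (`attend`, `run_video`, `momentum`) are hypothetical choices made only to show the shape of the computation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(token, features):
    """Single-head dot-product attention (assumed fusion rule).

    The token is the query; each feature vector is both key and value.
    """
    scale = math.sqrt(len(token))
    scores = [sum(q * k for q, k in zip(token, f)) / scale for f in features]
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(len(token))]


def run_video(init_token, frames, momentum=0.5):
    """Fuse per-frame text/audio/visual features into one token per frame.

    frames: list of dicts with 'text', 'audio', 'visual' lists of vectors.
    """
    token = list(init_token)
    history = list(init_token)  # running summary of earlier tokens
    tokens = []
    for frame in frames:
        features = frame["text"] + frame["audio"] + frame["visual"]
        fused = attend(token, features)
        # Accumulation (assumed): blend the new token into the history
        # instead of overwriting it, so earlier frames still contribute.
        history = [momentum * h + (1 - momentum) * f
                   for h, f in zip(history, fused)]
        # Propagation (assumed): the next frame's query starts from history.
        token = history
        tokens.append(list(history))
    return tokens


frame = {"text": [[1.0, 0.0]], "audio": [[1.0, 0.0]], "visual": [[1.0, 0.0]]}
tokens = run_video([0.0, 0.0], [frame, frame])
```

In a real system each resulting token would be passed to the segmentation model as a prompt for its frame; here the tokens are simply returned so the propagation and accumulation steps can be inspected.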

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: ALIGN