Audio Visual Segmentation Through Text Embeddings

Kyungbok Lee; You Zhang; Zhiyao Duan

arXiv:2502.16359 · cs.CV · May 30, 2025

TL;DR

This paper introduces AV2T-SAM, a framework for audio-visual segmentation that maps audio features into the text embedding space of a pre-trained, text-prompted Segment Anything Model (SAM), so that audio can prompt the model to segment sounding objects.

Contribution

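The paper proposes AV2T-SAM, which bridges audio features with the text embedding space of a pre-trained text-prompted SAM. The aim is to learn audio-visual correspondence despite the scarcity of annotated AVS data, rather than relying on the pre-trained model for visual understanding alone.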
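
The bridging idea can be illustrated with a toy sketch. This is not the authors' implementation: the single linear projector, the dimensions, and every name used here (`linear_project`, `pseudo_text_prompt`, `AUDIO_DIM`, `TEXT_DIM`) are hypothetical, and the random vectors merely stand in for real audio-encoder and text-encoder outputs. It only shows the general pattern of mapping an audio feature into a vector of text-embedding size and measuring how close it sits to a text embedding.

```python
import math
import random


def linear_project(audio_feat, weight, bias):
    """Map an audio feature vector to the (assumed) text-embedding dimension."""
    return [
        sum(w * x for w, x in zip(row, audio_feat)) + b
        for row, b in zip(weight, bias)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


rng = random.Random(0)
AUDIO_DIM, TEXT_DIM = 8, 4  # hypothetical sizes, chosen only for the demo

# Hypothetical projector parameters; in practice these would be learned.
weight = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

# Random stand-ins for an audio-encoder output and a text-encoder output.
audio_feat = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
text_emb = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]

# The projected audio feature now has the shape a text prompt would have.
pseudo_text_prompt = linear_project(audio_feat, weight, bias)
alignment = cosine(pseudo_text_prompt, text_emb)
print(len(pseudo_text_prompt), alignment)
```

In a trained system the projector would be optimized so that the projected audio feature lands near the text embedding of the sounding object's description; here the weights are random, so the similarity value carries no meaning beyond demonstrating the computation.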

Findings

01. Outperforms existing methods on the AVSBench dataset.

02. Makes effective use of pre-trained segmentation models and cross-modal semantic alignment.

03. Introduces a new feature that emphasizes the shared semantics of the audio and visual modalities.

Abstract

The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model, Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing knowledge of pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose AV2T-SAM, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from…

Code & Models

Repositories

bok-bok/av2t-sam
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis

Methods: Segment Anything Model