AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu; Yuanhong Chen; Chong Wang; Junlin Han; Junde Wu; Can Peng; Jingkun Chen; Yu Tian; Gustavo Carneiro

arXiv:2506.01015 · cs.CV · May 15, 2026

TL;DR

AuralSAM2 adds audio cues to SAM2's promptable video segmentation through a fusion module, AuralFuser, improving accuracy while preserving interactive efficiency.

Contribution

The paper introduces AuralSAM2, an approach that fuses audio and visual features for SAM2, together with a novel prompt-generation scheme and an audio-guided contrastive loss.

Findings

01 Achieves notable accuracy improvements on public benchmarks.

02 Maintains interactive efficiency of promptable segmentation.

03 Effectively propagates auditory cues across visual layers.

Abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's…
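
To make the idea of fusing audio and visual features into sparse and dense prompts more concrete, here is a minimal NumPy sketch. It is not the paper's AuralFuser: the function name `fuse_audio_visual`, the single cross-attention step, the randomly initialised projection matrices, and the pooled query tokens are all assumptions chosen for illustration. It only shows one plausible way that audio tokens could condition a visual feature map (a dense prompt) and be summarised into a few prompt tokens (a sparse prompt).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_audio_visual(visual, audio, num_sparse=2, seed=0):
    """Hypothetical audio-visual fusion (illustrative only).

    visual: (H, W, C) visual feature map.
    audio:  (T, C) audio tokens with the same channel width C.
    Returns a dense prompt of shape (H, W, C) and a sparse prompt
    of shape (num_sparse, C).
    """
    h, w, c = visual.shape
    rng = np.random.default_rng(seed)

    # Stand-ins for learnable projections; random here, trained in practice.
    w_q = rng.standard_normal((c, c)) / np.sqrt(c)
    w_k = rng.standard_normal((c, c)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)

    pixels = visual.reshape(h * w, c)

    # Cross-attention: each pixel (query) attends over the audio tokens (keys).
    scores = (pixels @ w_q) @ (audio @ w_k).T / np.sqrt(c)  # (H*W, T)
    attn = softmax(scores, axis=-1)

    # Residual update: audio-conditioned features act as the dense prompt.
    dense = pixels + attn @ (audio @ w_v)  # (H*W, C)

    # Stand-in query tokens pool the fused map into a few sparse prompt tokens.
    queries = rng.standard_normal((num_sparse, c))
    pool = softmax(queries @ dense.T / np.sqrt(c), axis=-1)  # (num_sparse, H*W)
    sparse = pool @ dense  # (num_sparse, C)

    return dense.reshape(h, w, c), sparse


if __name__ == "__main__":
    visual = np.random.default_rng(1).standard_normal((4, 5, 8))
    audio = np.random.default_rng(2).standard_normal((3, 8))
    dense, sparse = fuse_audio_visual(visual, audio)
    print(dense.shape, sparse.shape)
```

In this sketch the dense prompt keeps the spatial layout of the visual features, while the sparse prompt is a small set of tokens, loosely mirroring the sparse/dense distinction named in the abstract. How the actual model builds and consumes these prompts is described in the paper and repository, not here.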

Code & Models

Repositories

yyliu01/AuralSAM2 (GitHub)

Models

yyliu01/AuralSAM2 (Hugging Face model)
