TL;DR
AuralSAM2 enhances SAM2's promptable segmentation by integrating audio cues through a novel fusion method, improving accuracy while maintaining efficiency in video analysis.
Contribution
The paper introduces AuralSAM2, which integrates audio into SAM2 through AuralFuser, a module that fuses audio and visual features to generate sparse and dense prompts, together with an audio-guided contrastive loss.
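To make the idea of a contrastive loss between audio and visual features concrete, here is a minimal sketch of a generic InfoNCE loss in NumPy. It is not the paper's audio-guided loss, whose exact form is not given in this summary. The function name `info_nce`, the temperature of 0.07, and the toy embeddings are all illustrative assumptions.

```python
import numpy as np

def info_nce(audio, visual, temperature=0.07):
    """Generic InfoNCE (assumed form, not the paper's loss).

    Row i of `audio` is treated as the positive match for row i of `visual`;
    all other rows in the batch act as negatives.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature                        # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the matched (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))

# Toy embeddings (hypothetical): 4 clips, 16-dimensional features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
aligned = info_nce(audio, audio + 0.01 * rng.normal(size=(4, 16)))
mismatched = info_nce(audio, np.roll(audio, 1, axis=0))
```

In this toy setup, `aligned` pairs each audio embedding with a near-copy of itself, while `mismatched` pairs it with a different clip's embedding, so the loss is lower for the aligned case. A loss of this kind rewards visual features that agree with their paired audio embedding.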
Findings
Achieves notable accuracy improvements on public benchmarks.
Maintains interactive efficiency of promptable segmentation.
Effectively propagates auditory cues across visual layers.
Abstract
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's…
