SAM Audio: Segment Anything in Audio
Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

TL;DR
SAM Audio is a versatile foundation model for general audio source separation that supports multiple prompting modalities and achieves state-of-the-art results across diverse audio benchmarks.
Contribution
It introduces a unified framework for audio separation using text, visual, and temporal prompts, trained on large-scale data with a diffusion transformer architecture.
Findings
Achieves state-of-the-art performance on multiple audio separation benchmarks.
Supports flexible prompting modalities including language, visual masks, and temporal spans.
Introduces a new real-world benchmark with human-labeled multimodal prompts.
Abstract
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound,…
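The abstract says the model is trained with flow matching but does not spell out the objective. As a rough illustration only, the sketch below shows a generic conditional flow matching loss and an Euler sampler in NumPy. It assumes a linear interpolation path between Gaussian noise and the target, which is a common choice and not something the abstract confirms. The names `velocity_fn` and `cond` are hypothetical stand-ins for the diffusion transformer and for its conditioning (mixture audio plus any text, visual, or span prompt); none of this is the authors' implementation.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 and its velocity x1 - x0."""
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over non-batch dims
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity at random times."""
    x0 = rng.standard_normal(x1.shape)           # noise sample
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])  # one time per batch item
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = velocity_fn(x_t, t, cond)           # hypothetical model call
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, cond, shape, rng, steps=16):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t = 0 (noise) to t = 1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In a real system `velocity_fn` would be a trained network and `x1` a latent or waveform representation of the target source; here both are left abstract so the objective itself stays visible.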
Code & Models
- 🤗 facebook/sam-audio-large · model · 22k dl · ♡ 385
- 🤗 facebook/sam-audio-judge · model · 50k dl · ♡ 29
- 🤗 facebook/sam-audio-small · model · 6.2k dl · ♡ 80
- 🤗 facebook/sam-audio-base · model · 2.9k dl · ♡ 48
- 🤗 facebook/sam-audio-large-tv · model · 453 dl · ♡ 24
- 🤗 facebook/sam-audio-base-tv · model · 187 dl · ♡ 10
- 🤗 facebook/sam-audio-small-tv · model · 97 dl · ♡ 11
- 🤗 jetjodh/sam-audio-judge · model · 4 dl
- 🤗 Creador301/sam-audio-large · model · 13 dl
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
