SAM Audio: Segment Anything in Audio

Bowen Shi; Andros Tjandra; John Hoffman; Helin Wang; Yi-Chiao Wu; Luya Gao; Julius Richter; Matt Le; Apoorv Vyas; Sanyuan Chen; Christoph Feichtenhofer; Piotr Dollár; Wei-Ning Hsu; Ann Lee

arXiv:2512.18099 · eess.AS · December 24, 2025

TL;DR

SAM Audio is a versatile foundation model for general audio source separation that supports multiple prompting modalities and achieves state-of-the-art results across diverse audio benchmarks.

Contribution

It introduces a unified framework for audio separation using text, visual, and temporal prompts, trained on large-scale data with a diffusion transformer architecture.

Findings

1. Achieves state-of-the-art performance on multiple audio separation benchmarks.
2. Supports flexible prompting modalities including language, visual masks, and temporal spans.
3. Introduces a new real-world benchmark with human-labeled multimodal prompts.

Abstract

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound,…
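
The abstract names flow matching as the training objective but does not spell it out. As a rough illustration only, the sketch below shows the common linear-path form of flow matching in NumPy: a noise sample is interpolated toward the clean target, and a model is trained to regress the constant velocity between them. The function names (`flow_matching_pair`, `flow_matching_loss`, `predict_velocity`), the straight-line path, and the omission of prompt conditioning are all assumptions made for this example; the paper's actual formulation may differ.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one batch of training inputs for linear-path flow matching.

    x1  : clean target features, shape (batch, dim). Hypothetical stand-in
          for whatever representation of the target source the model uses.
    rng : a numpy.random.Generator.
    Returns (x_t, t, v_target).
    """
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))    # one time in [0, 1) per example
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                        # velocity along that path
    return x_t, t, v_target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t) is a placeholder for the network; a real
    separation model would also receive the mixture and the prompt.
    """
    x_t, t, v_target = flow_matching_pair(x1, rng)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

In this formulation, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on `x1`, which is why sampling reduces to integrating the learned velocity field from noise to data.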

Code & Models

Datasets

facebook/sam-audio-bench
dataset · 156 dl

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis