USAD: Universal Speech and Audio Representation via Distillation

Heng-Jui Chang; Saurabhchand Bhati; James Glass; Alexander H. Liu

arXiv:2506.18843·cs.SD·August 19, 2025



TL;DR

USAD introduces a unified audio representation model trained via distillation from domain-specific SSL models, effectively handling speech, sound, and music with competitive performance across multiple benchmarks.

Contribution

USAD unifies diverse audio domains (speech, sound, and music) in a single SSL-based model, trained with layer-to-layer distillation from domain-specific SSL models so that one encoder can serve tasks that previously required separate models.

Findings

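01 Achieves near state-of-the-art results on the SUPERB and HEAR benchmarks with a single encoder.

02 Delivers competitive performance across speech, sound, and music tasks, including frame- and instance-level speech processing, audio tagging, and sound classification.

03 Uses efficient layer-to-layer distillation from multiple domain-specific SSL models.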

Abstract

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
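
To make the idea of layer-to-layer distillation from several teachers concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The loss form (L1 distance plus cosine distance), the function names, the teacher names, and the `layer_map` pairing are all assumptions made for illustration; the abstract does not specify them. The sketch also omits the projection heads a real student would likely need to match each teacher's feature dimension.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def layer_to_layer_loss(student_layers, teachers, layer_map):
    """Average per-frame distance between paired student and teacher layers.

    student_layers: list of layers, each a list of frame vectors.
    teachers: dict mapping a (hypothetical) teacher name to its layers,
              in the same layout as student_layers.
    layer_map: list of (student_layer_index, teacher_layer_index) pairs.
    The L1 + cosine objective is an assumption, not taken from the paper.
    """
    total, count = 0.0, 0
    for teacher_layers in teachers.values():
        for s_idx, t_idx in layer_map:
            frames = zip(student_layers[s_idx], teacher_layers[t_idx])
            for s_frame, t_frame in frames:
                total += l1_distance(s_frame, t_frame)
                total += 1.0 - cosine_similarity(s_frame, t_frame)
                count += 1
    return total / count


# Toy features: 2 layers, 2 frames per layer, 2 dimensions per frame.
student = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
teachers = {
    "speech_teacher": [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]],
    "audio_teacher": [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]],
}
layer_map = [(0, 0), (1, 1)]  # pair student layer i with teacher layer i

loss = layer_to_layer_loss(student, teachers, layer_map)
print(f"distillation loss: {loss:.4f}")
```

In this toy setup the student already matches the speech teacher exactly, so the whole loss comes from the single frame where it disagrees with the audio teacher. In training, a loss of this kind would be minimized over the student's parameters so that one encoder tracks both teachers at once.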


Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis