USAD: Universal Speech and Audio Representation via Distillation
Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu

TL;DR
USAD introduces a unified audio representation model trained via distillation from domain-specific SSL models, effectively handling speech, sound, and music with competitive performance across multiple benchmarks.
Contribution
USAD is the first SSL-based model to unify speech, sound, and music in a single encoder; it is trained with layer-to-layer distillation, which improves versatility and efficiency.
Findings
USAD achieves near state-of-the-art results on the SUPERB and HEAR benchmarks with a single encoder.
It performs competitively across speech, sound, and music tasks.
It is trained by efficient layer-to-layer distillation from multiple domain-specific SSL models.
Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
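The layer-to-layer distillation mentioned in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the loss, the layer pairing, or the projection heads, so everything below is an assumption made for illustration: the per-frame objective (L1 distance plus one minus cosine similarity), the evenly spaced layer pairing in the hypothetical `map_layers` helper, and the simplification that student and teacher features already share one dimensionality (a real system would likely add a learned projection per teacher).

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Frames = List[Vector]      # one feature vector per time frame
LayerStack = List[Frames]  # one Frames entry per encoder layer


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def frame_loss(student: Vector, teacher: Vector) -> float:
    # Assumed objective: L1 distance plus (1 - cosine similarity).
    return l1_distance(student, teacher) + (1.0 - cosine_similarity(student, teacher))


def map_layers(num_student: int, num_teacher: int) -> List[int]:
    """Pair each student layer with an evenly spaced teacher layer (assumption)."""
    if num_student > num_teacher:
        raise ValueError("this sketch assumes the student is no deeper than the teacher")
    return [((i + 1) * num_teacher) // num_student - 1 for i in range(num_student)]


def distillation_loss(student_layers: LayerStack,
                      teachers: Dict[str, LayerStack]) -> float:
    """Average the frame loss over all teachers, paired layers, and frames."""
    total, count = 0.0, 0
    for teacher_layers in teachers.values():
        pairing = map_layers(len(student_layers), len(teacher_layers))
        for s_idx, t_idx in enumerate(pairing):
            for s_frame, t_frame in zip(student_layers[s_idx], teacher_layers[t_idx]):
                total += frame_loss(s_frame, t_frame)
                count += 1
    return total / count


# Toy usage: a 2-layer student distilled from two hypothetical 4-layer teachers.
student = [[[1.0, 0.0]], [[0.0, 1.0]]]
speech_teacher = [[[0.5, 0.5]], [[1.0, 0.0]], [[0.5, 0.5]], [[0.0, 1.0]]]
audio_teacher = [[[0.5, 0.5]], [[1.0, 0.0]], [[0.5, 0.5]], [[0.0, 1.0]]]
loss = distillation_loss(student, {"speech": speech_teacher, "audio": audio_teacher})
```

In this toy setup the student layers are paired with teacher layers 2 and 4 (indices 1 and 3), whose features match the student exactly, so the loss is zero. Averaging over teachers is one simple way to combine a speech teacher and a general-audio teacher; weighting them differently would be an equally plausible design choice.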
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
