TRILLsson: Distilled Universal Paralinguistic Speech Representations

Joel Shor; Subhashini Venugopalan

arXiv:2203.00236 · eess.AS · December 20, 2022 · 1 citation

Open Access

TL;DR

TRILLsson introduces a set of small, efficient, and high-performing paralinguistic speech models distilled from larger models, enabling deployment on resource-constrained devices while maintaining competitive accuracy.

Contribution

The paper publicly releases a collection of paralinguistic speech models, distilled from a larger model using public data only, that are significantly smaller than the original yet nearly as accurate.

Findings

01

The largest model is less than 15% the size of the original (314MB vs 2.2GB) and achieves over 96% of the original's accuracy on 6 of 7 tasks.

02

The smallest model is 1% the size of the original (22MB) and achieves over 90% of the original's accuracy on 6 of 7 tasks.

03

The distilled models outperform the open-source Wav2Vec 2.0 model on 6 of 7 tasks.

Abstract

Recent advances in self-supervision have dramatically improved the quality of speech representations. However, deployment of state-of-the-art embedding models on devices has been restricted due to their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic speech models that are small and achieve near state-of-the-art performance. Our approach is based on knowledge distillation, and our models are distilled on public data only. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model is less than 15% the size of the original model (314MB vs 2.2GB), achieves over 96% of the accuracy on 6 of 7 tasks, and is trained on 6.5% of the data. The smallest model is 1% in size (22MB) and achieves over 90% of the accuracy on 6 of 7 tasks. Our…
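The abstract names knowledge distillation as the approach but does not give the training objective or the architectures. The following is a minimal toy sketch of the general idea only, not the authors' method: a small student is trained to reproduce a fixed teacher's embeddings on unlabeled inputs. The mean-squared-error objective, the linear student, the linear stand-in teacher, and every name (`distill`, `toy_teacher`, `apply_linear`, `mse`) are assumptions made for illustration.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def apply_linear(weights, features):
    """Multiply an (out_dim x in_dim) weight matrix by a feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def distill(teacher, inputs, in_dim, out_dim, lr=0.05, epochs=200, seed=0):
    """Fit a linear student to a teacher's embeddings with per-sample SGD.

    No labels are used: the teacher's output is the only training target.
    Returns the student weights and the mean loss recorded for each epoch.
    """
    rng = random.Random(seed)
    student = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in inputs:
            target = teacher(x)              # teacher embedding (held fixed)
            pred = apply_linear(student, x)  # student embedding
            total += mse(pred, target)
            # Gradient of the MSE with respect to student[i][j].
            for i in range(out_dim):
                grad_i = 2.0 * (pred[i] - target[i]) / out_dim
                for j in range(in_dim):
                    student[i][j] -= lr * grad_i * x[j]
        history.append(total / len(inputs))
    return student, history


# Hypothetical stand-ins: a fixed random linear "teacher" and random
# "unlabeled" feature vectors in place of real audio features.
data_rng = random.Random(1)
teacher_weights = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]


def toy_teacher(x):
    return apply_linear(teacher_weights, x)


unlabeled = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
student, history = distill(toy_teacher, unlabeled, in_dim=4, out_dim=3)
print(f"first-epoch loss: {history[0]:.4f}, last-epoch loss: {history[-1]:.2e}")
```

In this toy setting the student can match the teacher exactly, so the loss falls toward zero. A real student that is far smaller than its teacher would generally only approximate the teacher's embeddings, which is consistent with the abstract reporting accuracy relative to the original model rather than parity with it.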

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and dialogue systems