Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

Joel Shor; Aren Jansen; Wei Han; Daniel Park; Yu Zhang

arXiv:2110.04621 · cs.SD · December 14, 2022

Open Access

TL;DR

This paper introduces a large-scale self-supervised Conformer-based model that produces universal paralinguistic speech representations, outperforming previous methods across diverse speech understanding tasks.

Contribution

It presents a novel self-supervised training approach for a 600M+ parameter Conformer model to generate universal speech representations applicable to multiple paralinguistic tasks.

Findings

01

Simple linear classifiers trained on the time-averaged representations outperform nearly all previous results, in some cases by large margins.

02

2-second context windows achieve 96% of full-context performance on 7 of 9 tasks.

03

A single universal representation performs near-optimally across all tasks.

Abstract

Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the…
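
The evaluation recipe the abstract describes, a simple linear classifier on a time-averaged representation, can be sketched as follows. This is an illustrative sketch only, not the authors' code: the frame embeddings are synthetic random data standing in for Conformer outputs, the function names (time_average, fit_linear_classifier, predict) are hypothetical, and a ridge-regression fit to one-hot labels is assumed in place of whatever linear classifier the paper actually used.

```python
import numpy as np


def time_average(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) array of frame embeddings to one (dim,) vector."""
    return frames.mean(axis=0)


def fit_linear_classifier(X, y, num_classes, l2=1e-3):
    """Fit a linear map to one-hot targets with ridge regression.

    Assumption: this closed-form fit is a stand-in for any linear classifier.
    Returns (weights, bias) with shapes (dim, num_classes) and (num_classes,).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(num_classes)[y]                     # one-hot targets
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    Wb = np.linalg.solve(A, Xb.T @ Y)
    return Wb[:-1], Wb[-1]


def predict(X, W, b):
    """Return the highest-scoring class index for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in for per-frame embeddings (assumption: the real
# embeddings would come from a pretrained model, which is not shown here).
rng = np.random.default_rng(0)
dim, num_classes = 16, 2
class_means = rng.normal(size=(num_classes, dim)) * 3.0
labels = rng.integers(0, num_classes, size=200)

# Each clip has a different number of frames; averaging over time
# turns every clip into a fixed-size vector.
clips = [
    class_means[c] + rng.normal(size=(int(rng.integers(20, 60)), dim))
    for c in labels
]
X = np.stack([time_average(clip) for clip in clips])

# Train on the first 150 clips, evaluate on the remaining 50.
W, b = fit_linear_classifier(X[:150], labels[:150], num_classes)
accuracy = float((predict(X[150:], W, b) == labels[150:]).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

Averaging over the time axis gives one fixed-size vector per clip regardless of clip length, which is what lets a single linear layer serve as the only trained component on top of a frozen representation.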

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems