Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Themos Stafylakis; Ladislav Mosner; Sofoklis Kakouros; Oldrich Plchot; Lukas Burget; Jan Cernocky

arXiv:2210.09513 · eess.AS · October 19, 2022 · 1 citation

Open Access

TL;DR

This paper explores a novel correlation pooling method to extract speaker and emotion information from self-supervised speech models, demonstrating improved performance over traditional mean pooling techniques.

Contribution

It introduces correlation pooling as an alternative to mean pooling for aggregating speech representations, showing enhanced extraction of speaker and emotion features.

Findings

01 Correlation pooling outperforms mean pooling in extracting speaker and emotion information

02 Fusion of pooling methods yields further performance gains

03 Code implementation is publicly available for reproducibility

Abstract

Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations - correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
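
To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the two pooling ideas. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function names, the toy feature sizes, the `eps` guard, and the choice to keep only the strictly upper triangle of the correlation matrix are all assumptions made for this example. Any preprocessing the paper applies before pooling is not described on this page and is left out.

```python
import numpy as np


def mean_pooling(frames: np.ndarray) -> np.ndarray:
    """Average frame-level features over time: (T, C) -> (C,)."""
    return frames.mean(axis=0)


def correlation_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pool (T, C) frames into the C*(C-1)/2 pairwise channel correlations."""
    # Remove the per-channel mean over time.
    centered = frames - frames.mean(axis=0, keepdims=True)
    # Per-channel standard deviation over time; eps (an assumption of this
    # sketch) guards against division by zero for constant channels.
    std = np.sqrt((centered ** 2).mean(axis=0)) + eps
    normalized = centered / std
    # (C, C) Pearson correlation matrix between channels.
    corr = normalized.T @ normalized / frames.shape[0]
    # The matrix is symmetric with a unit diagonal, so only the strictly
    # upper triangle carries distinct information.
    rows, cols = np.triu_indices(frames.shape[1], k=1)
    return corr[rows, cols]


# Toy input: 200 frames with 8 channels (hypothetical sizes, random data).
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 8))
# Make channel 1 an affine function of channel 0 so their correlation is 1.
frames[:, 1] = 2.0 * frames[:, 0] + 1.0

pooled_mean = mean_pooling(frames)         # shape (8,)
pooled_corr = correlation_pooling(frames)  # shape (28,)
```

Mean pooling keeps one value per channel, while the correlation vector grows quadratically with the number of channels (28 values for 8 channels here). Because each correlation is normalized by the channel means and standard deviations, the pooled vector is unchanged by a per-channel shift or positive rescaling of the features.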

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing