Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning

Yi-Chen Chen; Shu-wen Yang; Cheng-Kuang Lee; Simon See; Hung-yi Lee

arXiv:2110.09930 · eess.AS · October 20, 2021 · 5 cites

TL;DR

This paper explores combining self-supervised pretraining with multi-task finetuning to enhance speech representation learning, demonstrating that MTL finetuning can improve SSL models and generalize to new tasks.

Contribution

It introduces a systematic study of supervised multi-task finetuning on SSL pretrained speech models, showing its benefits and generalizability.

Findings

01. MTL finetuning improves SSL speech representations

02. Supervised MTL enhances model performance on downstream tasks

03. Representations learned by MTL finetuning generalize to unseen tasks

Abstract

Speech representation learning plays a vital role in speech processing. Among its approaches, self-supervised learning (SSL) has become an important research direction. SSL pretrained models have been shown to achieve excellent performance on various downstream speech processing tasks. Supervised multi-task learning (MTL) is another representation learning paradigm, one that has proven effective in computer vision (CV) and natural language processing (NLP). However, there has been no systematic study of general representation learning models trained with supervised MTL in speech processing. In this paper, we show that MTL finetuning can further improve SSL pretraining. We also analyze the generalizability of supervised MTL finetuning to examine whether the speech representations it learns generalize to unseen tasks.
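
To make the idea of multi-task finetuning concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a single linear unit as a stand-in for a pretrained SSL encoder, two hypothetical regression tasks (`task_a`, `task_b`), one scalar head per task, and an unweighted sum of per-task squared errors as the training objective. All names, data, and hyperparameters are illustrative assumptions.

```python
import random

def encode(w_enc, x):
    """Shared encoder: one linear unit standing in for a pretrained SSL model."""
    return sum(w * xi for w, xi in zip(w_enc, x))

def multitask_loss(w_enc, heads, batch):
    """Mean over examples of the summed per-task squared errors."""
    total = 0.0
    for x, targets in batch:
        h = encode(w_enc, x)
        for task, y in targets.items():
            total += (heads[task] * h - y) ** 2
    return total / len(batch)

def finetune_step(w_enc, heads, batch, lr=0.05):
    """One gradient step on the summed loss; updates the encoder and every head."""
    g_enc = [0.0] * len(w_enc)
    g_heads = {task: 0.0 for task in heads}
    for x, targets in batch:
        h = encode(w_enc, x)
        for task, y in targets.items():
            err = 2.0 * (heads[task] * h - y)   # d(loss)/d(prediction)
            g_heads[task] += err * h            # gradient for this task's head
            for i, xi in enumerate(x):          # every task contributes to the shared encoder
                g_enc[i] += err * heads[task] * xi
    n = len(batch)
    new_enc = [w - lr * g / n for w, g in zip(w_enc, g_enc)]
    new_heads = {t: a - lr * g_heads[t] / n for t, a in heads.items()}
    return new_enc, new_heads

# Toy data: both hypothetical tasks depend on the same latent signal s.
rng = random.Random(0)
batch = []
for _ in range(32):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    s = 0.8 * x[0] - 0.3 * x[1]
    batch.append((x, {"task_a": 2.0 * s, "task_b": -1.0 * s}))

w_enc = [0.5, 0.0]                      # assumed "pretrained" initialization
heads = {"task_a": 0.1, "task_b": 0.1}  # freshly initialized task heads

loss_before = multitask_loss(w_enc, heads, batch)
for _ in range(200):
    w_enc, heads = finetune_step(w_enc, heads, batch)
loss_after = multitask_loss(w_enc, heads, batch)
print(loss_before, loss_after)
```

The point of the sketch is that gradients from every task flow into the same encoder parameters, so the shared representation is shaped by all tasks at once, while each head stays task-specific. Testing generalization to an unseen task would then amount to attaching a new head to the finetuned encoder.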

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems