Progressive Residual Extraction based Pre-training for Speech Representation Learning

Tianrui Wang; Jin Li; Ziyang Ma; Rui Cao; Xie Chen; Longbiao Wang; Meng Ge; Xiaobao Wang; Yuguang Wang; Jianwu Dang; Nyima Tashi

arXiv:2409.00387 · eess.AS · September 4, 2024

TL;DR

This paper introduces ProgRE, a progressive residual extraction method for speech SSL that enhances task-specific representations by isolating pitch, speaker, and content information, improving performance across diverse speech tasks.

Contribution

The paper proposes a novel progressive residual extraction approach with specialized modules, enabling better separation of speech features for multiple downstream tasks.

Findings

01. Improved performance on speaker identification, speech recognition, and emotion recognition.

02. Effective extraction of pitch, speaker, and content features.

03. Outperforms existing SSL methods such as wav2vec 2.0, HuBERT, and WavLM.

Abstract

Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each requiring different speech information, poses significant challenges. To this end, we propose a progressive residual extraction based self-supervised learning method, named ProgRE. Specifically, we introduce two lightweight and specialized task modules into an encoder-style SSL backbone to enhance its ability to extract pitch variation and speaker information from speech. Furthermore, to prevent the reinforced pitch variation and speaker information from interfering with the learning of unrelated content information, we residually remove the information extracted by these two modules from the main branch. The main branch is then trained using…
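The residual removal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear stand-ins for the two task modules, the frame-level shapes, and the pitch-then-speaker ordering are all assumptions made for illustration. It only shows the general idea that each module's output is subtracted from the main branch before the next stage sees it.

```python
import random

def linear(frames, weights):
    """Apply a weight matrix (d_out x d_in) to every frame (T x d_in)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]

def subtract(a, b):
    """Element-wise a - b for two T x D lists of lists."""
    return [[x - y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for a trained module (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def progressive_residual_extraction(hidden, w_pitch, w_speaker):
    # Stage 1 (assumed order): a hypothetical pitch module reads the main branch.
    pitch = linear(hidden, w_pitch)
    # Remove what the pitch module extracted from the main branch.
    residual = subtract(hidden, pitch)
    # Stage 2: a hypothetical speaker module reads the pitch-free residual.
    speaker = linear(residual, w_speaker)
    # Remove the speaker information; what remains feeds content learning.
    content_input = subtract(residual, speaker)
    return pitch, speaker, content_input

rng = random.Random(0)
T, D = 4, 8  # frames and feature dimension, chosen arbitrarily
hidden = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
w_pitch = random_matrix(D, D, rng)
w_speaker = random_matrix(D, D, rng)

pitch, speaker, content_input = progressive_residual_extraction(
    hidden, w_pitch, w_speaker
)
```

By construction, the three outputs sum back to the original hidden features, so nothing is discarded; the information is only routed to different branches. In a real model the modules would be trained networks with their own objectives rather than random linear maps.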

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need