Progressive Residual Extraction based Pre-training for Speech Representation Learning
Tianrui Wang, Jin Li, Ziyang Ma, Rui Cao, Xie Chen, Longbiao Wang, Meng Ge, Xiaobao Wang, Yuguang Wang, Jianwu Dang, Nyima Tashi

TL;DR
This paper introduces ProgRE, a progressive residual extraction method for speech SSL that enhances task-specific representations by isolating pitch, speaker, and content information, improving performance across diverse speech tasks.
Contribution
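The paper proposes a progressive residual extraction approach that adds two lightweight, specialized task modules (for pitch variation and speaker information) to an encoder-style SSL backbone and residually removes what they extract from the main branch, so that each downstream task can draw on better-separated speech features.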
Findings
Improved performance on speaker identification, speech recognition, and emotion recognition.
Effective extraction of pitch, speaker, and content features.
Outperforms existing SSL methods such as wav2vec 2.0, HuBERT, and WavLM.
Abstract
Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each requiring different speech information, poses significant challenges. To this end, we propose a progressive residual extraction based self-supervised learning method, named ProgRE. Specifically, we introduce two lightweight and specialized task modules into an encoder-style SSL backbone to enhance its ability to extract pitch variation and speaker information from speech. Furthermore, to prevent the reinforced pitch variation and speaker information from interfering with the learning of content information, to which they are irrelevant, we residually remove the information extracted by these two modules from the main branch. The main branch is then trained using…
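The residual removal described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the function name `progressive_residual_extraction`, the single-matrix "modules" `w_pitch` and `w_speaker`, the pitch-then-speaker order, and the mean pooling used for the speaker vector are all assumptions made only to show the idea that each module's output is subtracted from the main branch before the next stage.

```python
import numpy as np


def progressive_residual_extraction(h, w_pitch, w_speaker):
    """Toy illustration of progressive residual extraction.

    h         : (frames, dim) hidden features from a backbone (hypothetical)
    w_pitch   : (dim, dim) weights of a stand-in pitch module (assumption)
    w_speaker : (dim, dim) weights of a stand-in speaker module (assumption)
    """
    # Stage 1: a stand-in pitch module extracts frame-level features.
    pitch = np.tanh(h @ w_pitch)
    # Remove the extracted pitch information from the main branch.
    h_wo_pitch = h - pitch

    # Stage 2: a stand-in speaker module; mean pooling over frames is an
    # assumption that yields one utterance-level vector.
    speaker = np.tanh(h_wo_pitch @ w_speaker).mean(axis=0, keepdims=True)
    # Remove the speaker vector from every frame (broadcast subtraction).
    residual = h_wo_pitch - speaker

    return pitch, speaker, residual


rng = np.random.default_rng(0)
frames, dim = 50, 16
h = rng.standard_normal((frames, dim))
w_pitch = rng.standard_normal((dim, dim)) * 0.1
w_speaker = rng.standard_normal((dim, dim)) * 0.1

pitch, speaker, residual = progressive_residual_extraction(h, w_pitch, w_speaker)
print(pitch.shape, speaker.shape, residual.shape)
```

In this sketch the three outputs add back up to the input features, which is what "residual" means here: whatever the two modules extract is no longer present in the branch that would go on to learn content.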
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
