Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

TL;DR
This paper uses synthetic speech from a text-to-speech (TTS) system, itself built on self-supervised learning (SSL) features, to augment a low-resource pre-training corpus. The reported result is a 90% reduction in the speech data needed for SSL pre-training, with only slight performance degradation.
Contribution
It builds a high-quality TTS system from limited resources using SSL features, then uses the synthetic corpus that system generates to augment a low-resource pre-training dataset for speech SSL.
Findings
Reduces the demand for speech data in pre-training by 90%
Incurs only slight performance degradation at the reduced data size
To the authors' knowledge, the first work aimed at enhancing low-resource SSL in speech processing
Abstract
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
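The augmentation step described in the abstract can be pictured as mixing a small real corpus with TTS output before pre-training. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: `Utterance`, `build_pretraining_corpus`, `real_fraction`, and the `synthesize` callback are hypothetical names introduced for illustration, and the TTS system is represented by a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    utt_id: str
    hours: float     # duration in hours
    synthetic: bool  # True if produced by the TTS system


def build_pretraining_corpus(
    real: List[Utterance],
    texts: List[str],
    synthesize: Callable[[str, str], Utterance],
) -> List[Utterance]:
    """Append one synthetic utterance per text to the real corpus.

    `synthesize` is a hypothetical stand-in for the TTS system; it takes an
    utterance id and a text and returns an Utterance marked as synthetic.
    """
    synthetic = [synthesize(f"syn-{i}", text) for i, text in enumerate(texts)]
    return list(real) + synthetic


def real_fraction(corpus: List[Utterance]) -> float:
    """Share of total corpus duration that comes from real speech."""
    total = sum(u.hours for u in corpus)
    real = sum(u.hours for u in corpus if not u.synthetic)
    return real / total if total else 0.0
```

A caller would pass its own TTS wrapper as `synthesize` and could use `real_fraction` to check how much of the pre-training mix is real speech. Any durations used with this sketch are illustrative and are not taken from the paper.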
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
