Non-Causal to Causal SSL-Supported Transfer Learning: Towards a High-Performance Low-Latency Speech Vocoder

Renzheng Shi; Andreas Bär; Marvin Sach; Wouter Tirry; Tim Fingscheidt

arXiv:2408.11842·eess.AS·August 27, 2024·IWAENC

Open Access

TL;DR

This paper develops a low-latency causal speech vocoder by integrating causal convolutions, transfer learning, and SSL-based representation alignment, achieving high-quality speech synthesis comparable to high-delay models.

Contribution

It introduces a novel transfer learning scheme and SSL-based alignment to enhance low-latency causal vocoders, bridging the performance gap with non-causal models.

Findings

01

Causal vocoder achieves 3.96 PESQ, surpassing the original non-causal BigVGAN.

02

Transfer learning and SSL alignment significantly improve low-latency vocoder performance.

03

The proposed model maintains high speech quality with only 21% increased complexity.

Abstract

Recently, BigVGAN has emerged as a high-performance speech vocoder. Its sequence-to-sequence-based synthesis, however, prohibits usage in low-latency conversational applications. Our work addresses this shortcoming in three steps. First, we introduce low latency into BigVGAN by implementing causal convolutions, which decreases performance. Second, to regain performance, we propose a teacher-student transfer learning scheme to distill the high-delay non-causal BigVGAN into our low-latency causal vocoder. Third, taking advantage of a self-supervised learning (SSL) model, in our case wav2vec 2.0, we align its encoder speech representations extracted from our low-latency causal vocoder to the ground truth ones. In speaker-independent settings, both proposed training schemes notably elevate the performance of our low-latency vocoder, closing up to the original high-delay BigVGAN. At only…
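The first step in the abstract, replacing ordinary convolutions with causal ones, can be illustrated with a minimal sketch. This is not BigVGAN's actual layer code: it is a dependency-free 1-D convolution in plain Python, and the names `conv1d`, `first_changed_index`, and the 3-tap `kernel` are hypothetical choices made for illustration. It shows that left-only padding makes each output sample depend only on current and past inputs, whereas symmetric padding introduces lookahead.

```python
def conv1d(x, kernel, causal):
    """1-D convolution (cross-correlation) with output length equal to len(x).

    causal=True  : zero-pad only on the left, so y[t] uses x[t-k+1 .. t].
    causal=False : zero-pad symmetrically, so y[t] also uses future samples.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]


def first_changed_index(a, b):
    """Return the first index where the two sequences differ, or None."""
    for i, (u, v) in enumerate(zip(a, b)):
        if u != v:
            return i
    return None


kernel = [0.25, 0.5, 0.25]            # hypothetical 3-tap filter
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_future = list(x)
x_future[4] = 100.0                   # perturb the input sample at t = 4

for causal in (True, False):
    y = conv1d(x, kernel, causal)
    y_perturbed = conv1d(x_future, kernel, causal)
    # Earliest output sample affected by the change at t = 4
    print("causal" if causal else "non-causal", first_changed_index(y, y_perturbed))
```

In the causal case the earliest affected output is at t = 4, the same time step as the perturbed input. In the non-causal case it is at t = 3, meaning the output needed one sample of lookahead. Stacked over many layers, such lookahead accumulates into algorithmic delay, which is why a vocoder aimed at conversational use has to avoid it.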

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing

Methods: ALIGN