Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Kenichi Fujita, Takanori Ashihara, Marc Delcroix, and Yusuke Ijima

arXiv:2407.01291 · cs.SD · July 2, 2024

TL;DR

This paper introduces a lightweight zero-shot text-to-speech method using a mixture of adapters, enabling high-quality speaker adaptation with less than 40% of the baseline's parameters and 1.9x faster inference.

Contribution

The paper proposes integrating a mixture of adapters into a non-autoregressive TTS model for efficient zero-shot speaker adaptation.

Findings

01 Achieves better speech quality than the baseline

02 Uses less than 40% of the baseline's parameters

03 Provides 1.9x faster inference than the baseline

Abstract

Recent advances in zero-shot text-to-speech (TTS) methods based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of parameters at 1.9 times faster inference…
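To make the adapter-selection idea in the abstract concrete, here is a minimal sketch of a mixture-of-adapters layer conditioned on a speaker embedding. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the gating function, the adapter shape, or any dimensions, so the softmax gate, the bottleneck adapters with ReLU, the residual connection, and every name and size below (`MixtureOfAdapters`, `num_adapters`, `bottleneck`, and so on) are hypothetical choices. Pure Python is used so the example runs without dependencies.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfAdapters:
    """Hypothetical MoA layer: a gate driven by the speaker embedding
    weights several small bottleneck adapters applied to a hidden vector."""

    def __init__(self, hidden_dim, spk_dim, num_adapters=4, bottleneck=8, seed=0):
        rng = random.Random(seed)
        # Gate: maps the speaker embedding to one logit per adapter (assumed design).
        self.gate = random_matrix(num_adapters, spk_dim, rng)
        # Each adapter is a down-projection followed by an up-projection.
        self.down = [random_matrix(bottleneck, hidden_dim, rng) for _ in range(num_adapters)]
        self.up = [random_matrix(hidden_dim, bottleneck, rng) for _ in range(num_adapters)]

    def gate_weights(self, spk_emb):
        """Soft selection weights over adapters; they sum to 1."""
        return softmax(matvec(self.gate, spk_emb))

    def __call__(self, hidden, spk_emb):
        weights = self.gate_weights(spk_emb)
        out = list(hidden)  # residual path keeps the backbone output intact
        for w, down, up in zip(weights, self.down, self.up):
            z = [max(0.0, v) for v in matvec(down, hidden)]  # ReLU bottleneck
            delta = matvec(up, z)
            out = [o + w * d for o, d in zip(out, delta)]
        return out


# Usage: adapt one hidden frame for one (random, illustrative) speaker embedding.
rng = random.Random(1)
moa = MixtureOfAdapters(hidden_dim=16, spk_dim=6)
hidden_frame = [rng.gauss(0.0, 1.0) for _ in range(16)]
speaker_embedding = [rng.gauss(0.0, 1.0) for _ in range(6)]
adapted_frame = moa(hidden_frame, speaker_embedding)
```

Because each adapter only adds two small projection matrices, the parameter overhead grows with `num_adapters * bottleneck` rather than with the size of the backbone, which is the general reason adapter-style modules can stay lightweight. Whether the paper uses soft weighting as above or a harder selection is not stated in the abstract.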

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Adapter