TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers

Yakun Song; Zhuo Chen; Xiaofei Wang; Ziyang Ma; Guanrou Yang; Xie Chen

arXiv:2406.15752·eess.AS·June 25, 2024

TL;DR

TacoLM is an efficient neural codec language model with gated attention mechanisms that significantly improves zero-shot text-to-speech synthesis speed, stability, and accuracy while reducing model size.

Contribution

TacoLM introduces a gated attention mechanism and adds a gated cross-attention layer to each decoder layer, improving efficiency and content accuracy in neural codec language models for TTS.

Findings

1. Uses 90% fewer parameters than VALL-E
2. Speeds up inference by 5.2 times
3. Improves word error rate, speaker similarity, and mean opinion score on LibriSpeech

Abstract

Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, codec LMs often suffer from limited inference speed and stability, due to their auto-regressive nature and the implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and reduce the model size. Meanwhile, an additional gated cross-attention layer is included in each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In evaluation on the LibriSpeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score, with 90% fewer parameters and a 5.2 times speedup, compared with…
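The abstract does not spell out how the gate is formulated, so the following is only a generic sketch of one way a gate can modulate causal self-attention: the attention output at each position is multiplied elementwise by a sigmoid gate computed from that position's input. It is not taken from the TacoLM repository. The function and weight names (`gated_causal_attention`, `w_q`, `w_k`, `w_v`, `w_g`), the single-head layout, and the sigmoid gate are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_causal_attention(x, w_q, w_k, w_v, w_g):
    """Single-head causal attention with an elementwise sigmoid gate.

    x is a (T, d) sequence and each weight is a (d, d) matrix.
    The gate form is an assumption, not the paper's stated design.
    """
    q, k, v, g = (matmul(x, w) for w in (w_q, w_k, w_v, w_g))
    d = len(w_q[0])
    out = []
    for t in range(len(x)):
        # Causal mask: position t attends only to positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        context = [sum(w * v[s][j] for s, w in enumerate(weights)) for j in range(d)]
        # Gate: scale each context feature by a value in (0, 1).
        out.append([sigmoid(g[t][j]) * context[j] for j in range(d)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D = 4, 3  # toy sequence length and feature size
x = random_matrix(T, D, rng)
w_q, w_k, w_v, w_g = (random_matrix(D, D, rng) for _ in range(4))
y = gated_causal_attention(x, w_q, w_k, w_v, w_g)
```

Because of the causal mask, changing the last input position leaves the outputs at earlier positions unchanged, which is the property that makes auto-regressive decoding possible. A gated cross-attention layer would follow the same pattern, with keys and values computed from the text sequence instead of from `x` and without the causal mask.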

Code & Models

Repositories

Ereboas/TacoLM (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings