Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet

Shilun Lin; Fenglong Xie; Li Meng; Xinhui Li; Li Lu

arXiv:2102.00247 · cs.CL · April 8, 2021 · 1 citation

Open Access

TL;DR

This paper introduces Triple M, a TTS system combining multi-guidance attention for improved naturalness and a multi-band multi-time vocoder for enhanced efficiency, suitable for large-scale online deployment.

Contribution

The paper presents a novel multi-guidance attention mechanism and an efficient multi-band multi-time LPCNet vocoder, advancing TTS quality and computational efficiency.

Findings

01  Reduces word error rate by 26.8% compared with a single attention mechanism

02  Speeds up LPCNet by 2.75x on a single CPU

03  Reduces vocoder computational complexity from 2.8 to 1.0 GFLOP

Abstract

In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online application. The key components of Triple M are: 1) A sequence-to-sequence model that adopts a novel multi-guidance attention to transfer complementary advantages from guiding attention mechanisms to the basic attention mechanism without in-domain performance loss or online service modification. Compared with a single attention mechanism, multi-guidance attention not only brings better naturalness to long-sentence synthesis, but also reduces the word error rate by 26.8%. 2) A new efficient multi-band multi-time vocoder framework, which reduces the computational complexity from 2.8 to 1.0 GFLOP and speeds up LPCNet by 2.75x on a single CPU.
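
To put the abstract's figures side by side, the short sketch below computes the ratio of the two quoted complexity numbers and compares it with the reported CPU speedup. It also shows what a 26.8% reduction would mean for a word error rate, assuming the figure is a relative reduction (the abstract does not say so explicitly). The baseline WER of 5.0% is a hypothetical value chosen only for illustration and does not come from the paper; the function names are likewise illustrative.

```python
# Figures quoted in the abstract.
BASELINE_GFLOP = 2.8            # vocoder complexity before the change
PROPOSED_GFLOP = 1.0            # multi-band multi-time vocoder complexity
REPORTED_SPEEDUP = 2.75         # speedup over LPCNet on a single CPU
RELATIVE_WER_REDUCTION = 0.268  # assumed to be a relative reduction


def complexity_ratio(before_gflop: float, after_gflop: float) -> float:
    """Return how many times fewer operations the new vocoder needs."""
    return before_gflop / after_gflop


def wer_after(baseline_wer: float, relative_reduction: float) -> float:
    """Apply a relative reduction to a baseline word error rate."""
    return baseline_wer * (1.0 - relative_reduction)


ratio = complexity_ratio(BASELINE_GFLOP, PROPOSED_GFLOP)
print(f"Complexity ratio: {ratio:.2f}x, reported speedup: {REPORTED_SPEEDUP:.2f}x")

# Hypothetical baseline WER (percent); NOT a number from the paper.
hypothetical_baseline_wer = 5.0
new_wer = wer_after(hypothetical_baseline_wer, RELATIVE_WER_REDUCTION)
print(f"WER {hypothetical_baseline_wer:.2f}% -> {new_wer:.2f}%")
```

The complexity ratio (2.8x) and the reported speedup (2.75x) are close, which is consistent with the speedup coming mainly from the reduced operation count, though the abstract itself does not state this.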

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence