Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis
Théodor Lemerle, Téo Guichoux, Axel Roebel, Nicolas Obin

TL;DR
Lina-Speech introduces Gated Linear Attention and Initial-State Tuning to enhance multi-sample prompt-based TTS, enabling better voice cloning, style, and emotion adaptation with improved inference efficiency.
Contribution
The paper presents a novel TTS model using Gated Linear Attention and a stateful tuning strategy for flexible, efficient voice cloning and style transfer from multiple speech samples.
Findings
Improved inference throughput with Gated Linear Attention.
Effective multi-sample conditioning for voice cloning.
Enhanced control over prosody and emotion.
Abstract
Neural codec language models, built on the transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length restricts them to short speech samples. As a result, voice cloning covers only a limited range and diversity of the speaker's prosody and style. Moreover, adapting prosody, accent, or appropriate emotion from a short prefix remains challenging. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model that uses Gated Linear Attention (GLA) in place of standard self-attention as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architecture, we introduce an…
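The stateful property mentioned in the abstract can be illustrated with a minimal sketch of gated linear attention in its recurrent form. It assumes the commonly used formulation S_t = diag(alpha_t) S_{t-1} + k_t^T v_t with output o_t = q_t S_t; the function name gla_recurrent and the argument S0 are hypothetical, and the abstract does not specify how Lina-Speech parameterizes its gates or its initial state. This is an illustration of the general mechanism, not the authors' implementation.

```python
import numpy as np

def gla_recurrent(q, k, v, alpha, S0=None):
    """Single-head gated linear attention, recurrent form (illustrative sketch).

    q, k  : (T, d_k) queries and keys
    v     : (T, d_v) values
    alpha : (T, d_k) per-step gates, assumed to lie in (0, 1]
    S0    : optional (d_k, d_v) initial state (hypothetical name); this is the
            quantity that an initial-state tuning scheme would learn.
    Returns the outputs (T, d_v) and the final state (d_k, d_v).
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v)) if S0 is None else np.array(S0, dtype=float)
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay each row of the state by its gate, then add the new key-value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        # Read out with the current query.
        out[t] = q[t] @ S
    return out, S
```

Each step touches only a fixed-size state, so the per-token cost does not grow with sequence length, in contrast to the quadratic cost of self-attention noted above. Because the whole history is summarized in S, a state computed from earlier audio (or a learned S0) can be passed in as the starting point for new input.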
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
