Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Álvaro Martín-Cortinas; Daniel Sáez-Trigueros; Iván Vallés-Pérez; Biel Tura-Vecino; Piotr Biliński; Mateusz Lajszczak; Grzegorz Beringer; Roberto Barra-Chicote; Jaime Lorenzo-Trueba

arXiv:2402.03407 · eess.AS · February 7, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a self-supervised voice conversion (VC) architecture that separates speaker identity from content. Training LLM-based text-to-speech on the resulting speaker-disentangled codes improves stability and naturalness.

Contribution

The work presents a novel self-supervised VC architecture that learns speaker-disentangled representations, enhancing LLM-based speech synthesis stability and quality.

Findings

01 4.7pp improvement in speaker similarity over SOTA

02 5.4pp lower word error rate (WER) compared to entangled representations

03 Higher naturalness than human recordings

Abstract

Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping or speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) architecture which can be used to learn to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations. Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech only from the text, similarly to humans, while the speaker identity is provided by the decoder of the VC model. Results show that LLMs trained over…
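The separation described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture. It assumes, purely for illustration, that each frame is the sum of a time-varying content vector and a constant speaker offset, so that a time average recovers the stationary part and the residual carries the transitory part. The function names, the codebook, and the nearest-neighbour quantizer are all hypothetical stand-ins.

```python
import numpy as np


def split_features(frames):
    """Split (T, D) frames into a time-invariant part and a per-frame residual.

    Toy assumption: stationary information is whatever survives averaging
    over time; the paper learns this separation instead.
    """
    stationary = frames.mean(axis=0)   # (D,) stand-in for speaker / recording conditions
    transitory = frames - stationary   # (T, D) stand-in for content
    return stationary, transitory


def quantize(transitory, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    diffs = transitory[:, None, :] - codebook[None, :, :]   # (T, K, D)
    return np.linalg.norm(diffs, axis=-1).argmin(axis=1)    # (T,)


def decode(codes, codebook, speaker):
    """Rebuild frames from content codes plus a speaker vector chosen at decode time."""
    return codebook[codes] + speaker


# Hypothetical toy data: two content units, two speakers.
codebook = np.array([[1.0, -1.0], [-1.0, 1.0]])
content = codebook[[0, 1, 1, 0]]       # time-varying part, zero mean over time
speaker_a = np.array([5.0, 5.0])
speaker_b = np.array([-3.0, 2.0])
frames_a = content + speaker_a         # "recording" of speaker A

stationary, transitory = split_features(frames_a)
codes = quantize(transitory, codebook)             # speaker-independent codes
converted = decode(codes, codebook, speaker_b)     # same content, speaker B
```

In this toy setting the codes carry no trace of `speaker_a`, so a sequence model trained to predict them from text would only need to model content, while the speaker is supplied at decoding time, which mirrors the division of labour the abstract describes.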

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques