Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining

Jinlong Xue; Yayue Deng; Yingming Gao; Ya Li

arXiv:2406.03714·cs.SD·June 7, 2024·1 cites

TL;DR

This paper introduces a retrieval-augmented approach for prompt-based text-to-speech synthesis that leverages context-aware features to improve speaker cloning and style transfer, outperforming existing methods.

Contribution

It adapts retrieval augmented generation to TTS with context-aware contrastive pretraining, enhancing prompt selection and speech synthesis quality.
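Contrastive language-audio pretraining generally trains a text encoder and an audio encoder so that matched text/audio pairs score higher than mismatched ones. As a rough illustration only, the sketch below computes a symmetric InfoNCE-style loss of the kind commonly used in such models. The source does not state the exact CA-CLAP objective, so the function names, the temperature value, and the toy embeddings are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are rejected."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    text_embs: Sequence[Sequence[float]],
    audio_embs: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Average of text-to-audio and audio-to-text cross-entropy losses.

    Pair i of text_embs is assumed to match pair i of audio_embs.
    """
    if len(text_embs) != len(audio_embs) or not text_embs:
        raise ValueError("need the same non-zero number of text and audio embeddings")
    text = [_normalize(v) for v in text_embs]
    audio = [_normalize(v) for v in audio_embs]
    # logits[i][j] = cosine similarity of text i and audio j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(t, a)) / temperature for a in audio]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_transposed)
    )


# Toy embeddings (hypothetical): aligned pairs should give a lower loss
# than the same embeddings with the audio side swapped.
matched_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched_loss, mismatched_loss)
```

In a context-aware variant, the text-side embedding would presumably be computed from the target text together with its surrounding context, but how the paper does this is not described in the material here.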

Findings

01. The proposed RAG method outperforms baselines in TTS tasks.

02. CA-CLAP achieves better style-related feature extraction.

03. Both subjective and objective evaluations show improved results.

Abstract

Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG…
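The retrieval step described in the abstract can be pictured as a nearest-neighbour search: embed the current text and its context, then pick the stored speech prompt whose embedding is most similar. The sketch below is a minimal illustration under that assumption. The names `cosine_similarity`, `retrieve_prompts`, and `prompt_bank`, the file names, and the three-dimensional toy embeddings are hypothetical; a real system would obtain embeddings from a trained encoder such as CA-CLAP, and the paper's actual retrieval procedure may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(
    query_embedding: Sequence[float],
    prompt_bank: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (prompt_id, similarity) pairs, most similar first."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    scored = [
        (prompt_id, cosine_similarity(query_embedding, embedding))
        for prompt_id, embedding in prompt_bank.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hand-made toy embeddings standing in for encoder outputs (hypothetical).
prompt_bank = {
    "calm_prompt.wav": [0.9, 0.1, 0.0],
    "excited_prompt.wav": [0.1, 0.9, 0.2],
    "sad_prompt.wav": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of the target text plus its context.
query = [0.2, 0.8, 0.1]

best = retrieve_prompts(query, prompt_bank, top_k=2)
print(best)
```

The retrieved prompt would then be passed to the prompt-based TTS model in place of a manually or randomly chosen one.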

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Dialogue Systems