Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

Detai Xin; Sharath Adavanne; Federico Ang; Ashish Kulkarni; Shinnosuke Takamichi; Hiroshi Saruwatari

arXiv:2211.02336·cs.SD·November 7, 2022

TL;DR

This paper introduces a multimodal, context-aware TTS system for multi-speaker Japanese audiobooks that improves prosody by combining preceding acoustic context with bilateral textual context. In objective and subjective evaluations, it significantly outperforms two previous methods.

Contribution

The paper proposes a multimodal context-encoding approach for TTS that uses both acoustic and textual information to improve prosody, and it reports insights on the choice of context modality, lateral information, and context length.

Findings

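01 The proposed method significantly outperforms two previous methods in objective and subjective evaluations.

02 An acoustic context encoder and a textual context encoder aggregate context information and feed it to the TTS model, enabling context-dependent prosody prediction.

03 The paper offers insights into the choice of context modality, lateral information (unilateral vs. bilateral), and context length.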

Abstract

We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, namely preceding acoustic context and bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights about the different choices of context (modalities, lateral information, and length) for…
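The data flow described in the abstract (two context encoders whose outputs condition the TTS model) can be illustrated with a minimal sketch. This is not the authors' architecture: the function names (`mean_pool`, `acoustic_context`, `textual_context`, `condition`), the use of mean pooling in place of learned encoders, and concatenation as the conditioning step are all assumptions chosen only to show which inputs each modality draws on.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def acoustic_context(prev_utterance_frames: List[List[Vector]]) -> Vector:
    """Hypothetical stand-in for an acoustic context encoder.

    Uses only PRECEDING utterances: pools the frames of each utterance,
    then pools across utterances.
    """
    return mean_pool([mean_pool(frames) for frames in prev_utterance_frames])


def textual_context(prev_sentences: List[Vector],
                    next_sentences: List[Vector]) -> Vector:
    """Hypothetical stand-in for a textual context encoder.

    Bilateral: pools sentence embeddings from both before and after
    the current sentence.
    """
    return mean_pool(prev_sentences + next_sentences)


def condition(phoneme_states: List[Vector],
              acoustic_ctx: Vector,
              textual_ctx: Vector) -> List[Vector]:
    """Append both context vectors to every phoneme-level state (assumed)."""
    context = acoustic_ctx + textual_ctx  # list concatenation
    return [state + context for state in phoneme_states]


# Toy inputs: two preceding utterances with 2-dim acoustic frames,
# one preceding and one following sentence with 1-dim text embeddings.
prev_frames = [[[1.0, 3.0], [3.0, 5.0]], [[0.0, 0.0], [2.0, 2.0]]]
a_ctx = acoustic_context(prev_frames)
t_ctx = textual_context([[1.0]], [[3.0]])
conditioned = condition([[0.1], [0.2]], a_ctx, t_ctx)
```

In this toy setup, every phoneme-level state receives the same utterance-level context vector, so a downstream prosody predictor could in principle vary its output with the surrounding audio and text. A real system would replace the pooling functions with trained encoders.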

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing