Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts
Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR
This paper introduces a multimodal, context-aware, multi-speaker Japanese audiobook TTS system that improves prosody by combining preceding acoustic context with bilateral textual context; it outperforms two previous methods in objective and subjective evaluations.
Contribution
The paper proposes a multimodal context encoding approach for TTS that combines acoustic and textual context to improve prosody, and reports insights on the choice of context modality, lateral information, and context length.
Findings
Significant improvement over previous TTS methods in prosody quality.
Effective use of multimodal context encoding for speech synthesis.
Insights into modality, lateral information, and context length choices.
Abstract
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights about the different choices of context - modalities, lateral information and length - for…
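To make the abstract's distinction between preceding acoustic context and bilateral textual context concrete, here is a minimal sketch of how context sentences might be selected for one target sentence. The function name `select_context`, the `length` parameter, and the assumption that context is counted in whole sentences per side are all hypothetical; the abstract does not specify the paper's actual windowing or encoder design.

```python
from typing import Dict, List


def select_context(index: int, num_sentences: int, length: int) -> Dict[str, List[int]]:
    """Pick context sentence indices for the sentence at `index`.

    Assumption: `length` counts sentences on each side of the target.
    This is an illustrative choice, not the paper's stated configuration.
    """
    if not 0 <= index < num_sentences:
        raise ValueError("index out of range")
    if length < 0:
        raise ValueError("length must be non-negative")
    # Sentences before the target, clipped at the start of the book.
    preceding = list(range(max(0, index - length), index))
    # Sentences after the target, clipped at the end of the book.
    following = list(range(index + 1, min(num_sentences, index + length + 1)))
    return {
        "acoustic": preceding,             # preceding (unilateral) acoustic context
        "textual": preceding + following,  # bilateral textual context
    }


ctx = select_context(index=5, num_sentences=10, length=2)
print(ctx)  # {'acoustic': [3, 4], 'textual': [3, 4, 6, 7]}
```

In a full system, the selected indices would be mapped to audio and text features and passed to the acoustic and textual context encoders, whose outputs condition the TTS model.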
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
