Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Shun Lei; Yixuan Zhou; Liyang Chen; Zhiyong Wu; Shiyin Kang; Helen Meng

arXiv:2203.12201·cs.SD·April 7, 2022

Open Access

TL;DR

This paper introduces a hierarchical context-aware framework with a novel training strategy for Mandarin speech synthesis, significantly enhancing expressiveness and naturalness by modeling broader contextual information.

Contribution

It proposes a hierarchical context encoder and a knowledge distillation training strategy to better capture speech style from wider context in Mandarin synthesis.

Findings

01. Improved naturalness and expressiveness in synthesized speech.

02. Significant gains demonstrated through objective and subjective evaluations.

Abstract

Previous work on expressive speech synthesis focuses mainly on the current sentence. Context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text and a lack of speech variation. In this paper, we propose a hierarchical framework that models speaking style from context. A hierarchical context encoder explores a wider range of contextual information by taking the structural relationships in the context into account, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn better style representations, we introduce a novel training strategy based on knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method significantly improves the naturalness and expressiveness of the synthesized speech.
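The two ideas in the abstract, a two-level (phrase, then sentence) context encoder and a distillation target for its output, can be illustrated with a toy sketch. This is not the paper's architecture: the names `mean_pool`, `encode_context`, and `distillation_loss` are hypothetical, plain mean pooling stands in for whatever learned layers capture inter-phrase and inter-sentence relations, and the teacher style embedding is assumed to come from some separately trained model.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty list")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_context(context: List[List[Vector]]) -> Vector:
    """Toy hierarchical encoder (an assumption, not the paper's model).

    context[s][p] is the embedding of phrase p in sentence s.
    """
    # Level 1: combine the phrases of each sentence into a sentence vector.
    sentence_vectors = [mean_pool(phrases) for phrases in context]
    # Level 2: combine the sentence vectors into one style vector.
    return mean_pool(sentence_vectors)


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between the predicted and the teacher style vector."""
    if len(student) != len(teacher):
        raise ValueError("vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Hypothetical 2-dimensional phrase embeddings for two adjacent sentences.
context = [
    [[1.0, 0.0], [3.0, 2.0]],  # neighbouring sentence: two phrases
    [[0.0, 4.0]],              # current sentence: one phrase
]
teacher_style = [1.0, 3.0]     # assumed teacher output, made up for the demo

style = encode_context(context)
loss = distillation_loss(style, teacher_style)
```

In a real system the pooling steps would be replaced by trainable layers and the loss would be minimised by gradient descent; the sketch only shows how the hierarchy and the distillation target fit together.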

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques