Syntactic representation learning for neural network based TTS with syntactic parse tree traversal
Changhe Song, Jingbei Li, Yixuan Zhou, Zhiyong Wu, Helen Meng

TL;DR
This paper introduces a novel method for automatically learning syntactic representations from parse trees to improve neural TTS systems, resulting in more natural speech synthesis.
Contribution
It proposes a syntactic representation learning approach using parse tree traversal and GRU networks, enhancing prosody and naturalness in TTS without manual feature design.
Findings
Mean opinion score (MOS) increased from 3.70 to 3.82
ABX preference exceeded the baseline by 17%
Prosodic differences are perceptible in multi-parse sentences
Abstract
The syntactic structure of a sentence is correlated with the prosodic structure of its speech, which is crucial for improving the prosody and naturalness of a text-to-speech (TTS) system. Current TTS systems usually incorporate syntactic structure information through manually designed features based on expert knowledge. In this paper, we propose a syntactic representation learning method based on syntactic parse tree traversal to automatically utilize the syntactic structure information. Two constituent label sequences are linearized from the constituent parse tree through left-first and right-first traversals. Syntactic representations are then extracted at word level from each constituent label sequence by a corresponding uni-directional gated recurrent unit (GRU) network. Meanwhile, nuclear-norm maximization loss is introduced to enhance the discriminability and diversity of the…
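The traversal step described in the abstract can be illustrated with a minimal sketch. It assumes a pre-order depth-first walk that visits children left-to-right ("left-first") or right-to-left ("right-first"); the paper's exact traversal rules, the tuple-based tree encoding, and the `linearize` helper are illustrative assumptions, not the authors' implementation. Alongside the label sequence, it records the index of the last label emitted before each word, which is one plausible place to read off a word-level state from a uni-directional GRU run over the labels.

```python
from typing import List, Tuple, Union

# Hypothetical tree encoding: a node is (label, children); a leaf is a word string.
Node = Union[str, tuple]


def linearize(tree: Node, right_first: bool = False) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Pre-order traversal of a constituent tree (assumed traversal scheme).

    Returns the constituent labels in visiting order, plus (word, index) pairs
    where index points at the last label emitted before the word was reached.
    """
    labels: List[str] = []
    words: List[Tuple[str, int]] = []

    def visit(node: Node) -> None:
        if isinstance(node, str):          # leaf: a word, not a constituent label
            words.append((node, len(labels) - 1))
            return
        label, children = node
        labels.append(label)               # emit the label before its children
        ordered = reversed(children) if right_first else children
        for child in ordered:
            visit(child)

    visit(tree)
    return labels, words


# (S (NP (DT the) (NN cat)) (VP (VBD sat)))
tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]),
              ("VP", [("VBD", ["sat"])])])

left_labels, left_words = linearize(tree)
right_labels, right_words = linearize(tree, right_first=True)
print(left_labels)    # ['S', 'NP', 'DT', 'NN', 'VP', 'VBD']
print(right_labels)   # ['S', 'VP', 'VBD', 'NP', 'NN', 'DT']
```

The two sequences contain the same labels in different orders, so each word is reached with a different syntactic context behind it in each direction, which is consistent with the abstract's use of one GRU per sequence.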
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
