Controllable Neural Prosody Synthesis
Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

TL;DR
This paper introduces a neural prosody generator that gives users control over speech prosody, so they can correct prosody errors and generate prosody with diverse emotions and excitement levels while maintaining naturalness.
Contribution
The paper presents a user-controllable, context-aware neural prosody generator, together with a pitch-shifting neural vocoder that modifies input speech to match the generated prosody.
Findings
Successful incorporation of user control without losing naturalness
Effective correction of prosody errors in synthesized speech
Enhanced diversity in speaker emotions and excitement levels
Abstract
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user…
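The idea of constraining some time frames and generating the rest can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_constraint_inputs` and `fill_unconstrained` are hypothetical, prosody is reduced to one pitch value in Hz per frame, and plain linear interpolation stands in for the neural generator, which in the paper also conditions on the input text and contextual prosody.

```python
from typing import List, Optional, Sequence, Tuple


def build_constraint_inputs(
    user_f0: Sequence[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Split a partially specified pitch contour into values and a 0/1 mask.

    A frame set to None is unconstrained; its value slot is filled with 0.0
    and its mask entry is 0 so a downstream model can ignore it.
    """
    values = [f if f is not None else 0.0 for f in user_f0]
    mask = [1 if f is not None else 0 for f in user_f0]
    return values, mask


def fill_unconstrained(values: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Stand-in for the generator: interpolate between constrained frames.

    Constrained frames are copied through unchanged. Frames before the first
    or after the last constraint take the nearest constrained value.
    """
    known = [i for i, m in enumerate(mask) if m]
    if not known:
        raise ValueError("at least one constrained frame is required")
    out = list(values)
    for i in range(len(out)):
        if mask[i]:
            continue  # user constraint, keep as given
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            t = (i - left) / (right - left)
            out[i] = values[left] + t * (values[right] - values[left])
    return out


# Usage: the user pins frames 1 and 4; the other frames are filled in.
values, mask = build_constraint_inputs([None, 200.0, None, None, 230.0, None])
contour = fill_unconstrained(values, mask)
```

The mask is kept separate from the values so that an unconstrained frame is never confused with a frame constrained to 0 Hz; a learned generator would typically receive both sequences as input.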
