On incorporating social speaker characteristics in synthetic speech
Sai Sirisha Rallabandi, Sebastian Möller

TL;DR
This paper explores how conditioning speech synthesis on specific vocal features can raise perceived warmth and competence, and reports that convex combinations of features receive higher listener ratings than individual features.
Contribution
It integrates previously derived vocal features into Tacotron-based synthesis through a feature quantization approach, to better emulate social speaker characteristics.
Findings
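Convex combinations of features yield higher warmth and competence Mean Opinion Scores than individual features.
Spectral flux, F1 mean, and F2 mean were explored for generating higher warmth in female speech; voiced slope and spectral flux were explored for higher competence.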
Listening tests confirm improved perception with combined features.
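To make the term "convex combination" concrete, here is a minimal sketch of how several acoustic features could be blended into one control value. It is an illustration under assumptions, not the authors' implementation: the function name `convex_combination`, the choice to work on already normalized feature values, and the example weights are all hypothetical. The only fixed part is the definition itself: weights are non-negative and sum to 1.

```python
import numpy as np

def convex_combination(features, weights):
    """Blend feature tracks with non-negative weights that sum to 1.

    features: array of shape (n_features, n_utterances), assumed to be
              normalized to a common scale beforehand (an assumption here).
    weights:  array of shape (n_features,).
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if features.shape[0] != weights.shape[0]:
        raise ValueError("need one weight per feature")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # Weighted sum over the feature axis gives one value per utterance.
    return weights @ features

# Hypothetical normalized values for spectral flux, F1 mean, and F2 mean
# across two utterances; the weights are illustrative only.
blended = convex_combination(
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    [0.5, 0.25, 0.25],
)
```

Because the weights form a convex combination, each blended value always stays within the range spanned by the individual feature values for that utterance.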
Abstract
In our previous work, we derived the acoustic features that contribute to the perception of warmth and competence in synthetic speech. In the current work, we extend this by investigating the impact of the derived vocal features on generating the desired characteristics. Spectral flux, F1 mean, F2 mean, and their convex combinations were explored for generating higher warmth in female speech. Voiced slope, spectral flux, and their convex combinations were investigated for generating higher competence in female speech. We employed a feature quantization approach in a traditional end-to-end Tacotron-based speech synthesis model. Listening tests showed that convex combinations of acoustic features achieve higher Mean Opinion Scores for warmth and competence than individual features.
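The abstract names a feature quantization approach without describing it. One plausible reading, sketched below, is that a continuous acoustic feature is mapped to a small set of discrete level IDs that a synthesis model could be conditioned on. Everything specific here is an assumption for illustration: the function name `quantize_feature`, the use of quantile-based bin edges, and the default of three levels do not come from the paper.

```python
import numpy as np

def quantize_feature(values, n_bins=3):
    """Map continuous feature values to integer level IDs in [0, n_bins - 1].

    Bin edges are placed at equally spaced quantiles of the data, so each
    level covers roughly the same number of utterances (an assumed design).
    """
    values = np.asarray(values, dtype=float)
    # Interior quantile edges, e.g. the 1/3 and 2/3 quantiles for 3 bins.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    # np.digitize returns the index of the bin each value falls into.
    return np.digitize(values, edges)

# Hypothetical per-utterance spectral flux values.
levels = quantize_feature([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_bins=3)
```

In such a setup, the level ID (for example low, medium, high) would be supplied as an extra input at synthesis time to steer the generated speech toward the desired feature range.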
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
