How do Voices from Past Speech Synthesis Challenges Compare Today?
Erica Cooper, Junichi Yamagishi

TL;DR
This paper revisits past speech synthesis challenges, conducts a large-scale listening test on combined samples, and analyzes how opinions and quality perceptions have evolved over time.
Contribution
It provides a comprehensive comparison of past speech synthesis systems and insights into how perceptions of quality change across different challenges and speakers.
Findings
Strong correlation between original and new test results at system level
Speaker choice significantly impacts synthesis quality
Historical challenge data is valuable for ongoing research
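A "system-level" correlation is typically computed by averaging every listener rating a system received in each test and then correlating the two sets of per-system means. The sketch below illustrates that idea only. The choice of Pearson correlation, the function names, and the toy ratings are all assumptions made for illustration; they are not taken from the paper or its data.

```python
from collections import defaultdict
from math import sqrt


def system_means(ratings):
    """Average individual ratings per system.

    ratings: iterable of (system_id, score) pairs.
    Returns a dict mapping system_id -> mean score.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for system_id, score in ratings:
        totals[system_id][0] += score
        totals[system_id][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_correlation(original, new):
    """Correlate per-system mean ratings over systems present in both tests."""
    orig_means = system_means(original)
    new_means = system_means(new)
    shared = sorted(orig_means.keys() & new_means.keys())
    return pearson([orig_means[s] for s in shared],
                   [new_means[s] for s in shared])


# Toy ratings invented for illustration (hypothetical systems A, B, C).
original_test = [("A", 2), ("A", 2), ("B", 3), ("B", 3), ("C", 4), ("C", 4)]
new_test = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3), ("C", 4)]

r = system_level_correlation(original_test, new_test)
print(f"system-level Pearson r = {r:.3f}")
```

In this toy case the new test rates every system half a point lower, so the absolute scores differ while the system ranking, and therefore the correlation, is unchanged. A rank-based measure such as Spearman correlation could be substituted if only the ordering of systems matters.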
Abstract
Shared challenges provide a venue for comparing systems trained on common data using a standardized evaluation, and they also provide an invaluable resource for researchers when the data and evaluation results are publicly released. The Blizzard Challenge and Voice Conversion Challenge are two such challenges for text-to-speech synthesis and for speaker conversion, respectively, and their publicly-available system samples and listening test results comprise a historical record of state-of-the-art synthesis methods over the years. In this paper, we revisit these past challenges and conduct a large-scale listening test with samples from many challenges combined. Our aims are to analyze and compare opinions of a large number of systems together, to determine whether and how opinions change over time, and to collect a large-scale dataset of a diverse variety of synthetic samples and their…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling
