Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
Huanchen Cai, Sten Ternström

TL;DR
This paper introduces a metric-based voice mapping framework for evaluating the voice quality of text-to-speech (TTS) systems. It uses voice range, spectrum balance, and cepstral peak prominence (CPPs) to assess naturalness and expressiveness.
Contribution
The paper proposes an evaluation approach that uses crest factor, spectrum balance, and CPPs to analyze and compare the vocal capabilities of six TTS models.
Findings
VITS has the largest voice range among tested models.
Glow-TTS performs best in soft phonation, as indicated by its higher spectrum balance, despite a limited voice range.
CPPs values between 7 and 8 dB indicate natural voice quality; above 10 dB, speech tends to sound robotic.
Abstract
This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. It analyzes six influential TTS models, both historical and recent: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The metrics are crest factor, spectrum balance, and cepstral peak prominence (CPPs). The results show that voice range serves as a primary indicator of model capability, with VITS showing the largest range among the tested models. Glow-TTS exhibited superior performance in soft phonation, indicated by higher spectrum balance, despite its limited voice range. CPPs values between 7 and 8 dB indicate natural voice quality, while speech with CPPs exceeding 10 dB tends to sound robotic. These findings underscore the need for voice mapping to evaluate vocal effort, and…
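Of the three metrics named in the abstract, crest factor has the simplest textbook definition: the ratio of a signal's peak amplitude to its RMS level, usually expressed in dB. The sketch below illustrates that standard definition only. The function name `crest_factor_db` and the test signals are hypothetical, and the paper's actual framing and windowing choices are not given in this summary, so this should not be read as the authors' implementation.

```python
import math


def crest_factor_db(samples):
    """Crest factor of one frame in dB: 20 * log10(peak / RMS).

    Assumption: `samples` is a plain sequence of floats for a single
    analysis frame; how frames are chosen is not specified here.
    """
    if not samples:
        raise ValueError("empty frame")
    peak = max(abs(x) for x in samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("silent frame: crest factor is undefined")
    return 20.0 * math.log10(peak / rms)


# Two synthetic reference signals (illustrative only, not TTS output):
# one full period of a sine wave, and a square wave of the same length.
sine = [math.sin(2 * math.pi * n / 1000) for n in range(1000)]
square = [1.0 if n < 500 else -1.0 for n in range(1000)]

print(round(crest_factor_db(sine), 2))
print(round(crest_factor_db(square), 2))
```

A pure sine has a crest factor of about 3 dB and a square wave has 0 dB, which gives a quick sanity check for any implementation.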
