Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

Elijah Gutierrez; Pilar Oplustil-Gallegos; Catherine Lai

arXiv:2107.02527 · eess.AS · July 7, 2021 · 1 citation

TL;DR

This paper introduces a novel evaluation method for text-to-speech systems using the Rapid Prosody Transcription paradigm, enabling detailed error localization and analysis of prosodic features beyond traditional MOS scores.

Contribution

The paper proposes a real-time error-marking approach to TTS evaluation that shows where prosodic errors occur in an utterance and how they relate to overall system performance.

Findings

01

Error marks cluster around prosodic boundaries in audiobook samples.

02

The method correlates with MOS-based rankings but offers more detailed error localization.

03

Differences in prosody generation are observed across TTS systems in question-answer stimuli.

Abstract

Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives…
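
The "probabilistic representation" mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as is common in Rapid Prosody Transcription work, that each word's score is the proportion of listeners who marked it; the abstract does not spell out the exact computation, so this is an assumption. The function name `error_p_scores`, the input format, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def error_p_scores(words, listener_marks):
    """Return, per word, the proportion of listeners who marked it as an error.

    words: list of tokens in one synthesized utterance.
    listener_marks: one collection of marked word indices per listener
        (hypothetical input format, assumed for this sketch).
    """
    if not listener_marks:
        raise ValueError("need at least one listener")
    counts = Counter()
    for marks in listener_marks:
        # Each listener contributes at most one mark per word.
        counts.update(set(marks))
    n_listeners = len(listener_marks)
    return [counts[i] / n_listeners for i in range(len(words))]


# Illustrative data only: four listeners marking a five-word utterance.
words = ["did", "you", "see", "the", "dog"]
marks = [{2}, {2, 4}, set(), {4}]
for word, score in zip(words, error_p_scores(words, marks)):
    print(f"{word}\t{score:.2f}")
```

Under this assumption, words with high scores are locations where many listeners agreed something sounded wrong, which is the kind of localization a single utterance-level MOS rating cannot provide.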

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems