Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Rob Clark; Hanna Silen; Tom Kenter; Ralph Leith

arXiv:1909.03965·eess.AS·September 10, 2019·1 cites

TL;DR

This paper compares three methods for evaluating the naturalness of long-form text-to-speech systems, finding that sentence-level evaluation alone is not always appropriate and that multiple methods are necessary for accurate assessment.

Contribution

It introduces and compares three evaluation approaches for long-form speech synthesis, highlighting their differences and the need for multiple assessments.

Findings

01 Evaluation outcomes differ across methods
02 Results do not always correlate between methods
03 Multiple evaluation approaches are necessary for accurate assessment

Abstract

Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech. We find that, even though these three evaluations are based upon the same material, the outcomes differ per setting, and moreover that these outcomes do not necessarily correlate with each other. We show that our findings are consistent between a single speaker setting of read…

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques