An ASR Guided Speech Intelligibility Measure for TTS Model Selection

Arun Baby; Saranya Vinnaitherthan; Nagaraj Adiga; Pranav Jawale; Sumukh Badam; Sharath Adavanne; Srikanth Konjeti

arXiv:2006.01463·cs.SD·June 3, 2020·5 cites

Open Access

TL;DR

This paper introduces an ASR-guided speech intelligibility metric, based on phone error rate (PER), for selecting the best TTS model. Models chosen this way show higher human-perceived intelligibility than models chosen by traditional training metrics, especially across different genres.

Contribution

The paper proposes a novel PER-based objective metric for TTS model selection, validated through subjective studies and applicable across languages and genres.

Findings

01 PER correlates better with human perception than training loss

02 Models selected with lowest PER show higher intelligibility

03 Method is effective across different genres and languages

Abstract

The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility. The PER is computed between the input text to the TTS model, and the text decoded from the synthesized speech using an automatic speech recognition (ASR) model, which is trained on the same data as the TTS model. With the help of subjective studies, we show that the TTS model chosen with the least PER on validation split has significantly higher speech intelligibility compared to the model with the least training-objective metric loss. Finally, using the proposed PER and subjective…
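The PER-based selection described in the abstract can be illustrated with a minimal sketch. It assumes that both the input text and the ASR-decoded text have already been converted to phone sequences; the ASR decoding step itself is not shown. The function names (`edit_distance`, `phone_error_rate`, `select_checkpoint`), the corpus-level aggregation, and the checkpoint labels are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between a reference and a hypothesis phone sequence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total phone errors divided by total reference phones.

    Aggregating over the whole validation set is an assumption of this sketch.
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


def select_checkpoint(per_by_checkpoint: Dict[str, float]) -> str:
    """Return the (hypothetical) checkpoint label with the lowest validation PER."""
    return min(per_by_checkpoint, key=per_by_checkpoint.get)


if __name__ == "__main__":
    refs = [["k", "ae", "t"], ["d", "ao", "g"]]
    hyps = [["k", "ah", "t"], ["d", "ao", "g"]]
    print(phone_error_rate(refs, hyps))
    print(select_checkpoint({"step_10k": 0.21, "step_20k": 0.12}))
```

In practice, `phone_error_rate` would be evaluated once per saved TTS checkpoint on the validation split, and the checkpoint with the smallest value would be kept.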

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing