An ASR Guided Speech Intelligibility Measure for TTS Model Selection
Arun Baby, Saranya Vinnaitherthan, Nagaraj Adiga, Pranav Jawale, Sumukh Badam, Sharath Adavanne, Srikanth Konjeti

TL;DR
This paper introduces an ASR-guided speech intelligibility metric based on phone error rate (PER) for selecting the best TTS model. Subjective studies show that models selected by lowest PER are more intelligible to listeners than models selected by a training-objective loss, and the approach is effective across different genres.
Contribution
The paper proposes a novel PER-based objective metric for TTS model selection, validated through subjective studies and applicable across languages and genres.
Findings
PER correlates better with human perception than training loss does
Models selected with the lowest PER show higher intelligibility
Method is effective across different genres and languages
Abstract
The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility. The PER is computed between the input text to the TTS model, and the text decoded from the synthesized speech using an automatic speech recognition (ASR) model, which is trained on the same data as the TTS model. With the help of subjective studies, we show that the TTS model chosen with the least PER on validation split has significantly higher speech intelligibility compared to the model with the least training-objective metric loss. Finally, using the proposed PER and subjective…
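The selection procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the input text and the ASR-decoded text have already been converted to phone sequences, that PER is aggregated over the whole validation split as total edit distance divided by total reference phones, and that each candidate TTS model is identified by a hypothetical name such as `ckpt_a`. The function names `edit_distance`, `phone_error_rate`, and `select_model` are illustrative.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between a reference and a hypothesis phone sequence."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """PER over a set of utterances: total phone errors / total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference phone sequences are empty")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phones


def select_model(per_by_model: Dict[str, float]) -> str:
    """Return the name of the candidate model with the lowest validation PER."""
    return min(per_by_model, key=per_by_model.get)
```

In use, `refs` would hold the phone sequences of the validation texts fed to the TTS model, and `hyps` the phone sequences that the ASR model decodes from the corresponding synthesized audio. Computing `phone_error_rate` once per candidate model and passing the results to `select_model` yields the model with the lowest PER.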
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
