AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A. Saurous, D. Sculley

TL;DR
AutoMOS is a deep learning model that predicts the naturalness of synthesized speech directly from raw waveforms, offering a non-intrusive, automated alternative to human ratings with high correlation to human judgments.
Contribution
This paper introduces AutoMOS, a novel deep recurrent neural network that estimates speech naturalness scores from raw audio, reducing reliance on human raters.
Findings
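At the utterance level, the best AutoMOS models estimate MOS only moderately worse than sampled human ratings, as measured by Pearson and Spearman correlations.
When scores are averaged over multiple utterances, AutoMOS achieves correlations approaching those of human raters.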
AutoMOS enables efficient exploration of speech synthesizer parameters without human input.
Abstract
Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
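The two evaluation settings mentioned in the abstract (utterance-level correlation, and correlation after averaging over multiple utterances) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the scores are invented, grouping utterances by synthesizer system is one plausible way to average, and the helper names (`pearson`, `spearman`, `mean_by_group`) are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_by_group(groups, scores):
    """Average the scores within each group label."""
    buckets = defaultdict(list)
    for group, score in zip(groups, scores):
        buckets[group].append(score)
    return {group: sum(v) / len(v) for group, v in buckets.items()}


# Invented example data: three hypothetical synthesizers, three utterances each.
systems = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
human_mos = [4.2, 3.8, 4.5, 3.1, 3.6, 2.9, 2.0, 2.4, 1.9]
predicted_mos = [4.0, 4.1, 4.2, 3.3, 3.0, 3.2, 2.3, 2.1, 2.2]

# Utterance-level agreement between predicted and human scores.
utterance_r = pearson(human_mos, predicted_mos)
utterance_rho = spearman(human_mos, predicted_mos)

# Agreement after averaging the utterances of each system.
human_means = mean_by_group(systems, human_mos)
predicted_means = mean_by_group(systems, predicted_mos)
labels = sorted(human_means)
system_r = pearson([human_means[s] for s in labels],
                   [predicted_means[s] for s in labels])

print(f"utterance-level Pearson:  {utterance_r:.3f}")
print(f"utterance-level Spearman: {utterance_rho:.3f}")
print(f"averaged Pearson:         {system_r:.3f}")
```

In this toy data the averaged correlation comes out higher than the utterance-level one because per-utterance noise partly cancels within each group. That is the general intuition behind averaging, not a reproduction of the paper's results.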
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech Recognition and Synthesis
