Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Praveen Srinivasa Varadhan; Amogh Gulati; Ashwin Sankar; Srija Anand; Anirudh Gupta; Anirudh Mukherjee; Shiva Kumar Marepally; Ankur Bhatia; Saloni Jaju; Suvrat Bhooshan; Mitesh M. Khapra

arXiv:2411.12719 · cs.CL · May 28, 2025

Open Access · 4 Reviews

TL;DR

This paper critically evaluates the MUSHRA test for TTS evaluation, identifies its limitations with modern high-quality systems, and proposes refined variants along with a large dataset to improve and analyze human speech evaluation.

Contribution

It introduces two improved MUSHRA variants addressing bias and ambiguity, and releases MANGO, a large Indian language TTS evaluation dataset.

Findings

01

Refined MUSHRA variants improve reliability and granularity of TTS evaluation.

02

Identified bias and ambiguity issues in traditional MUSHRA testing.

03

Released MANGO dataset with 246,000 ratings for Indian languages.

Abstract

Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii)…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

This paper addresses two important issues in subjective evaluation, which is an essential part of developing speech synthesis systems. The authors have created a large-scale dataset called MANGO, which I expect could be utilized in several applications in speech quality assessment beyond the analysis presented in this paper. They study existing MUSHRA scores from various perspectives, including reliability, sensitivity, and rejection mechanisms. The proposed approaches effectively address these

Weaknesses

The MUSHRA-DG variant would significantly increase the time cost and difficulty of scoring for human subjects. Allowing subjects to adjust scores if they feel the final MUSHRA score does not match their expectations seems questionable to me, as it may encourage them to revise sub-scores only to fit the final outcome instead of focusing on the fine-grained scores they should judge fairly. Additionally, the combination of the two proposed approaches does not appear to be as effective as the indivi

Reviewer 02 · Rating 3 · Confidence 5

Strengths

It is important to discuss the evaluation of TTS systems. It is an open topic and, given the subjective nature of these evaluations, it is useful to understand the shortcomings of the methodology used and hopefully agree as a community on improved guidelines. Some of the claims made in the paper make sense, like the need for clear guidelines to help with ambiguity, the importance of a sufficient number of listeners, or that a hidden reference is a better choice. I also think that in the TTS

Weaknesses

The biggest drawback of the paper for me is the lack of technical details on how the compared TTS systems are trained and on the data used. Concerning the systems, I understand that an open-source version of each one is used, but did you make any changes on top of them? Did you use them as is and run inference? Did you do fine-tuning on your data? Similarly, there is a lack of details on the data used. If you did some kind of fine-tuning, what data did you use? What about inference? Was it done o

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper analyzes the advantages and disadvantages of the MUSHRA test in detail.

Weaknesses

The paper uses nearly all of its space to rethink the MUSHRA test. However, the proposed methods seem too limited and hard to reproduce.

Reviewer 04 · Rating 6 · Confidence 3

Strengths

- This paper highlights a significant issue with the foundational assumption of MUSHRA scores: it presupposes that a real reference sample should consistently receive high scores. However, this may not always be the case, as modern TTS systems can sometimes outperform human references.
- The authors conduct a thorough analysis of MUSHRA test results, providing both qualitative and quantitative evidence to demonstrate the two major shortcomings of the original MUSHRA framework.
- In addition to a

Weaknesses

- While the issues with MUSHRA are language-agnostic, this paper focuses exclusively on TTS systems for Indian languages, which are trained on relatively small datasets. The findings would be more compelling if the authors included results from widely-used datasets of high-resource languages and publicly available pre-trained English (or any other high-resource language) TTS systems with verified quality.
- The nine aspects selected for MUSHRA-DG appear to be arbitrary, as the authors do not pro

Taxonomy

Topics: Hate Speech and Cyberbullying Detection