The State Of TTS: A Case Study with Human Fooling Rates
Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra

TL;DR
This paper introduces the Human Fooling Rate (HFR) metric, which measures how often TTS output is mistaken for human speech. It finds that current models often fall short of human-like naturalness and argues for benchmarking on datasets where human speech itself achieves a high HFR.
Contribution
The paper proposes the Human Fooling Rate metric and provides a large-scale evaluation of open-source and commercial TTS models, offering new insights into how often they pass as human and into current evaluation practices.
Findings
Commercial models approach human deception in zero-shot settings
Open-source systems still struggle with natural conversational speech
High-quality fine-tuning improves realism but doesn't fully close the gap
Abstract
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human speech. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings…
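The abstract defines HFR as how often speech is mistaken for human, and also applies it to human reference speech. A minimal sketch of that computation is shown below. It is an illustration only: the `Judgment` record, the `source` labels, and the simple per-judgment averaging are assumptions made for this example, not the paper's actual protocol, which may aggregate across listeners or clips differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Judgment:
    """One listener verdict on one audio clip (hypothetical record layout)."""
    source: str          # e.g. "human" or a TTS system name (assumed labels)
    judged_human: bool   # True if the listener believed the clip was human


def fooling_rate(judgments: Iterable[Judgment], source: str) -> float:
    """Fraction of judgments on clips from `source` that said 'human'.

    For a TTS system this is the rate at which listeners were fooled.
    For source == "human" it is the rate at which real speech is
    recognised as human, which serves as the reference ceiling.
    """
    verdicts = [j.judged_human for j in judgments if j.source == source]
    if not verdicts:
        raise ValueError(f"no judgments recorded for source {source!r}")
    return sum(verdicts) / len(verdicts)


# Toy data, invented purely to exercise the function.
toy = [
    Judgment("human", True),
    Judgment("human", True),
    Judgment("tts_a", True),
    Judgment("tts_a", False),
    Judgment("tts_a", False),
    Judgment("tts_a", False),
]

print(fooling_rate(toy, "tts_a"))   # 0.25
print(fooling_rate(toy, "human"))   # 1.0
```

Comparing a system's rate against the rate for human clips from the same dataset reflects insight (ii): if human reference speech is itself rarely judged human, a system matching it has cleared a low bar.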
