The State Of TTS: A Case Study with Human Fooling Rates

Praveen Srinivasa Varadhan; Sherry Thomas; Sai Teja M. S.; Suvrat Bhooshan; Mitesh M. Khapra

arXiv:2508.04179 · cs.CL · August 7, 2025

TL;DR

This paper introduces Human Fooling Rate (HFR), a metric that measures how often TTS output is mistaken for human speech. It finds that current models often fall short of human-like naturalness, and argues that progress should be benchmarked on datasets where human speech itself achieves a high HFR.

Contribution

The paper proposes the Human Fooling Rate (HFR) metric and reports a large-scale evaluation of open-source and commercial TTS models, offering insights into how often these systems pass as human and into the limits of current evaluation practices.

Findings

01. Commercial models approach human deception in zero-shot settings

02. Open-source systems still struggle with natural conversational speech

03. High-quality fine-tuning improves realism but doesn't fully close the gap

Abstract

While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings…
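
As a rough illustration of the metric described in the abstract, the sketch below computes a fooling rate as the fraction of clips from a given source that listeners labelled as human. This is an assumed reading of "how often machine-generated speech is mistaken for human"; the paper's exact protocol and any aggregation across listeners are not given here. The `Judgment` record, the `fooling_rate` function, the source names, and the toy data are all hypothetical and invented for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One listener decision about one clip (hypothetical record layout)."""
    source: str         # hypothetical label: a system name, or "human" for real recordings
    judged_human: bool  # True if the listener labelled the clip as human speech


def fooling_rate(judgments, source):
    """Fraction of judgments on clips from `source` that said "human".

    Assumption: a simple proportion over all judgments, with no
    per-listener or per-clip weighting.
    """
    relevant = [j for j in judgments if j.source == source]
    if not relevant:
        raise ValueError(f"no judgments for source {source!r}")
    return sum(j.judged_human for j in relevant) / len(relevant)


# Toy, made-up judgments (not data from the paper).
judgments = [
    Judgment("tts_a", True),
    Judgment("tts_a", False),
    Judgment("tts_a", False),
    Judgment("tts_a", False),
    Judgment("human", True),
    Judgment("human", True),
    Judgment("human", True),
    Judgment("human", False),
]

tts_rate = fooling_rate(judgments, "tts_a")
human_rate = fooling_rate(judgments, "human")
print(f"tts_a HFR: {tts_rate:.2f}, human reference HFR: {human_rate:.2f}")
```

Computing the same rate for the human reference recordings ties in with point (ii) of the abstract: if human speech in a dataset is itself rarely judged human, matching it sets a low bar for a TTS system.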
