Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese
Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang

TL;DR
This paper introduces the Audio Turing Test (ATT), a multi-dimensional evaluation framework for Chinese TTS systems that reduces subjectivity and improves robustness by focusing on human-likeness judgments, supported by a new dataset and automatic evaluation method.
Contribution
It presents the ATT framework, a new multi-dimensional Chinese corpus, and an automatic evaluation method, Auto-ATT, to better assess human-likeness of TTS systems compared to traditional MOS scores.
Findings
ATT effectively differentiates TTS models across multiple dimensions.
Auto-ATT aligns well with human judgments, enabling rapid evaluation.
The dataset and methods improve robustness and interpretability of TTS evaluation.
Abstract
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), which pairs a multi-dimensional Chinese corpus, ATT-Corpus, with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT…
Peer Reviews
Decision · Submitted to ICLR 2026
- ATT attempts to address the critical limitations of MOS / pseudo-MOS by disentangling speech characteristics at the data level and simplifying the evaluation scheme.
- ATT evaluates along several axes, such as numerals, code-switching, paralinguistics, and poetry.
- ATT can clearly distinguish the strengths and weaknesses of different models along each axis, allowing fine-grained insight into TTS performance.
- Auto-ATT is a novel model-as-judge that can be used to automate the application of ATT a
- The ATT corpus is developed using the TTS models the authors intend to evaluate. It is unclear how it and Auto-ATT generalize to unseen systems, so the claimed robustness issue of pseudo-MOS is not addressed.
- ATT cannot distinguish speaker-level characteristics, so evaluation with speaker-similarity MOS or neural embeddings is still required.
* The multidimensional corpus targets common Chinese difficulty factors (polyphony, poetry syntax, code-switching).
* The ternary human protocol with rationales is a simple but meaningful shift away from MOS.
* Trap items are implemented as a good quality-control measure.
* Auto-ATT is a useful direction; training a speech-judge model is under-explored, and the demonstrated correlation to humans is promising.
* The benchmark highlights meaningful gaps between SOTA models and human speech.
1. (Interpretability of Human-Likeness) HLS collapses three distinct cases (Human mistaken as Machine, Machine mistaken as Human, and Unclear) into a single linear score. Without reporting how often each category is chosen, it is unclear whether high HLS reflects genuine human-likeness or annotator uncertainty. Excessive “Unclear” selections may artificially inflate scores.
2. (Filtering Bias From Manual Spot Checks) The authors state that samples failing “synthesis success” or “synthesis consi
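The concern in point 1 can be illustrated with a small sketch. The label names and the weights (1.0 for a "human" vote, 0.5 for "unclear", 0.0 for "machine") are assumptions chosen for illustration only; the excerpt above does not give the paper's actual HLS formula, and `summarize` is a hypothetical helper. The sketch shows how two very different annotator behaviours can yield the same collapsed score, which is why reporting per-category rates alongside the score helps.

```python
from collections import Counter

# Hypothetical weights for a ternary judgment. These values are an
# assumption for illustration, not the paper's HLS definition.
WEIGHTS = {"human": 1.0, "unclear": 0.5, "machine": 0.0}


def summarize(labels):
    """Return (collapsed_score, per_category_rates) for a list of votes."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    n = len(labels)
    # Share of votes that fall into each category.
    rates = {name: counts.get(name, 0) / n for name in WEIGHTS}
    # Weighted average of the category rates: the single linear score.
    score = sum(WEIGHTS[name] * rates[name] for name in WEIGHTS)
    return score, rates


# Annotators who are confident but split evenly between the two extremes.
confident = ["human"] * 5 + ["machine"] * 5
# Annotators who cannot decide at all.
uncertain = ["unclear"] * 10

score_confident, rates_confident = summarize(confident)
score_uncertain, rates_uncertain = summarize(uncertain)

print(score_confident, rates_confident)
print(score_uncertain, rates_uncertain)
```

Under these assumed weights both vote sets collapse to the same score, while the "unclear" rate separates them (0.0 versus 1.0). Reporting the rates next to the score is one simple way to address the reviewer's request.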
- Targeted Solution to Critical Gaps: Addresses MOS’s limitations (subjectivity, low interpretability) and the lack of multi-dimensional, Chinese-specific TTS evaluation datasets, filling a key niche in LLM-driven TTS assessment.
- Comprehensive Framework Design: Combines a well-constructed corpus (semi-automated generation + expert validation), rigorous human evaluation (trap items, consistency checks), and an efficient automatic tool, enabling both qualitative and quantitative analysis.
- Robust Exp
- Language and Scenario Limitation: The framework is designed exclusively for Chinese, limiting generalizability to other languages with distinct linguistic features (e.g., tonal vs. non-tonal languages).
- Narrow Trap Item Diversity: While trap items monitor attention, the paper only mentions "deliberately flawed synthetic clips" and "genuine human recordings"; more diverse trap types (e.g., edge-case linguistic structures) could strengthen robustness.
- Auto-ATT Training Data Opacity: The paper refer
Taxonomy
Topics: Speech Recognition and Synthesis · AI in Service Interactions · Face recognition and analysis
