Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
Xiang Li, Jiabao Gao, Sipei Lin, Xuan Zhou, Chi Zhang, Bo Cheng, Jiale Han, Benyou Wang

TL;DR
This paper conducts the first Turing test for speech-to-speech systems, finds that no evaluated system passes as human, and introduces a fine-grained evaluation framework to diagnose and improve human-likeness in conversational AI.
Contribution
It introduces a comprehensive human-likeness evaluation for S2S systems, including a taxonomy of 18 dimensions and an interpretable discrimination model.
Findings
No current S2S system passes the Turing test.
Paralinguistic features and emotional expressivity are key bottlenecks.
Off-the-shelf AI models are unreliable as judges.
Abstract
The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably…
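As a rough illustration of how Turing-test judgments like those described in the abstract might be summarized, the sketch below computes, for each speaker source, the share of judgments that labeled the speaker "human". The record format, the source names, the toy data, and the pass criterion (a system "passes" if its judged-human rate comes within a tolerance of the rate for real humans) are all assumptions made for illustration. This summary does not state the paper's actual criterion, and none of the numbers below are results from the paper.

```python
from collections import defaultdict


def judged_human_rate(judgments):
    """Return {source: fraction of judgments that said 'human'}.

    `judgments` is an iterable of (source, verdict) pairs, where `verdict`
    is either "human" or "machine". This record format is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [human_votes, total_votes]
    for source, verdict in judgments:
        counts[source][1] += 1
        if verdict == "human":
            counts[source][0] += 1
    return {source: human / total for source, (human, total) in counts.items()}


def passes(rates, system, baseline="human", tolerance=0.05):
    """Assumed criterion: the system's rate is within `tolerance` of the baseline."""
    return rates[system] >= rates[baseline] - tolerance


# Toy data, invented for this example only (not taken from the paper).
toy_judgments = [
    ("human", "human"), ("human", "human"),
    ("human", "machine"), ("human", "human"),
    ("system_a", "machine"), ("system_a", "human"),
    ("system_a", "machine"), ("system_a", "machine"),
]

rates = judged_human_rate(toy_judgments)
print(rates)
print("system_a passes:", passes(rates, "system_a"))
```

With real data, a significance test on the two rates would be more appropriate than a fixed tolerance; the threshold here only keeps the sketch short.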
Peer Reviews
Decision: ICLR 2026 Poster
- Clean experimental setup that controls for multiple confounds and tests a diverse suite of S2S models, including TTS models using LLM-generated text. Experiments demonstrate key gaps.
- Rich analysis of why models fail the test through a multifaceted analysis of conversation qualities. This analysis involved collecting human perceptions using crowdsourcing.
- Experiments to see if other audio models could pass the test, with key gaps in existing models. Proposes new model and design to ge
- The biggest gap to me was in the lack of details around the annotation for conversation qualities. These are barely mentioned in text, so I was expecting to see a much more detailed report in B.5. However, important questions are hard to answer, such as who annotated (which platform?), how many annotators were there, did annotators agree on these qualities, how much were annotators paid, or what quality controls were present, if any. Given the importance of this data for your results and later
1. Focused problem and clear protocol. The work targets the central question for speech‑to‑speech (S2S): Do these systems actually sound human in multi‑turn dialogue? Instead of testing isolated sub‑skills, the study frames evaluation as a Turing‑style decision under realistic interaction. The tri‑part setup, human–human, human–machine, and a TTS‑based pseudo‑human control, gives a clean yardstick for what “human‑like” means. Bilingual coverage and multiple everyday topics reduce overfitting to
1. Application‑heavy, limited theoretical novelty. The main novelty lies in system integration rather than theory. The core claim, that semantics alone cannot sustain effective speech interaction, is treated as an empirical observation, not a theoretical insight. Adding paralinguistic cues (prosody, affect, persona) targets known gaps, long discussed in TTS and affective computing. The work validates their importance but does not explain underlying mechanisms or interactions, nor does it offer a
- This paper presents the first formal Turing test for S2S dialogue systems, extending evaluation beyond text to spoken interaction, which is an impactful direction given recent advances in conversational AI.
- The paper convincingly shows that the bottleneck of S2S dialogue systems is no longer semantic understanding but rather paralinguistic and emotional expressivity, which is an under-explored dimension in S2S research, offering valuable insights for improving S2S design.
- The interpretable
- The gamified Turing test platform may attract casual participants who do not conduct the human-machine discrimination carefully. This paper does not clearly describe participants’ quality-control mechanisms such as attention checks, response-time filtering, etc. This could bias the Turing test results.
- It is unclear how many failed Turing test cases arise because LLMs avoid human disfluency cues (or other fixes of human speech deficiencies). This is an easy-to-detect feature but a minor iss
Taxonomy
Topics: AI in Service Interactions · Social Robot Interaction and HRI · Language and cultural evolution
