Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

Xiang Li; Jiabao Gao; Sipei Lin; Xuan Zhou; Chi Zhang; Bo Cheng; Jiale Han; Benyou Wang

arXiv:2602.24080 · cs.AI · March 3, 2026

TL;DR

This paper conducts the first Turing test for speech-to-speech (S2S) systems, finds that none of the evaluated systems passes as human, and introduces a detailed evaluation framework to diagnose and improve human-likeness in conversational AI.

Contribution

It introduces a comprehensive human-likeness evaluation for S2S systems, including a taxonomy of 18 dimensions and an interpretable discrimination model.

Findings

01. No current S2S system passes the Turing test.
02. Paralinguistic features and emotional expressivity are key bottlenecks.
03. Off-the-shelf AI models are unreliable as judges.

Abstract

The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably…
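As a rough illustration of how a pass/fail decision could be derived from judgment counts like those described above, the following sketch computes a Wilson score interval for the share of a system's dialogues that judges labeled "human" and compares its upper bound with a human baseline rate. The pass criterion, the function names `wilson_interval` and `passes_turing_test`, and all counts are assumptions made for illustration; the abstract does not state the paper's actual criterion or per-system numbers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% for z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def passes_turing_test(judged_human: int, n_judgments: int, human_baseline: float) -> bool:
    """Hypothetical criterion (an assumption, not the paper's definition):
    a system passes if the upper bound of its 'judged human' rate
    reaches the rate at which real human speakers are judged human."""
    _, upper = wilson_interval(judged_human, n_judgments)
    return upper >= human_baseline


# Made-up counts for illustration only; these are not results from the paper.
low, high = wilson_interval(60, 300)
print(f"judged-human rate interval: [{low:.3f}, {high:.3f}]")
print("passes:", passes_turing_test(60, 300, human_baseline=0.75))
```

Using an interval rather than a raw percentage keeps a system with few judgments from passing or failing on noise alone; other criteria, such as testing judge accuracy against chance, would be equally defensible.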

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

- Clean experimental setup that controls for multiple confounds and tests a diverse suite of S2S models, including TTS models using LLM-generated text. Experiments demonstrate key gaps.
- Rich analysis of why models fail the test through a multifaceted analysis of the conversation qualities. This analysis involved collecting human perceptions using crowdsourcing.
- Experiments to see if other audio models could pass the test, with key gaps in existing models. Proposes new model and design to ge

Weaknesses

- The biggest gap to me was in the lack of details around the annotation for conversation qualities. These are barely mentioned in text, so I was expecting to see a much more detailed report in B.5. However, important questions are hard to answer, such as who annotated (which platform?), how many annotators were there, did annotators agree on these qualities, how much were annotators paid, or what quality controls were present, if any. Given the importance of this data for your results and later

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Focused problem and clear protocol. The work targets the central question for speech‑to‑speech (S2S): Do these systems actually sound human in multi‑turn dialogue? Instead of testing isolated sub‑skills, the study frames evaluation as a Turing‑style decision under realistic interaction. The tri‑part setup, human–human, human–machine, and a TTS‑based pseudo‑human control, gives a clean yardstick for what “human‑like” means. Bilingual coverage and multiple everyday topics reduce overfitting to

Weaknesses

1. Application‑heavy, limited theoretical novelty. The main novelty lies in system integration rather than theory. The core claim, that semantics alone cannot sustain effective speech interaction, is treated as an empirical observation, not a theoretical insight. Adding paralinguistic cues (prosody, affect, persona) targets known gaps, long discussed in TTS and affective computing. The work validates their importance but does not explain underlying mechanisms or interactions, nor does it offer a

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- This paper presents the first formal Turing test for S2S dialogue systems, extending evaluation beyond text to spoken interaction, which is an impactful direction given recent advances in conversational AI.
- The paper convincingly shows that the bottleneck of S2S dialogue systems is no longer semantic understanding but rather paralinguistic and emotional expressivity, which is an under-explored dimension in S2S research, offering valuable insights for improving S2S design.
- The interpretable

Weaknesses

- The gamified Turing test platform may attract casual participants who do not perform the human-machine discrimination carefully. The paper does not clearly describe quality-control mechanisms for participants, such as attention checks or response-time filtering. This could bias the Turing test results.
- It is unclear how many failed Turing test cases arise because LLMs avoid human disfluency cues (or otherwise fix deficiencies of human speech). This is an easy-to-detect feature but a minor iss

Taxonomy

Topics: AI in Service Interactions · Social Robot Interaction and HRI · Language and cultural evolution