Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI
Alberto Messina

TL;DR
This paper introduces the dual Turing test framework, combining adversarial classification, quality constraints, and reinforcement learning to detect and mitigate undetectable AI outputs.
Contribution
It formalizes the dual Turing test as a minimax game and integrates it into an RL alignment pipeline with explicit quality and undetectability measures.
Findings
A formal dual Turing test framework with worst-case guarantees
Integration of an undetectability detector into an RL alignment pipeline
Enhanced detection of stealthy AI outputs
Abstract
In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the "dual Turing test", in which a human judge's goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality-related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge's task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and…
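To make the idea of a worst-case guarantee under a quality constraint concrete, here is a minimal sketch in Python. It is not the paper's formulation: the finite strategy sets, the table `detect_prob`, the scores `quality`, the threshold `tau`, and both function names are hypothetical, and the game is restricted to pure strategies. The "at least one detection in N rounds" aggregation is likewise only one possible way to combine independent rounds, assumed here for illustration.

```python
from typing import List, Tuple


def worst_case_detection(
    detect_prob: List[List[float]],  # hypothetical: detect_prob[j][g] = P(judge j flags generator g)
    quality: List[float],            # hypothetical: quality score of each generator
    tau: float,                      # hypothetical: minimum acceptable quality
) -> Tuple[int, float]:
    """Return (best judge index, guaranteed detection probability).

    Pure-strategy maximin: the judge maximizes the detection probability
    against the hardest-to-detect generator that still meets the quality
    threshold.
    """
    admissible = [g for g, q in enumerate(quality) if q >= tau]
    if not admissible:
        raise ValueError("no generator meets the quality threshold")
    # For each judge, the adversary picks the admissible generator
    # with the lowest detection probability.
    guarantees = [min(row[g] for g in admissible) for row in detect_prob]
    best_judge = max(range(len(guarantees)), key=guarantees.__getitem__)
    return best_judge, guarantees[best_judge]


def detect_over_rounds(p: float, n: int) -> float:
    """Probability of at least one detection in n independent rounds
    (an assumed aggregation rule, not taken from the paper)."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    detect_prob = [[0.9, 0.2, 0.6], [0.7, 0.5, 0.55]]
    quality = [0.8, 0.4, 0.9]
    print(worst_case_detection(detect_prob, quality, tau=0.7))  # (0, 0.6)
    print(worst_case_detection(detect_prob, quality, tau=0.0))  # (1, 0.5)
```

In this toy table, raising the threshold to 0.7 excludes the low-quality generator that was hardest to detect, so the best judge and its guarantee both change. That is the sense in which a quality constraint shapes the worst-case bound.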
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Computability, Logic, AI Algorithms · Ethics and Social Impacts of AI
