Are You Human? An Adversarial Benchmark to Expose LLMs
Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky

TL;DR
This paper introduces an open-source benchmark dataset and evaluation framework for detecting LLMs in real time. It shows that explicit challenges are highly effective and exposes the potential misuse of AI tools by humans.
Contribution
It presents a novel benchmark with explicit and implicit challenges for real-time LLM detection, validated through extensive evaluation and user studies.
Findings
Explicit challenges detect LLMs with a 78.4% success rate.
Implicit challenges detect LLMs with a 22.9% success rate.
Many users unknowingly used LLMs for tasks, highlighting the benchmark's potential for detecting such misuse.
Abstract
Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing with an LLM. We evaluate text-based prompts designed as challenges to expose LLM imposters in real time. To this end, we compile and release an open-source benchmark dataset that includes 'implicit challenges' that exploit an LLM's instruction-following mechanism to cause role deviation, and 'explicit challenges' that test an LLM's ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of…
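The headline numbers are detection success rates: the share of challenge attempts in which a model's reply exposed it as an LLM. The sketch below shows one way such a rate could be tallied per challenge type. The `Trial` record, its field names, and the toy outcomes are hypothetical illustrations, not the paper's data format or results, and the rates are pooled over all trials, which may differ from how the paper aggregates across its 9 models.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One (model, challenge) attempt. Hypothetical record, not the benchmark's schema."""
    model: str
    challenge_type: str  # "explicit" or "implicit"
    exposed: bool        # True if the reply revealed the responder as an LLM


def detection_rates(trials):
    """Return, per challenge type, the fraction of trials in which the LLM was exposed.

    Rates are pooled over all trials of a type (not averaged per model).
    """
    exposed = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t.challenge_type] += 1
        exposed[t.challenge_type] += int(t.exposed)
    return {kind: exposed[kind] / total[kind] for kind in total}


# Toy, made-up outcomes purely for illustration (not results from the paper).
trials = [
    Trial("model-a", "explicit", True),
    Trial("model-a", "explicit", True),
    Trial("model-a", "implicit", False),
    Trial("model-b", "explicit", False),
    Trial("model-b", "implicit", True),
    Trial("model-b", "implicit", False),
]

if __name__ == "__main__":
    for kind, rate in detection_rates(trials).items():
        print(f"{kind}: {rate:.1%}")
```

Pooling weights each trial equally. Averaging per-model rates instead would weight each model equally, which matters when models receive different numbers of challenges.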
Taxonomy
Topics: Law, AI, and Intellectual Property · Adversarial Robustness in Machine Learning
