Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection
Marcus Graves

TL;DR
This paper presents Reverse CAPTCHA, a framework to evaluate if large language models can follow invisible Unicode instructions embedded in text, revealing vulnerabilities to prompt injection attacks.
Contribution
It introduces a novel evaluation method for testing LLM susceptibility to invisible Unicode payloads and analyzes model behaviors across different encoding schemes and settings.
Findings
Tool use increases compliance significantly.
Models show provider-specific encoding preferences.
Explicit instructions greatly boost payload decoding success.
Abstract
We introduce Reverse CAPTCHA, an evaluation framework that tests whether large language models follow invisible Unicode-encoded instructions embedded in otherwise normal-looking text. Unlike traditional CAPTCHAs that distinguish humans from machines, our benchmark exploits a capability gap: models can perceive Unicode control characters that are invisible to human readers. We evaluate five models from two providers across two encoding schemes (zero-width binary and Unicode Tags), four hint levels, two payload framings, and with tool use enabled or disabled. Across 8,308 model outputs, we find that tool use dramatically amplifies compliance (Cohen's h up to 1.37, a large effect), that models exhibit provider-specific encoding preferences (OpenAI models decode zero-width binary; Anthropic models prefer Unicode Tags), and that explicit decoding instructions increase compliance by up to 95…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsUser Authentication and Security Systems · Advanced Malware Detection Techniques · Spam and Phishing Detection
