TL;DR
This paper evaluates how multimodal large language models can break visual CAPTCHA security, analyzes their strengths and weaknesses, and proposes design guidelines to improve CAPTCHA robustness against such AI-driven attacks.
Contribution
It provides a comprehensive evaluation of MLLMs on various CAPTCHA types, analyzes their reasoning mechanisms, and offers practical guidelines for designing more secure CAPTCHAs.
Findings
MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHAs at human-like cost and latency.
Tasks requiring localization or multi-step reasoning remain challenging for current models.
Hardening CAPTCHAs with added localization and counting reduces the MLLM success rate from over 95% to 0%.
Abstract
This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate…
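The abstract distinguishes single-shot accuracy from success under limited retries. As a rough illustration of how the two relate, the sketch below assumes each attempt succeeds independently with the same probability. This simplification is ours and is not taken from the paper, and the function names are hypothetical.

```python
def success_within_retries(p_single: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts tries succeeds.

    Assumption (not from the paper): attempts are independent and each
    succeeds with the same single-shot probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** max_attempts


def expected_attempts(p_single: float, max_attempts: int) -> float:
    """Expected number of attempts spent on one challenge.

    The solver stops at the first success or after max_attempts tries.
    Attempt i+1 is made only if the first i attempts all failed.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return sum((1.0 - p_single) ** i for i in range(max_attempts))


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(success_within_retries(0.5, 3))
    print(expected_attempts(0.5, 3))
```

Multiplying `expected_attempts` by a per-attempt price gives an expected spend per challenge under the same independence assumption. Real retries may be correlated, for example when a model repeats the same mistake on the same image, so measured retry success can be lower than this estimate.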
