COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Junyu Wang; Changjia Zhu; Yuanbo Zhou; Lingyao Li; Xu He; Mingkui Wei; Junjie Xiong

arXiv:2512.02318·cs.CR·May 13, 2026

TL;DR

This paper evaluates how multimodal large language models can break visual CAPTCHA security, analyzes their strengths and weaknesses, and proposes design guidelines to improve CAPTCHA robustness against such AI-driven attacks.

Contribution

It provides a comprehensive evaluation of MLLMs on various CAPTCHA types, analyzes their reasoning mechanisms, and offers practical guidelines for designing more secure CAPTCHAs.

Findings

01 MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHAs at human-like cost.

02 Tasks requiring localization or multi-step reasoning remain challenging for current models.

03 Hardening CAPTCHA with localization and counting reduces success rate from over 95% to 0%.

Abstract

This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate…
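As a rough illustration of how two of the metrics named in the abstract relate, the sketch below derives success under limited retries and the expected number of attempts from a single-shot accuracy. It assumes that attempts are independent and equally likely to succeed, which the abstract does not state. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def success_within_retries(p_single: float, max_attempts: int) -> float:
    """Probability that at least one of `max_attempts` tries succeeds.

    Assumption (not from the paper): attempts are independent and each
    succeeds with the same single-shot probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** max_attempts


def expected_attempts(p_single: float, max_attempts: int) -> float:
    """Expected number of tries when stopping at the first success or at the cap.

    Try number i + 1 is made only if the first i tries all failed, which
    happens with probability (1 - p_single) ** i.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return sum((1.0 - p_single) ** i for i in range(max_attempts))


# Illustrative values only: a hypothetical solver with 50% single-shot accuracy.
p_within_3 = success_within_retries(0.5, 3)
tries_within_3 = expected_attempts(0.5, 3)
# Expected spend per challenge for a hypothetical per-attempt price of 0.01.
spend_per_challenge = 0.01 * tries_within_3
```

Under this independence assumption, even a moderate single-shot accuracy compounds quickly across retries, which is one reason a limit on retries matters when judging how robust a CAPTCHA is.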

Code & Models

Repositories

https://anonymous.4open.science/r/Captcha-465E
