CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang, Gang Wang, Huan Zhang

TL;DR
This paper introduces ReCAP, a native GUI agent capable of solving complex CAPTCHA challenges through automated data generation and self-corrective training, significantly improving success rates while maintaining general GUI performance.
Contribution
We develop a dynamic CAPTCHA system and an automated data pipeline, enabling training of robust, self-correcting GUI agents for CAPTCHA solving and general GUI tasks.
Findings
ReCAP achieves a success rate of up to 80% on CAPTCHA challenges.
ReCAP maintains strong performance on general GUI benchmarks.
Automated data curation enhances training effectiveness.
Abstract
GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. Conversely, specialized CAPTCHA-solving pipelines exist, but they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges while preserving its performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated…
Taxonomy
Topics: User Authentication and Security Systems · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI
