CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training

Yuxi Chen; Haoyu Zhai; Chenkai Wang; Rui Yang; Lingming Zhang; Gang Wang; Huan Zhang

arXiv:2603.23559 · cs.CR · March 26, 2026

Open Access

TL;DR

This paper introduces ReCAP, a native GUI agent capable of solving complex CAPTCHA challenges through automated data generation and self-corrective training, significantly improving success rates while maintaining general GUI performance.

Contribution

We develop a dynamic CAPTCHA system and an automated data pipeline, enabling training of robust, self-correcting GUI agents for CAPTCHA solving and general GUI tasks.

Findings

01. ReCAP achieves up to 80% success rate on CAPTCHA challenges.

02. ReCAP maintains strong performance on general GUI benchmarks.

03. Automated data curation enhances training effectiveness.

Abstract

GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. Conversely, specialized CAPTCHA-solving pipelines exist, but they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges while preserving its performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated…

Taxonomy

Topics: User Authentication and Security Systems · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI