Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen

TL;DR
Open CaptchaWorld is a new web-based benchmark platform designed to evaluate multimodal large language models' visual reasoning and interaction skills through diverse CAPTCHA puzzles, revealing current models' limitations compared to humans.
Contribution
This paper introduces the first comprehensive platform and benchmark for testing multimodal LLM agents on interactive CAPTCHA tasks, along with a new metric, CAPTCHA Reasoning Depth, that counts the cognitive and motor steps each puzzle requires.
Findings
Humans achieve near-perfect CAPTCHA-solving accuracy (93.3%).
State-of-the-art MLLM agents reach a success rate of at most 40.0%.
Open CaptchaWorld exposes significant gaps in current multimodal reasoning capabilities.
Abstract
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve…
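The abstract describes CAPTCHA Reasoning Depth as the number of cognitive and motor steps required to solve a puzzle. The sketch below is one minimal way to tally such a metric alongside a per-type success rate. It is not the authors' annotation procedure: the `Puzzle` record, the step labels, the CAPTCHA type names, and the outcomes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Puzzle:
    """Hypothetical record for one annotated CAPTCHA attempt."""
    captcha_type: str   # illustrative type name, not taken from the benchmark
    steps: list[str]    # annotated cognitive/motor steps (assumed format)
    solved: bool        # whether the agent solved the puzzle


def reasoning_depth(puzzle: Puzzle) -> int:
    """Assumed definition: depth is the count of annotated steps."""
    return len(puzzle.steps)


def summarize(puzzles: list[Puzzle]) -> dict[str, dict[str, float]]:
    """Group puzzles by type and report mean depth and success rate."""
    by_type: dict[str, list[Puzzle]] = {}
    for p in puzzles:
        by_type.setdefault(p.captcha_type, []).append(p)
    return {
        ctype: {
            "mean_depth": mean(reasoning_depth(p) for p in group),
            "success_rate": sum(p.solved for p in group) / len(group),
        }
        for ctype, group in by_type.items()
    }


# Made-up sample data; these are not results from the paper.
puzzles = [
    Puzzle("dice_count",
           ["read prompt", "locate dice", "sum faces", "type answer", "click submit"],
           solved=False),
    Puzzle("dice_count",
           ["read prompt", "sum faces", "type answer"],
           solved=True),
    Puzzle("slider", ["read prompt", "drag slider"], solved=True),
]

summary = summarize(puzzles)
for ctype, stats in summary.items():
    print(ctype, stats)
```

Keeping the depth annotation separate from the solve outcome makes it possible to ask whether success rate falls as the number of required steps grows, which is the kind of question a step-count metric is meant to support.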
Taxonomy
Methods: Umbrella Reinforcement Learning
