CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving
Jianyi Zhang, Ziyin Zhou, Xu Ji, Shizhao Liu, Zhangchi Zhao

TL;DR
This paper introduces CAPTURE, a comprehensive benchmark for evaluating Large Visual Language Models (LVLMs) on various CAPTCHA types, revealing their current limitations in solving CAPTCHA challenges.
Contribution
The paper presents the first dedicated CAPTCHA benchmark for LVLMs, covering diverse types and sub-types, with extensive data and tailored labels for thorough evaluation.
Findings
LVLMs perform poorly on CAPTCHA tasks
CAPTURE covers 4 main types and 25 sub-types from 31 vendors
Benchmark fills gaps in data diversity and labeling
Abstract
Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies customized their benchmarks and datasets according to their own research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce CAPTURE (CAPTCHA for Testing Under Real-world Experiments), the first CAPTCHA benchmark designed specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. This diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · User Authentication and Security Systems
