VIPER Strike: Defeating Visual Reasoning CAPTCHAs via Structured Vision-Language Inference

Minfeng Qi; Dongyang He; Qin Wang; Lefeng Zhang

arXiv:2601.06461 · cs.CR · January 13, 2026

Open Access

TL;DR

This paper introduces ViPer, a unified framework combining visual perception and language reasoning to effectively solve visual reasoning CAPTCHAs, achieving near-human success rates and outperforming prior methods.

Contribution

ViPer integrates structured visual perception with adaptive language reasoning, providing a general, robust attack framework for diverse visual reasoning CAPTCHAs.

Findings

01 ViPer achieves a success rate of up to 93.2% on multiple VRCs.

02 ViPer outperforms prior solvers such as GraphNet, Oedipus, and the Holistic approach.

03 Template-Space Randomization reduces solver effectiveness.

Abstract

Visual Reasoning CAPTCHAs (VRCs) combine visual scenes with natural-language queries that demand compositional inference over objects, attributes, and spatial relations. They are increasingly deployed as a primary defense against automated bots. Existing solvers fall into two paradigms: vision-centric, which rely on template-specific detectors but fail on novel layouts, and reasoning-centric, which leverage LLMs but struggle with fine-grained visual perception. Both lack the generality needed to handle heterogeneous VRC deployments. We present ViPer, a unified attack framework that integrates structured multi-object visual perception with adaptive LLM-based reasoning. ViPer parses visual layouts, grounds attributes to question semantics, and infers target coordinates within a modular pipeline. Evaluated on six major VRC providers (VTT, Geetest, NetEase, Dingxiang, Shumei, Xiaodun),…
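
The abstract describes three stages: parsing the visual layout, grounding attributes to the question, and inferring target coordinates. The toy sketch below illustrates only that data flow and is not the authors' implementation. Every name in it (SceneObject, ground, infer_target, the left_of relation) is hypothetical, the scene is hand-written rather than produced by a detector, and the constraint dictionaries stand in for whatever a language model would extract from the query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SceneObject:
    """One object in a parsed layout (hypothetical stand-in for perception output)."""
    shape: str
    color: str
    center: tuple[int, int]  # (x, y) in image pixels


def ground(objects, constraints):
    """Keep objects whose attributes equal every value in `constraints`."""
    return [o for o in objects
            if all(getattr(o, k) == v for k, v in constraints.items())]


def infer_target(objects, target, left_of=None) -> Optional[tuple[int, int]]:
    """Return the center of the first object matching `target`, or None.

    If `left_of` is given, the object must also lie to the left of at least
    one object matching those anchor constraints (assumed relation semantics).
    """
    candidates = ground(objects, target)
    if left_of is not None:
        anchors = ground(objects, left_of)
        candidates = [c for c in candidates
                      if any(c.center[0] < a.center[0] for a in anchors)]
    return candidates[0].center if candidates else None


# Hand-written scene: a red cube, a blue cube, and a red sphere.
scene = [
    SceneObject("cube", "red", (40, 60)),
    SceneObject("cube", "blue", (120, 60)),
    SceneObject("sphere", "red", (200, 80)),
]

# "The blue cube" -> attribute grounding only.
blue_cube = infer_target(scene, {"shape": "cube", "color": "blue"})

# "The red object to the left of the blue object" -> attributes plus a relation.
red_left = infer_target(scene, {"color": "red"}, left_of={"color": "blue"})

# No sphere lies to the left of the blue cube, so nothing is selected.
no_match = infer_target(scene, {"shape": "sphere"}, left_of={"color": "blue"})
```

Keeping the scene as explicit records, separate from the query constraints, mirrors the modular split the abstract mentions: the perception and reasoning stages can each be swapped without touching the other.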

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · User Authentication and Security Systems