An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou

TL;DR
This paper introduces the verbose-text induction attack (VTIA), which adds imperceptible perturbations to images so that VLMs generate excessively verbose outputs, exposing a vulnerability that raises the energy, latency, and token costs of current multimodal models.
Contribution
The paper presents a two-stage framework that combines adversarial prompt search with vision-aligned perturbation optimization to control output length in VLMs, an approach it describes as novel.
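To make the second stage more concrete, here is a minimal sketch of a bounded perturbation search that pulls an image's embedding toward a target embedding. It is an illustration under stated assumptions, not the authors' implementation: the linear map `W` is a hypothetical stand-in for a vision encoder, `target_emb` stands in for whatever representation the first-stage prompt search yields, and `eps`, `step`, and `iters` are illustrative values rather than settings from the paper.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def align_perturbation(image, target_emb, W, eps=8 / 255, step=1 / 255, iters=200):
    """Search for an L-infinity-bounded delta that pulls the toy embedding
    W @ (image + delta) toward target_emb. Returns the best delta seen.

    W is a hypothetical linear stand-in for a vision encoder; eps, step and
    iters are illustrative values, not settings from the paper.
    """
    delta = np.zeros_like(image)
    best_delta, best_cos = delta.copy(), cosine(W @ image, target_emb)
    t_norm = np.linalg.norm(target_emb) + 1e-12
    for _ in range(iters):
        z = W @ (image + delta)
        z_norm = np.linalg.norm(z) + 1e-12
        cos = (z @ target_emb) / (z_norm * t_norm)
        # Gradient of cosine similarity w.r.t. z, pulled back to pixel space.
        grad_z = target_emb / (z_norm * t_norm) - cos * z / z_norm**2
        grad_x = W.T @ grad_z
        # Signed-gradient ascent step, then project into the eps-ball
        # and the valid pixel range [0, 1].
        delta = np.clip(delta + step * np.sign(grad_x), -eps, eps)
        delta = np.clip(image + delta, 0.0, 1.0) - image
        current = cosine(W @ (image + delta), target_emb)
        if current > best_cos:
            best_delta, best_cos = delta.copy(), current
    return best_delta


# Toy usage with random data (all values are synthetic).
rng = np.random.default_rng(0)
image = rng.random(64)             # flattened "image" with pixels in [0, 1]
W = rng.standard_normal((16, 64))  # hypothetical linear encoder
target_emb = W @ rng.random(64)    # hypothetical target embedding
delta = align_perturbation(image, target_emb, W)
```

Because this loop touches only the encoder stand-in, it never calls a language model, which matches the reviewers' description of the second stage as independent of the VLM's textual module. A real attack would replace `W` with the target model's vision encoder and obtain gradients by automatic differentiation.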
Findings
Effective in inducing verbose outputs across multiple VLMs
Achieves significant improvements in attack efficiency and stability
Demonstrates generalization capability to different models
Abstract
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework,…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well-organized and clearly structured.
- The paper proposes the VTIA attack framework, a novel methodology that integrates reinforcement learning with vision-aligned perturbation optimization. Its two-stage decoupled design ensures that the second phase operates completely independently of the target VLM’s textual module, thereby circumventing the significant computational overhead associated with repeatedly invoking large LLMs during iterative optimization.
- Experiments are con
- **Limited Evaluation Scope**: Although experiments are conducted on four models, the evaluation could be further strengthened by including a broader and more diverse set of modern VLMs (e.g., LLaVA-NEXT, Qwen3-VL) to better demonstrate generalizability. Moreover, the use of only 100 randomly selected images from the MSCOCO dataset constitutes a relatively small sample size, raising concerns about the robustness and real-world applicability of the proposed method. It is also suggested to evalua
1. For originality, this paper proposes a novel attack objective, which focuses on generation length and verbosity as a vulnerability metric. This is original and highlights an overlooked efficiency issue in VLM deployment.
2. For clarity, the proposed two-stage pipeline combining prompt search and image-space optimization is well-formulated and intuitive.
3. For quality, the evaluations across multiple VLMs provide a convincing demonstration of the generality of the method.
1. There is a lack of comparison with representative baselines. The paper does not compare against existing multimodal attack frameworks such as VLAttack, which limits understanding of its relative effectiveness.
2. The focused problem setting is incremental. Adversarial attacks on VLMs have been extensively studied, and the contribution mainly reorients the objective toward verbosity, which may be viewed as a narrow extension rather than a fundamentally new attack paradigm.
3. The quality of
- The two-stage (search-then-align) methodological framework is clear and easy to understand.
- The method successfully induced the target open-source models to generate maximum-length tokens, demonstrating technical feasibility in the controlled setting.
- I cannot understand the fundamental attack scenario. The paper claims the attack increases "user costs", which makes no sense if the user is the attacker, as they would just be increasing their own costs. This is completely different from general adversarial attacks (general adversarial attacks assume the user is the attacker). If the paper suggests a DoS attack against the service provider, this is an inefficient vector: an attacker could simply spam the API with normal requests to achieve th
1. Novel attack objective. The paper directly maximizes the number of output tokens as the attack goal, providing a more stable and controllable objective compared to prior methods.
2. Effective two-stage decoupled attack framework. The separation of prompt-level RL optimization and image-level perturbation optimization is technically sound.
3. State-of-the-art results. The attack achieves state-of-the-art performance, significantly outperforming baseline methods across four VLMs.
1. Evaluation is insufficient. The method is evaluated on only 100 images randomly selected from MS-COCO. This sample size is too small; at least 1,000 images should be used for a more reliable assessment. Additionally, evaluation across multiple datasets is necessary to demonstrate the method’s generalizability.
2. Ablation studies lack consistency. Some experiments are conducted on four VLMs, while others are limited to only two, making it difficult to fairly assess the contributions of each
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Hate Speech and Cyberbullying Detection
