TL;DR
This paper introduces RECITE, a novel red-teaming approach exploiting visual inputs to trigger resource consumption attacks in large vision-language models, revealing security vulnerabilities and increasing resource usage.
Contribution
RECITE is the first method to exploit visual modalities for red-teaming LVLMs, using pixel-level adversarial perturbations to induce unbounded resource consumption attacks.
Findings
Increases response latency by over 26 times
Raises GPU utilization and memory consumption by 20%
Reveals security vulnerabilities in LVLMs
Abstract
Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have mainly overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECITE (Resource Consumption Red-Teaming for LVLMs), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization to obtain Output Recall Objective adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into…
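The pixel-level optimization mentioned in the abstract can be illustrated with a generic sketch. This is not the RECITE implementation: it assumes a toy linear surrogate in place of an LVLM, uses an end-of-sequence probability as a stand-in loss (the Output Recall Objective is not defined in this excerpt), and the names `eos_probability`, `pgd_linf`, `eps`, and `step` are hypothetical. It shows only the common pattern of sign-gradient steps projected onto an L-infinity budget and the valid pixel range.

```python
import math

def eos_probability(pixels, weights, bias):
    """Toy surrogate: sigmoid of a linear score over pixels (NOT a real LVLM)."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def pgd_linf(pixels, weights, bias, eps=8 / 255, step=2 / 255, iters=10):
    """Lower the toy EOS probability with sign-gradient steps inside an L-inf ball."""
    adv = list(pixels)
    for _ in range(iters):
        p = eos_probability(adv, weights, bias)
        # Analytic gradient of sigmoid(w.x + b) with respect to each pixel.
        grad = [p * (1.0 - p) * w for w in weights]
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = adv[i] - step * sign
            # Project back into the eps-ball around the clean pixel ...
            moved = min(max(moved, pixels[i] - eps), pixels[i] + eps)
            # ... and into the valid pixel range [0, 1].
            adv[i] = min(max(moved, 0.0), 1.0)
    return adv

# Hypothetical 4-pixel "image" and surrogate parameters, for illustration only.
clean = [0.2, 0.5, 0.8, 0.4]
weights = [1.5, -2.0, 0.7, 0.0]
bias = 0.3
adv = pgd_linf(clean, weights, bias)
```

In this toy setting the perturbed pixels stay within `eps` of the clean image while the surrogate's stopping probability drops. A real attack would replace the analytic gradient with backpropagation through the vision encoder and language model.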
Peer Reviews
Decision·Submitted to ICLR 2026
- The problem studied is timely.
- The paper positions the work as a red-teaming effort, but the proposed method is more accurately described as a specific attack. In general, red-teaming involves systematically identifying a range of vulnerabilities, including those without concrete exploits, and typically provides comprehensive analysis and actionable recommendations. These broader aspects are missing from the current paper.
- Figure 3 measures semantic consistency, but its relevance to a denial-of-service and red-teaming s
1. The paper demonstrates that visual inputs alone can reliably trigger severe resource consumption attacks (RCAs) in large vision-language models (LVLMs).
2. The authors conduct extensive experiments across seven LVLMs from three major families (LLaVA, Qwen, BLIP), using diverse metrics (output time, GPU utilization, memory usage) and multiple attack configurations.
3. The method section is technically thorough, with precise definitions of the Output Recall Objective and Vision Guided Optimizatio
1. The claim that this is the “first” vision-based resource consumption red-teaming for LVLMs appears overstated. Prior work such as Gao et al. (ICLR 2024) [1] also leverages visual inputs to induce high latency/energy consumption in LVLMs. The paper should clarify how RECITE differs conceptually and technically from such approaches.
2. Figure 1, which depicts the RECITE pipeline, lacks sufficient clarity. Key components, such as visual encoding, embedding projection, and the iterative perturb
- The paper introduces a new attack surface, visual inputs causing resource consumption, which has not been systematically explored before. This problem is both novel and practically relevant.
- The proposed RECITE framework is simple yet effective, providing a structured way to red-team LVLMs for resource-related vulnerabilities.
- The experimental validation is extensive, involving multiple models and metrics (output length, GPU utilization, latency, memory). The results strongly support the m
- The theoretical explanation of why visual perturbations cause looping behavior is insufficient. The paper would benefit from a formal analysis of the model’s stopping dynamics, such as EOS logit suppression or entropy evolution.
- The Output Recall Objective is largely heuristic. There is no ablation comparing it to simpler baselines such as minimizing the EOS token probability or tuning length penalties, which makes it unclear how necessary this specific objective is.
- The defense section
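The stopping-dynamics analysis requested in the review above (EOS probability and entropy over decoding steps) could be sketched as follows. This is an illustrative diagnostic, not something taken from the paper: the function names `softmax` and `stopping_trace`, the three-token vocabulary, and the logit values are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stopping_trace(step_logits, eos_id):
    """Return (eos_probability, entropy_in_nats) for each decoding step."""
    trace = []
    for logits in step_logits:
        probs = softmax(logits)
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        trace.append((probs[eos_id], entropy))
    return trace

# Hypothetical logits over a 3-token vocabulary; index 2 plays the role of EOS.
normal = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 3.0]]
looping = [[4.0, 0.0, -4.0], [4.0, 0.0, -4.0], [4.0, 0.0, -4.0]]
normal_trace = stopping_trace(normal, eos_id=2)
looping_trace = stopping_trace(looping, eos_id=2)
```

In the made-up `normal` trace the EOS probability rises toward the end of generation, while in the `looping` trace it stays near zero with low entropy at every step. That is the kind of signature such an analysis would look for in real per-step logits.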
