Resource Consumption Red-Teaming for Large Vision-Language Models

Haoran Gao; Yuanhe Zhang; Zhenhong Zhou; Lei Jiang; Fanyu Meng; Yujia Xiao; Li Sun; Kun Wang; Yang Liu; Junlan Feng

arXiv:2507.18053·cs.CR·September 29, 2025

3 Reviews

TL;DR

This paper introduces RECITE, a novel red-teaming approach exploiting visual inputs to trigger resource consumption attacks in large vision-language models, revealing security vulnerabilities and increasing resource usage.

Contribution

RECITE is the first method to exploit visual modalities for red-teaming LVLMs, using pixel-level adversarial perturbations to induce unbounded resource consumption.

Findings

01

Increases response latency by over 26 times

02

Raises GPU utilization and memory consumption by 20%

03

Reveals security vulnerabilities in LVLMs

Abstract

Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have mainly overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECITE (Resource Consumption Red-Teaming for LVLMs), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present Vision Guided Optimization, a fine-grained pixel-level optimization to obtain Output Recall Objective adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into…
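To make "pixel-level optimization" concrete, here is a minimal sketch of signed-gradient descent on a perturbation bounded in the L-infinity norm. It is not RECITE's Output Recall Objective, which this excerpt does not define. As a stand-in it lowers the end-of-sequence (EOS) probability of a toy linear "model", the simpler baseline that one of the reviews below mentions. The toy model, the function names, and the values of `epsilon`, `step`, and `iterations` are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(weights, pixels):
    """Hypothetical stand-in for a model: logits are linear in the pixels."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

def eos_probability(weights, pixels, eos_index):
    return softmax(toy_logits(weights, pixels))[eos_index]

def eos_log_prob_gradient(weights, pixels, eos_index):
    """Analytic gradient of log p(EOS) with respect to each pixel."""
    probs = softmax(toy_logits(weights, pixels))
    # d log p_eos / d logit_k = 1[k == eos] - p_k
    dlogits = [(1.0 if k == eos_index else 0.0) - p for k, p in enumerate(probs)]
    # Chain rule through the linear map: d logit_k / d pixel_j = weights[k][j]
    return [
        sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(pixels))
    ]

def pgd_suppress_eos(weights, clean_pixels, eos_index,
                     epsilon=0.1, step=0.02, iterations=20):
    """Signed-gradient descent on log p(EOS), projected after every step."""
    pixels = list(clean_pixels)
    for _ in range(iterations):
        grad = eos_log_prob_gradient(weights, pixels, eos_index)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = pixels[j] - step * sign
            # Project into the epsilon-ball around the clean pixel and into [0, 1].
            low = max(0.0, clean_pixels[j] - epsilon)
            high = min(1.0, clean_pixels[j] + epsilon)
            pixels[j] = min(high, max(low, moved))
    return pixels
```

In a real LVLM the gradient would come from automatic differentiation through the vision encoder and language model rather than from a closed-form expression. The projection step is what keeps the perturbation small relative to the clean image.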

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

- The problem studied is timely.

Weaknesses

- The paper positions the work as a red-teaming effort, but the proposed method is more accurately described as a specific attack. In general, red-teaming involves systematically identifying a range of vulnerabilities, including those without concrete exploits, and typically provides comprehensive analysis and actionable recommendations. These broader aspects are missing from the current paper.
- Figure 3 measures semantic consistency, but its relevance to a denial-of-service and red-teaming s

Reviewer 02·Rating 2·Confidence 3

Strengths

1. The paper demonstrates that visual inputs alone can reliably trigger severe resource consumption attacks (RCAs) in large vision-language models (LVLMs).
2. The authors conduct extensive experiments across seven LVLMs from three major families (LLaVA, Qwen, BLIP), using diverse metrics (Output Time, GPU Utilization, Memory Usage) and multiple attack configurations.
3. The method section is technically thorough, with precise definitions of the Output Recall Objective and Vision Guided Optimizatio

Weaknesses

1. The claim that this is the “first” vision-based resource consumption red-teaming for LVLMs appears overstated. Prior work such as Gao et al. (ICLR 2024) [1] also leverages visual inputs to induce high latency/energy consumption in LVLMs. The paper should clarify how RECITE differs conceptually and technically from such approaches.
2. Figure 1, which depicts the RECITE pipeline, lacks sufficient clarity. Key components, such as visual encoding, embedding projection, and the iterative perturb

Reviewer 03·Rating 6·Confidence 3

Strengths

- The paper introduces a new attack surface (visual inputs causing resource consumption) that has not been systematically explored before. This problem is both novel and practically relevant.
- The proposed RECITE framework is simple yet effective, providing a structured way to red-team LVLMs for resource-related vulnerabilities.
- The experimental validation is extensive, involving multiple models and metrics (output length, GPU utilization, latency, memory). The results strongly support the m

Weaknesses

- The theoretical explanation of why visual perturbations cause looping behavior is insufficient. The paper would benefit from a formal analysis of the model’s stopping dynamics, such as EOS logit suppression or entropy evolution.
- The Output Recall Objective is largely heuristic. There is no ablation comparing it to simpler baselines such as minimizing the EOS token probability or tuning length penalties, which makes it unclear how necessary this specific objective is.
- The defense section
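The stopping-dynamics analysis this reviewer asks for could begin with simple per-step diagnostics. The sketch below is an illustration only and is not taken from the paper. It assumes the per-step next-token distributions and the generated token ids are already available as plain Python lists. The function names and the `min_period`, `max_period`, and `min_repeats` defaults are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def stopping_trace(step_distributions, eos_index):
    """Return one (EOS probability, entropy) pair per decoding step."""
    return [(dist[eos_index], entropy(dist)) for dist in step_distributions]

def has_repeating_suffix(tokens, min_period=1, max_period=8, min_repeats=3):
    """True if the sequence ends with a block of tokens repeated back to back.

    A block of length `period` must occur `min_repeats` times at the very
    end of `tokens` for the sequence to count as looping.
    """
    for period in range(min_period, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            break  # longer periods need even more tokens
        tail = tokens[-needed:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(needed)):
            return True
    return False
```

Plotting the trace for a clean and a perturbed image side by side would show whether the EOS probability stays suppressed and whether entropy collapses once the output starts to loop. The suffix check is a cheap signal that a serving layer could use to stop generation early.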
