VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service

Xiasi Wang; Tianliang Yao; Simin Chen; Runqi Wang; Lei YE; Kuofeng Gao; Yi Huang; Yuan Yao

arXiv:2506.15755 · cs.CV · June 23, 2025

TL;DR

This paper introduces VLMInferSlow, a black-box evaluation method for assessing the efficiency robustness of vision-language models, revealing their vulnerability to adversarial images that significantly increase inference costs.

Contribution

The paper presents a black-box approach to evaluating the efficiency robustness of VLMs. It targets practical deployment scenarios in which models are reachable only through inference APIs, and it shows that such models are susceptible to adversarial attacks on efficiency.

Findings

01. Adversarial images can increase VLM inference costs by up to 128.47%.

02. VLMInferSlow effectively finds imperceptible perturbations that impact efficiency.

03. The study highlights the need for robustness considerations in VLM deployment.

Abstract

Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, their efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters, which is impractical in ML-as-a-service settings where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results…
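
The abstract names zero-order optimization as the search mechanism but does not spell out the procedure here. As a rough illustration of the general idea only, and not the paper's algorithm, the sketch below uses a generic two-point random-direction gradient estimator to raise a scalar cost that can only be queried, while keeping the perturbation inside an L-infinity budget. Every name in it (`estimate_gradient`, `project`, `zero_order_attack`, `toy_cost`) and every parameter value is a hypothetical choice for illustration; `toy_cost` is a stand-in for whatever efficiency signal a real inference API would expose.

```python
import random


def estimate_gradient(cost_fn, x, sigma=0.01, num_samples=50, rng=None):
    """Two-point zero-order gradient estimate from cost queries only.

    Averages directional finite differences along random Gaussian
    directions; no access to model internals is assumed.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + sigma * ui for xi, ui in zip(x, u)]
        x_minus = [xi - sigma * ui for xi, ui in zip(x, u)]
        diff = (cost_fn(x_plus) - cost_fn(x_minus)) / (2.0 * sigma)
        for i, ui in enumerate(u):
            grad[i] += diff * ui / num_samples
    return grad


def project(x_adv, x_orig, eps):
    """Clip to the L-infinity ball of radius eps around x_orig, then to [0, 1]."""
    out = []
    for xa, xo in zip(x_adv, x_orig):
        xa = min(max(xa, xo - eps), xo + eps)
        out.append(min(max(xa, 0.0), 1.0))
    return out


def zero_order_attack(cost_fn, x_orig, eps=0.05, step=0.01, iters=30, seed=0):
    """Signed ascent on the estimated gradient under a perturbation budget."""
    rng = random.Random(seed)
    x_adv = list(x_orig)
    for _ in range(iters):
        g = estimate_gradient(cost_fn, x_adv, rng=rng)
        signs = [1.0 if gi > 0 else -1.0 if gi < 0 else 0.0 for gi in g]
        x_adv = [xa + step * s for xa, s in zip(x_adv, signs)]
        x_adv = project(x_adv, x_orig, eps)
    return x_adv


def toy_cost(x):
    """Hypothetical black-box cost; a real setup would query an inference API."""
    weights = [1.0, -2.0, 0.5, 3.0]
    return sum(w * xi for w, xi in zip(weights, x))


if __name__ == "__main__":
    x0 = [0.5, 0.5, 0.5, 0.5]  # a four-"pixel" toy image
    x_adv = zero_order_attack(toy_cost, x0)
    print("original cost:", toy_cost(x0))
    print("perturbed cost:", toy_cost(x_adv))
```

In this toy setting each iteration spends `2 * num_samples` queries, which hints at why query budget matters when the cost function is a paid inference API. A real evaluation would also need a cost signal suited to VLM inference, which the abstract refers to as fine-grained efficiency modeling and which this sketch does not attempt to reproduce.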

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning