Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao

TL;DR
This paper introduces Multimodal Semantic Lighting Attacks (MSLA), a novel physical adversarial attack method that disrupts vision-language models using controllable lighting, revealing significant vulnerabilities in real-world scenarios.
Contribution
The paper presents the first systematic study of physically deployable semantic attacks on VLMs, showing that such attacks are effective and exposing a critical robustness gap.
Findings
MSLA significantly degrades zero-shot classification accuracy.
MSLA induces semantic hallucinations in image captioning and VQA.
Physical attacks using lighting are practical and transferable.
Abstract
Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework…
