When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Chengyin Hu, Xuemeng Sun, Jiaju Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long

TL;DR
This paper introduces a novel method to generate realistic non-rigid surface deformations that significantly reduce the accuracy of state-of-the-art vision-language models, revealing their vulnerability to physical perturbations.
Contribution
The paper proposes a parametric wrinkle-based perturbation technique and an optimization framework for evaluating, and exposing weaknesses in, the robustness of VLMs to realistic surface deformations.
Findings
The proposed perturbations degrade VLM performance on image captioning and visual question answering (VQA) tasks.
The method outperforms baseline approaches at creating effective adversarial surface deformations.
Perturbations are optimized for visual naturalness and adversarial strength using a hierarchical fitness function.
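To make the idea of a hierarchical fitness function concrete, here is a minimal sketch under stated assumptions. The summary does not give the paper's actual fitness terms, thresholds, or search algorithm, so everything below is hypothetical: the two-tier ordering (naturalness first, attack strength second), the `min_naturalness` threshold, the `evaluate` callback, and the use of plain random search as a stand-in for the paper's optimization-based search strategy.

```python
import random

def hierarchical_fitness(naturalness, attack, min_naturalness=0.8):
    """Return a (tier, score) tuple that compares lexicographically.

    Hypothetical two-tier scheme: candidates that fail the naturalness
    threshold sit in tier 0 and are ranked by naturalness; candidates
    that pass sit in tier 1 and are ranked by attack strength.
    """
    if naturalness < min_naturalness:
        return (0, naturalness)
    return (1, attack)

def random_search(evaluate, dim, iters=200, seed=0):
    """Plain random search over a low-dimensional parameter vector in [0, 1]^dim.

    `evaluate(params)` is an assumed callback returning (naturalness, attack).
    """
    rng = random.Random(seed)
    best_params, best_fit = None, (-1, 0.0)
    for _ in range(iters):
        params = [rng.uniform(0.0, 1.0) for _ in range(dim)]
        fit = hierarchical_fitness(*evaluate(params))
        if fit > best_fit:  # tuple comparison: tier first, then score
            best_params, best_fit = params, fit
    return best_params, best_fit

def toy_evaluate(params):
    """Toy stand-in for rendering a perturbation and querying a VLM."""
    naturalness = 1.0 - sum(params) / len(params)
    attack = max(params)
    return naturalness, attack

best_params, best_fit = random_search(toy_evaluate, dim=3)
```

The tuple ordering means any candidate that passes the naturalness threshold beats every candidate that fails it, regardless of attack score, which is one simple way to prioritize visual plausibility over raw adversarial strength.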
Abstract
Vision-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We…
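The combination of a multi-scale wrinkle field, displacement-based distortion, and matching appearance variation can be illustrated with a short NumPy sketch. This is not the paper's formulation, which the abstract does not specify. The sinusoidal ridge model, the `(amplitude, wavelength, angle, phase)` parameterization, the `warp` and `shading` factors, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def wrinkle_field(height, width, params):
    """Sum sinusoidal ridges into one height field.

    Each entry of `params` is an assumed (amplitude, wavelength, angle, phase)
    tuple; mixing short and long wavelengths gives a multi-scale field.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    field = np.zeros((height, width), dtype=np.float64)
    for amplitude, wavelength, angle, phase in params:
        # Coordinate measured across the ridge lines.
        t = xs * np.cos(angle) + ys * np.sin(angle)
        field += amplitude * np.sin(2.0 * np.pi * t / wavelength + phase)
    return field

def apply_wrinkles(image, params, warp=8.0, shading=0.3):
    """Warp and shade a float image in [0, 1] with shape (H, W, C)."""
    height, width = image.shape[:2]
    field = wrinkle_field(height, width, params)
    # Displacement follows the slope of the height field (nearest-neighbor sampling).
    dy, dx = np.gradient(field)
    ys, xs = np.mgrid[0:height, 0:width]
    src_y = np.clip(np.rint(ys + warp * dy), 0, height - 1).astype(int)
    src_x = np.clip(np.rint(xs + warp * dx), 0, width - 1).astype(int)
    warped = image[src_y, src_x]
    # Appearance variation driven by the same field, so shading and warp agree.
    total_amp = sum(abs(p[0]) for p in params) or 1.0
    shade = 1.0 + shading * field / total_amp
    return np.clip(warped * shade[..., None], 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
params = [(4.0, 40.0, 0.3, 0.0), (1.5, 12.0, 1.2, 1.0)]  # coarse and fine wrinkles
wrinkled = apply_wrinkles(image, params)
```

Because the whole perturbation is controlled by a handful of numbers per wrinkle component, a search procedure only has to explore a low-dimensional parameter space rather than per-pixel noise.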
