Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability
Xiang Chen, Yuxian Dong, Chao Li, Chengyin Hu, Jiaju Han, Fengyu Zhang, Yiwei Wei, Jiahuan Long, Jiujiang Guo

TL;DR
This paper introduces CFGPatch, a novel geometric adversarial patch framework that exploits cross-modal vulnerabilities in visible-infrared vision-language models, demonstrating high attack success and transferability across tasks.
Contribution
The paper presents CFGPatch, a unified curved-edge fractal adversarial framework that enhances attack effectiveness and cross-task transferability in VIS-IR VLMs.
Findings
CFGPatch effectively fools VIS-IR VLMs with high success rates.
Adversarial samples transfer well to image captioning and VQA tasks.
CFGPatch outperforms standard patch baselines in robustness and effectiveness.
Abstract
Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions…
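The geometric idea in the abstract (a triangular fractal whose straight edges are replaced by Bezier-curved elements) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Sierpinski-style subdivision and quadratic Bezier edges, and the function names and the `depth`, `bulge`, and `samples` parameters are hypothetical choices made only for illustration. The Fraser-spiral rendering and any patch optimization are not covered.

```python
import math

def midpoint(p, q):
    """Midpoint of two 2D points."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def sierpinski(a, b, c, depth):
    """Recursively split triangle (a, b, c) into its three corner
    sub-triangles; returns the triangles at the final depth."""
    if depth == 0:
        return [(a, b, c)]
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return (sierpinski(a, ab, ca, depth - 1)
            + sierpinski(ab, b, bc, depth - 1)
            + sierpinski(ca, bc, c, depth - 1))

def bezier_edge(p0, p1, bulge, samples=8):
    """Sample a quadratic Bezier curve from p0 to p1.

    The control point is the edge midpoint pushed along the edge normal
    by bulge * edge_length (assumed parameterization); bulge = 0 gives
    back the straight edge.
    """
    mx, my = midpoint(p0, p1)
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    # (-dy, dx) is perpendicular to the edge and has the edge's length.
    cx, cy = mx - bulge * dy, my + bulge * dx
    points = []
    for i in range(samples + 1):
        t = i / samples
        w0, w1, w2 = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((w0 * p0[0] + w1 * cx + w2 * p1[0],
                       w0 * p0[1] + w1 * cy + w2 * p1[1]))
    return points

def curved_fractal(depth=3, bulge=0.15, samples=8):
    """Return one closed curved outline (list of points) per fractal cell."""
    a, b, c = (0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)
    outlines = []
    for tri in sierpinski(a, b, c, depth):
        outline = []
        for i in range(3):
            edge = bezier_edge(tri[i], tri[(i + 1) % 3], bulge, samples)
            outline.extend(edge[:-1])  # drop the endpoint shared with the next edge
        outlines.append(outline)
    return outlines
```

Each outline could then be rasterized into a patch mask. Varying `bulge` per edge is one conceivable way to obtain the "richer directional variation" the abstract mentions, but how the paper actually parameterizes or optimizes the curves is not stated in the text above.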
