On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations

Jordan Vice; Naveed Akhtar; Yansong Gao; Richard Hartley; Ajmal Mian

arXiv:2507.22398·cs.CV·August 14, 2025

TL;DR

This paper reveals that vision-language models are vulnerable to subtle frequency-domain perturbations, which can significantly affect their performance in image captioning and DeepFake detection, raising concerns about their reliability.

Contribution

The study introduces a systematic method to perturb images in the frequency domain, exposing vulnerabilities of state-of-the-art VLMs across multiple datasets and models.

Findings

01. Frequency perturbations undermine VLM accuracy.

02. VLM judgments are sensitive to frequency cues.

03. Vulnerabilities exist under black-box conditions.

Abstract

Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations that operate in the frequency domain to systematically adjust VLM outputs when the models are exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models of different parameter sizes. Experiments across ten real and generated image datasets reveal that VLM judgments are sensitive to frequency-based cues and may…
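
The abstract does not spell out the exact transformation, so the following is only a generic sketch of what a frequency-domain perturbation can look like, not the authors' method. It assumes a simple scheme: take the 2-D FFT of an image, scale the coefficients inside a radial frequency band, and invert the transform. The function name `perturb_frequency_band` and the parameters `r_min`, `r_max`, and `gain` are hypothetical and chosen for illustration.

```python
import numpy as np


def perturb_frequency_band(image, r_min, r_max, gain):
    """Scale the spectrum of `image` inside a radial frequency band.

    Illustrative assumption only; this is not the paper's perturbation.

    image        : (H, W) or (H, W, C) float array with values in [0, 1].
    r_min, r_max : band limits as fractions of the largest radial frequency.
    gain         : multiplier applied to FFT coefficients inside the band.
    """
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape[:2]

    # Per-axis frequencies (cycles per pixel) in the same layout as fft2 output.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius = radius / radius.max()  # normalise to [0, 1]

    # Real, symmetric mask: `gain` inside the band, 1 elsewhere.
    mask = np.where((radius >= r_min) & (radius <= r_max), gain, 1.0)
    if img.ndim == 3:
        mask = mask[:, :, None]  # apply the same mask to every channel

    spectrum = np.fft.fft2(img, axes=(0, 1))
    perturbed = np.fft.ifft2(spectrum * mask, axes=(0, 1)).real

    # Keep the result a valid image.
    return np.clip(perturbed, 0.0, 1.0)
```

Under these assumptions, a call such as `perturb_frequency_band(img, 0.5, 1.0, 0.0)` would suppress the upper half of the radial frequency range, while `gain=1.0` leaves the image unchanged. A perturbed image could then be passed to a VLM to compare its caption or authenticity judgment against the output for the original image.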
