AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Yubo Cui; Xianchao Guan; Zijun Xiong; Zheng Zhang

arXiv:2603.29410 · cs.CV · April 1, 2026

TL;DR

This paper introduces AGFT, a fine-tuning framework that enhances the zero-shot adversarial robustness of vision-language models while preserving cross-modal alignment. It uses the original model's probabilistic predictions as soft supervision, together with distribution consistency techniques.

Contribution

AGFT is the first method to improve zero-shot adversarial robustness while maintaining semantic cross-modal alignment using probabilistic and calibration strategies.

Findings

01 AGFT outperforms existing methods on multiple zero-shot benchmarks.

02 AGFT significantly improves zero-shot adversarial robustness.

03 AGFT preserves cross-modal semantic structure during fine-tuning.

Abstract

Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural…
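The soft alignment idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the exact loss, so the KL-divergence objective, the temperature value, the toy feature vectors, and every function name below are assumptions chosen only for illustration. The sketch compares the frozen original model's image-to-text distribution on a clean image (the soft target) with the fine-tuned model's distribution on an adversarial image.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_distribution(image_feat, text_feats, temperature=0.07):
    """Softmax over image-to-text cosine similarities.

    The temperature of 0.07 is an assumed value, not taken from the paper.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_feats]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_alignment_loss(target_probs, pred_probs, eps=1e-12):
    """KL(target || pred): an assumed stand-in for the soft-label objective."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(target_probs, pred_probs)
    )


# Hypothetical toy features: three text embeddings (one per class prompt),
# a clean image feature from the frozen original model, and an adversarial
# image feature from the model being fine-tuned.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clean_feat_original = [0.9, 0.1]
adv_feat_finetuned = [0.2, 0.8]

target = alignment_distribution(clean_feat_original, text_feats)
pred = alignment_distribution(adv_feat_finetuned, text_feats)
loss = soft_alignment_loss(target, pred)
```

Unlike a hard one-hot label, the soft target keeps the relative similarity of the image to every text prompt, which is the property the abstract says label-based methods fail to maintain. In an actual training loop the adversarial feature would come from an attack on the input image and the loss would be minimized with respect to the fine-tuned image encoder; both steps are omitted here.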
