Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Fenghua Weng; Jian Lou; Jun Feng; Minlie Huang; Wenjie Wang

arXiv:2502.11455·cs.CR·February 18, 2025

Open Access

TL;DR

This paper introduces Adversary-aware DPO (ADPO), a novel training framework that enhances the safety and robustness of vision-language models against adversarial attacks by integrating adversarial training into the preference optimization process.

Contribution

It proposes a new adversarial training method for VLMs that explicitly considers worst-case perturbations, improving safety alignment and robustness over existing post-hoc methods.

Findings

01. ADPO outperforms baseline methods in safety alignment.

02. ADPO enhances the robustness of VLMs against jailbreak attacks.

03. ADPO improves the general utility of vision-language models.

Abstract

Safety alignment is critical when pre-training large language models (LLMs) so that they generate responses aligned with human values and refuse harmful queries. Unlike for LLMs, the current safety alignment of vision-language models (VLMs) is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly accounts for adversarial perturbations. ADPO integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining…
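
The abstract is cut off before it gives the exact ADPO objective, so the following is only a minimal sketch of the general idea: the standard DPO loss on a winner-loser pair, evaluated at the worst of several candidate perturbations. The function names, the `toy_scores` scorer, and the finite candidate set are all assumptions made for illustration. Adversarial training in practice usually searches for the perturbation with gradient steps on the image input, and the paper's own loss may differ from this.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the winner
    # over the loser, relative to the reference model.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


def worst_case_dpo_loss(score_fn, perturbations, beta=0.1):
    """Largest DPO loss over a finite set of candidate perturbations.

    score_fn(delta) is a hypothetical scorer returning the four log-probs
    (policy winner, policy loser, reference winner, reference loser) when
    the input is perturbed by delta. A finite candidate set stands in here
    for a gradient-based search; this is an assumption, not the paper's method.
    """
    return max(dpo_loss(*score_fn(delta), beta=beta) for delta in perturbations)


def toy_scores(delta):
    """Made-up scorer: larger delta pushes the policy toward the loser."""
    return (-2.0 - delta, -3.0 + delta, -2.5, -2.5)


clean = dpo_loss(*toy_scores(0.0))
worst = worst_case_dpo_loss(toy_scores, [0.0, 0.5, 1.0])
print(f"clean loss: {clean:.4f}, worst-case loss: {worst:.4f}")
```

Training against `worst` rather than `clean` is what makes the objective adversary-aware in this sketch: the model is penalized for the perturbation that most erodes its preference for the safe response.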

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling