Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, Wenjie Wang

TL;DR
This paper introduces Adversary-aware DPO (ADPO), a novel training framework that enhances the safety and robustness of vision-language models against adversarial attacks by integrating adversarial training into the preference optimization process.
Contribution
It proposes a new adversarial training method for VLMs that explicitly considers worst-case perturbations, improving safety alignment and robustness over existing post-hoc methods.
Findings
ADPO outperforms baseline methods in safety alignment.
Enhances robustness of VLMs against jailbreak attacks.
Improves general utility of vision-language models.
Abstract
Safety alignment is critical in pre-training large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of vision-language models (VLMs) is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly considers adversarial perturbations. ADPO integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversarial-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining…
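To make the idea of a preference loss under worst-case perturbations concrete, here is a minimal sketch. It is not the paper's implementation. The `dpo_loss` function is the standard Direct Preference Optimization (DPO) objective for a single winner-loser pair. Everything else is a hypothetical illustration: `toy_logps` is an invented stand-in for a VLM whose log-probabilities shift with a scalar image perturbation `delta`, and the grid search in `worst_case_loss` replaces the gradient-based attack that adversarial training would normally use for the inner maximization.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair of sequence log-probabilities.

    pol_* come from the policy being trained, ref_* from the frozen
    reference model; w is the preferred (winner) response, l the loser.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def toy_logps(delta: float):
    """Hypothetical toy model (assumption, not from the paper).

    A positive perturbation lowers the winner's log-prob and raises the
    loser's, mimicking an attack that pushes toward the harmful response.
    """
    pol_w = -4.0 - 20.0 * delta
    pol_l = -6.0 + 20.0 * delta
    ref_w, ref_l = -5.0, -5.0
    return pol_w, pol_l, ref_w, ref_l


def worst_case_loss(eps: float, steps: int = 11, beta: float = 0.1):
    """Inner maximization: largest DPO loss over |delta| <= eps.

    Uses a simple grid search as a stand-in for a gradient-based attack.
    Returns (worst_loss, worst_delta).
    """
    grid = [-eps + i * (2.0 * eps / (steps - 1)) for i in range(steps)]
    return max((dpo_loss(*toy_logps(d), beta=beta), d) for d in grid)


clean = dpo_loss(*toy_logps(0.0))
worst, worst_delta = worst_case_loss(eps=0.1)
print(f"clean loss: {clean:.3f}, worst-case loss: {worst:.3f} "
      f"at delta={worst_delta:+.2f}")
```

An adversarial training loop would then minimize the worst-case loss rather than the clean one, so that the policy keeps preferring the safe response even under the most damaging perturbation in the allowed budget `eps`.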
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
