Adversarial Preference Learning for Robust LLM Alignment

Yuanfu Wang; Pengyu Wang; Chenyang Xi; Bo Tang; Junyi Zhu; Wenqiang Wei; Chen Chen; Chao Yang; Jingfeng Zhang; Chaochao Lu; Yijun Niu; Keming Mao; Zhiyu Li; Feiyu Xiong; Jie Hu; Mingchuan Yang

arXiv:2505.24369 · cs.LG · June 2, 2025


TL;DR

This paper introduces Adversarial Preference Learning (APL), a novel iterative adversarial training method that enhances the robustness of large language models against adversarial attacks while maintaining utility.

Contribution

The paper proposes APL, an adversarial training framework that combines a direct harmfulness metric, a conditional generative attacker, and an automated feedback loop to improve LLM robustness.

Findings

01. Achieves an 83.33% harmlessness win rate over the base model

02. Reduces harmful outputs from 5.88% to 0.43%

03. Lowers attack success rate by up to 65%
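
To put the second finding in perspective, the short sketch below converts the reported before/after harmful-output rates into an absolute and a relative reduction. Only the two percentages come from the findings; the helper name `relative_reduction` is illustrative. The summary does not say whether the "up to 65%" attack success rate figure is relative or absolute, so that number is not recomputed here.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (same units for both)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Harmful-output rates in percent, taken from the findings list.
harmful_before = 5.88
harmful_after = 0.43

drop = relative_reduction(harmful_before, harmful_after)
print(f"absolute drop: {harmful_before - harmful_after:.2f} percentage points")
print(f"relative drop: {drop:.1%}")
```

Under this reading, the harmful-output rate falls by 5.45 percentage points, which is roughly a 92.7% relative reduction.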

Abstract

Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and…
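
The abstract does not give the formula for the harmfulness metric, so the following is only a rough sketch of what a score "based on the model's intrinsic preference probabilities" could look like. It assumes a Bradley-Terry-style preference between a safe and a harmful response, computed from sequence log-probabilities with an optional reference model. The function names, the `beta` scale, and the reference terms are all hypothetical and are not taken from the paper.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(
    logp_safe: float,
    logp_harmful: float,
    ref_logp_safe: float = 0.0,
    ref_logp_harmful: float = 0.0,
    beta: float = 1.0,
) -> float:
    """Assumed Bradley-Terry-style probability that the model prefers the
    safe response over the harmful one (hypothetical formulation)."""
    margin = (logp_safe - ref_logp_safe) - (logp_harmful - ref_logp_harmful)
    return _sigmoid(beta * margin)


def harmfulness_score(logp_safe: float, logp_harmful: float, **kwargs) -> float:
    """Hypothetical score: probability mass on preferring the harmful response."""
    return 1.0 - preference_probability(logp_safe, logp_harmful, **kwargs)


# Illustrative log-probabilities (made up): the model finds the harmful
# response more likely than the safe one, so the score exceeds 0.5.
score = harmfulness_score(logp_safe=-12.0, logp_harmful=-9.0)
print(f"harmfulness score: {score:.4f}")
```

A score of this kind needs no external judge, which matches the abstract's claim of "eliminating reliance on external assessment"; whether APL uses this exact form is an assumption.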

Taxonomy

Methods: Balanced Selection