wDPO: Winsorized Direct Preference Optimization for Robust LLM Alignment
Jilong Liu, Yonghui Yang, Pengyang Shao, Haokai Ma, Wei Qin, Richang Hong

TL;DR
wDPO introduces a hierarchical winsorization method that makes large language model alignment more robust by applying targeted interventions to different types of noise in preference data during training.
Contribution
The paper proposes wDPO, a novel hierarchical winsorization approach that targets specific noise types in preference data, enhancing robustness over existing DPO variants.
Findings
wDPO outperforms vanilla DPO and baselines on safety benchmarks.
wDPO shows significant robustness under label-flip noise.
Hierarchical interventions improve preference alignment quality.
Abstract
Direct Preference Optimization (DPO) aligns large language models by optimizing pairwise preferences and has shown remarkable effectiveness as a simple and scalable alternative to RLHF. However, in practice, preference data are often noisy. Existing robust variants of DPO mainly rely on uniform objective modifications or global reweighting. While partially effective, these methods treat noisy samples as a homogeneous source of uncertainty and fail to distinguish between different noise types, leading to sub-optimal alignment robustness. In this work, we show that robust preference alignment benefits from addressing different noise types with targeted interventions rather than uniform regularization. We propose winsorized Direct Preference Optimization (wDPO), a robust LLM alignment approach with hierarchical winsorization. Specifically, wDPO adopts a reward-free hierarchical…
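For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch of the standard per-pair DPO loss and of generic winsorization (clipping extreme values to a quantile). It is not the paper's hierarchical procedure, which the truncated abstract does not specify. The function names, the `upper_q` parameter, and the choice to cap per-sample losses are all assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin is the policy-vs-reference log-ratio of the chosen response
    minus that of the rejected response.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def winsorize_upper(values, upper_q=0.9):
    """Generic one-sided winsorization (illustrative, not the paper's scheme).

    Values above the empirical upper_q quantile (nearest-rank) are clipped
    down to that quantile; the input order is preserved.
    """
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(upper_q * len(ordered)) - 1))
    cap = ordered[idx]
    return [min(v, cap) for v in values]


# Hypothetical batch of (policy, reference) log-probabilities; the last pair
# mimics a flipped label, which yields an unusually large loss.
batch = [
    (-10.0, -14.0, -11.0, -13.0),
    (-9.0, -12.0, -10.0, -12.0),
    (-8.0, -11.0, -9.0, -10.5),
    (-40.0, -5.0, -12.0, -12.0),
]
losses = [dpo_loss(pc, pr, rc, rr) for pc, pr, rc, rr in batch]
capped = winsorize_upper(losses, upper_q=0.75)
```

In this sketch the outlying loss from the last pair is capped at the 75th-percentile loss, so a single suspicious pair cannot dominate the batch average. Which quantities wDPO actually winsorizes, and at which levels of its hierarchy, should be taken from the full paper.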
Taxonomy
Topics: Machine Learning and Data Classification · Constraint Satisfaction and Optimization · Advanced Multi-Objective Optimization Algorithms
