TL;DR
This paper introduces SymPO (Symmetric Preference Optimization), which trains reward models and offline preference objectives with symmetric losses so that policy optimization still succeeds when human preference labels are noisy.
Contribution
The paper treats reward modeling as a classification problem, applies symmetric losses to it, and proves that the learned reward remains rank-preserving under noisy preference labels.
Findings
SymPO outperforms traditional methods on noisy preference data.
Symmetric losses preserve reward ranking under label noise.
Theoretical analysis confirms robustness of SymPO.
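Worked note
The robustness finding can be illustrated with a short derivation. This is a sketch under assumptions not stated in this summary: the standard classification definition of a symmetric loss (the loss at z and at -z sum to a constant K), a hypothetical reward margin z between the preferred and rejected responses, and a uniform label-flip rate \varepsilon below 1/2. The paper's own noise model and proof may be more general.

```latex
% Assumed definition of a symmetric loss with constant K:
\ell(z) + \ell(-z) = K \quad \text{for all } z \in \mathbb{R}

% Hypothetical margin for a preference pair (y_w preferred over y_l):
z = r(x, y_w) - r(x, y_l)

% Risk when each label is flipped independently with probability \varepsilon:
R_\varepsilon(r)
  = \mathbb{E}\big[(1-\varepsilon)\,\ell(z) + \varepsilon\,\ell(-z)\big]
  = \mathbb{E}\big[(1-\varepsilon)\,\ell(z) + \varepsilon\,(K - \ell(z))\big]
  = (1 - 2\varepsilon)\,\mathbb{E}[\ell(z)] + \varepsilon K
```

Because 1 - 2\varepsilon > 0 when \varepsilon < 1/2, the noisy risk is a positive rescaling of the clean risk plus a constant, so under these assumptions both risks are minimized by the same reward functions.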
Abstract
Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences, viewing reward modeling as a classification problem. This allows us to leverage symmetric losses, known for their robustness to label noise in classification, leading to our Symmetric Preference Optimization (SymPO) method. We prove that symmetric losses enable successful policy optimization even under noisy labels, as the resulting reward remains rank-preserving -- a…
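Example
The abstract relies on symmetric losses without defining them here. The following is a minimal sketch, assuming the standard classification definition (loss(z) + loss(-z) is constant) and using the sigmoid loss as one well-known symmetric example. The function names and the margin values are illustrative and are not taken from the paper; the margin stands for a hypothetical reward difference between the preferred and rejected responses.

```python
import math


def sigmoid_loss(z):
    """Sigmoid loss sigma(-z) = 1 / (1 + exp(z)); symmetric with constant 1."""
    return 1.0 / (1.0 + math.exp(z))


def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)); shown for contrast, not symmetric."""
    return math.log1p(math.exp(-z))


def noisy_expected_loss(loss, margin, flip_rate):
    """Expected loss on one pair whose label is flipped with prob. flip_rate.

    A flipped label swaps preferred and rejected, which negates the margin.
    """
    return (1.0 - flip_rate) * loss(margin) + flip_rate * loss(-margin)


# Illustrative reward margins (assumed values, not data from the paper).
margins = [-2.0, -0.5, 0.0, 1.0, 3.0]

# For a symmetric loss this sum is the same constant at every margin.
sigmoid_sums = [sigmoid_loss(z) + sigmoid_loss(-z) for z in margins]
logistic_sums = [logistic_loss(z) + logistic_loss(-z) for z in margins]

# With a symmetric loss, the noisy loss is an affine function of the clean
# loss: (1 - 2 * flip_rate) * clean + flip_rate * constant.
flip_rate = 0.3
noisy = [noisy_expected_loss(sigmoid_loss, z, flip_rate) for z in margins]
affine = [(1 - 2 * flip_rate) * sigmoid_loss(z) + flip_rate for z in margins]

if __name__ == "__main__":
    print("sigmoid  l(z)+l(-z):", [round(s, 6) for s in sigmoid_sums])
    print("logistic l(z)+l(-z):", [round(s, 6) for s in logistic_sums])
```

The sigmoid sums are constant across margins while the logistic sums vary, which is why the logistic loss does not inherit the same noise-cancellation argument in this simplified setting.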
