Reverse Preference Optimization for Complex Instruction Following
Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li

TL;DR
This paper introduces Reverse Preference Optimization (RPO), a novel method for improving instruction following in large language models by dynamically reversing constraints to reduce noise and enhance alignment with complex multi-constraint instructions.
Contribution
RPO is a new approach that reverses constraints in preference pairs to improve robustness and effectiveness in complex instruction following tasks.
Findings
RPO outperforms the DPO baseline by 4.6 and 2.5 points on two benchmarks.
RPO scales effectively from 8B to 70B parameters.
The 70B RPO model surpasses GPT-4o in performance.
Abstract
Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
