TL;DR
This paper introduces RMiPO, a new framework for offline preference optimization that uses intrinsic mutual information to improve performance and reduce hyperparameter tuning overhead in aligning LLMs with human values.
Contribution
RMiPO is a lightweight, efficient method that dynamically modulates preference contributions using intrinsic mutual information, outperforming existing approaches.
Findings
Achieves consistently superior performance over existing methods.
Reduces training overhead by more than 15%.
Leverages intrinsic response-level mutual information for preference optimization.
Abstract
Offline preference optimization methods, such as Direct Preference Optimization (DPO), offer significant advantages in aligning Large Language Models (LLMs) with human values. However, achieving optimal performance with these methods typically involves additional hyperparameter tuning, resulting in substantial time overhead. Although prior work has proposed a range of improvements, these methods remain limited in effectiveness and have not fully eliminated reliance on hyperparameter tuning. In this work, we propose RMiPO, a lightweight and efficient framework for offline preference optimization. RMiPO leverages intrinsic Response-level Mutual information for Preference Optimization with hyperparameter modulation, dynamically decoupling preference contributions at negligible additional computational cost. Extensive experimental results demonstrate that RMiPO achieves consistently…
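For context on the hyperparameter tuning the abstract refers to, the standard DPO objective for a single preference pair can be sketched as follows. This is the widely used baseline DPO loss, not RMiPO itself, whose formulation is not given in this summary. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; `beta` is the kind of hyperparameter that typically has to be tuned by hand.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and the frozen reference model. All names here are
    hypothetical and chosen for illustration only.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # beta scales how strongly the preference margin is enforced.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The same pair scored under several beta values shows how sensitive
    # the loss is to this single fixed hyperparameter.
    for beta in (0.01, 0.1, 0.5):
        print(beta, dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=beta))
```

In plain DPO, `beta` is one fixed scalar applied to every preference pair, which is why its value usually needs a sweep. According to the abstract, RMiPO instead modulates this kind of hyperparameter using response-level mutual information.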
