Meta-Learning Objectives for Preference Optimization
Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh

TL;DR
This paper introduces a diagnostic benchmark suite for preference optimization algorithms using MuJoCo tasks, proposes a new class of algorithms called Mirror Preference Optimization, and demonstrates their superior performance in both MuJoCo and LLM alignment tasks.
Contribution
It develops a controlled benchmark for preference optimization, introduces a novel family of mirror descent-based algorithms, and applies the resulting insights to improve LLM alignment performance.
Findings
The PO algorithms discovered through evolutionary search outperform all known algorithms in the targeted MuJoCo settings.
Designed a diagnostic suite enabling systematic evaluation of PO algorithms.
Achieved significant improvements in LLM alignment using the new PO algorithm.
Abstract
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based…
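For readers unfamiliar with what a PO objective looks like in practice, the sketch below computes the standard Direct Preference Optimization (DPO) loss for a single preference pair. This is a well-known baseline objective, not the MPO objective, whose exact form is not given in this summary. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Illustrative baseline only; NOT the paper's MPO objective.
    Inputs are log-probabilities of each response under the trained
    policy and under a frozen reference policy.
    """
    # Scaled difference of log-ratios between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses, the margin is zero and the loss equals log 2; raising the chosen response's log-probability relative to the rejected one lowers the loss.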
Taxonomy
Topics: Decision-Making and Behavioral Economics
Methods: Direct Preference Optimization · Parrot optimizer: Algorithm and applications to medical problems
