Meta-Learning Objectives for Preference Optimization

Carlo Alfano; Silvia Sapora; Jakob Nicolaus Foerster; Patrick Rebeschini; Yee Whye Teh

arXiv:2411.06568 · cs.LG · January 9, 2026

Open Access

TL;DR

This paper introduces a diagnostic benchmark suite for preference optimization algorithms using MuJoCo tasks, proposes a new class of algorithms called Mirror Preference Optimization, and demonstrates their superior performance in both MuJoCo and LLM alignment tasks.

Contribution

It develops a controlled benchmark for preference optimization, introduces a novel mirror descent-based algorithm family, and applies insights to improve LLM alignment performance.

Findings

01. The discovered PO algorithms outperform existing methods in the targeted MuJoCo tasks.

02. A diagnostic suite of MuJoCo tasks and datasets enables systematic evaluation of PO algorithms.

03. The new PO algorithm achieves significant improvements in LLM alignment.

Abstract

Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that involves prohibitive costs, noise, and several confounding variables such as model size and hyper-parameters. In this work, we show that it is possible to gain insights into the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based…
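For readers unfamiliar with what a PO objective looks like, the sketch below computes the standard Direct Preference Optimization (DPO) loss for a single preference pair. DPO is one of the existing baselines in this area; this is not the MPO objective proposed in the paper, whose exact form is not given on this page. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are log-probabilities of full responses under the trained
    policy (logp_*) and a frozen reference policy (ref_logp_*).
    Names and the default beta are assumptions made for this sketch.
    """
    # Scaled difference of log-ratios between chosen and rejected responses.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a zero margin.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# A policy that raises the chosen response and lowers the rejected one.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy shifts probability toward the chosen response relative to the reference. A search over a family of objectives, as described in the abstract, would vary the form of this per-pair loss rather than keep it fixed.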

Taxonomy

Topics: Decision-Making and Behavioral Economics

Methods: Direct Preference Optimization · Parrot optimizer: Algorithm and applications to medical problems