Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees

Yongtao Wu; Luca Viano; Yihang Chen; Zhenyu Zhu; Kimon Antonakopoulos; Quanquan Gu; Volkan Cevher

arXiv:2502.12678 · cs.LG · May 27, 2025

Open Access

TL;DR

This paper models multi-turn language model alignment as a Markov game and introduces OMPO, an optimistic online gradient descent method with proven convergence guarantees, improving upon existing bandit-based approaches.

Contribution

It formulates the alignment problem as a multi-step Markov game and proposes OMPO, a novel algorithm with theoretical convergence guarantees for multi-turn preference optimization.

Findings

01. OMPO converges to an ε-approximate Nash equilibrium within O(ε^{-1}) policy updates.

02. The method outperforms existing approaches on multi-turn conversation datasets.

03. Theoretical analysis confirms convergence properties of OMPO.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Optimistic Multi-step Preference Optimization (OMPO), is built upon the optimistic online mirror descent…
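To make the game-theoretic framing concrete, here is a minimal sketch of optimistic mirror descent with a KL regularizer (optimistic multiplicative weights) in self-play on a single-state preference game. It is not OMPO itself, which operates over multi-step conversations with language model policies. The preference matrix `P`, the step size `eta`, the step count, and all function names are hypothetical choices made for illustration. The preferences are deliberately cyclic, so no Bradley-Terry scores can reproduce them.

```python
import math

# Hypothetical preference matrix: P[i][j] is the probability that response i
# is preferred to response j, with P[i][j] + P[j][i] = 1 (constant-sum game).
# Preferences are cyclic: 0 beats 1, 1 beats 2, 2 beats 0.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def win_rates(P, policy):
    """Win rate of each pure response against an opponent sampling from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]

def exploitability(P, policy):
    """Best-response win rate minus 0.5, the value of this symmetric game."""
    return max(win_rates(P, policy)) - 0.5

def optimistic_mwu(P, policy, eta=0.3, steps=10000):
    """Self-play optimistic multiplicative weights (illustrative settings)."""
    prev_g = win_rates(P, policy)
    for _ in range(steps):
        g = win_rates(P, policy)
        # Optimistic step: 2*g_t - g_{t-1} acts as a guess of the next gradient.
        logits = [
            math.log(p) + eta * (2.0 * gi - pgi)
            for p, gi, pgi in zip(policy, g, prev_g)
        ]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        policy = [w / z for w in weights]
        prev_g = g
    return policy

start = [0.7, 0.2, 0.1]
final = optimistic_mwu(P, start)
print("start exploitability:", exploitability(P, start))
print("final policy:", final)
print("final exploitability:", exploitability(P, final))
```

In this toy game the only equilibrium is the uniform policy, and the exploitability of the iterate should shrink toward zero as the policy approaches it. Dropping the optimistic correction (using `g` alone in the update) is a simple way to see why the extra term matters in cyclic games, where plain multiplicative weights tends to orbit the equilibrium rather than settle on it.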

Taxonomy

Topics: Optimization and Search Problems · Auction Theory and Applications · Scheduling and Optimization Algorithms

Methods: Direct Preference Optimization