Multi-Objective Reward and Preference Optimization: Theory and Algorithms

Akhil Agnihotri

arXiv:2512.10601·cs.LG·December 12, 2025

Open Access

TL;DR

This thesis develops new algorithms and theoretical frameworks for constrained reinforcement learning, preference learning, and large language model alignment, achieving state-of-the-art performance and scalability.

Contribution

The thesis introduces algorithms including ACPO, e-COP, warmPref-PS, PSPL, and MOPO, unifying constrained RL and preference-based methods with theoretical guarantees and practical scalability.

Findings

01. ACPO achieves state-of-the-art empirical performance with theoretical guarantees.

02. e-COP provides provable performance in episodic constrained RL environments.

03. MOPO scales to multi-billion-parameter language models for alignment.

Abstract

This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences.…
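For readers unfamiliar with the setting, the average-cost CMDP problem mentioned in the abstract can be written in its standard textbook form as below. This is a sketch under assumed notation, not the thesis's own statement: the symbols $r$, $c_i$, $l_i$, $m$, and $J$ are illustrative choices, and the reward-maximization convention is an assumption.

```latex
% Long-run average of a per-step signal f (the reward r or a cost c_i) under policy \pi.
% Notation is assumed for illustration and is not taken from the thesis.
J_f(\pi) \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{N-1} f(s_t, a_t)\right]

% Constrained problem: maximize average reward subject to m average-cost limits l_i.
\max_{\pi} \; J_r(\pi)
\quad \text{subject to} \quad
J_{c_i}(\pi) \le l_i, \qquad i = 1, \dots, m
```

The episodic (finite-horizon) setting addressed by e-COP would replace the long-run average with an expected sum over a fixed horizon.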

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference