Multi-Objective Reward and Preference Optimization: Theory and Algorithms
Akhil Agnihotri

TL;DR
This thesis develops algorithms and theoretical frameworks for constrained reinforcement learning, preference learning, and large language model alignment, and reports state-of-the-art empirical performance together with scalability to large models.
Contribution
The thesis introduces the algorithms ACPO, e-COP, warmPref-PS, PSPL, and MOPO, unifying constrained RL and preference-based methods with theoretical guarantees and practical scalability.
Findings
ACPO achieves state-of-the-art empirical performance with theoretical guarantees.
e-COP comes with provable performance guarantees for episodic constrained RL environments.
MOPO scales to multi-billion-parameter language models for alignment.
Abstract
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences.…
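For readers less familiar with the setting, the two constrained problems named in the abstract are commonly written as below. This is a sketch of the standard textbook formulation, not necessarily the thesis's own notation: the symbols r (reward), c_i (cost functions), l_i (cost thresholds), m (number of constraints), H (horizon), and \pi (policy) are assumptions introduced here for illustration.

```latex
% Average-cost CMDP (the setting ACPO addresses), in assumed notation:
% long-run average of a per-step signal f under policy \pi.
\[
  J_f(\pi) \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{N-1} f(s_t, a_t) \right],
  \qquad f \in \{ r, c_1, \dots, c_m \}.
\]
\[
  \max_{\pi} \; J_r(\pi)
  \quad \text{subject to} \quad
  J_{c_i}(\pi) \le l_i, \qquad i = 1, \dots, m.
\]

% Episodic (finite-horizon) CMDP (the setting e-COP addresses), in assumed notation:
% expected cumulative signal over a fixed horizon H.
\[
  J^{H}_f(\pi) \;=\;
  \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{h=1}^{H} f_h(s_h, a_h) \right],
  \qquad
  \max_{\pi} \; J^{H}_r(\pi)
  \quad \text{subject to} \quad
  J^{H}_{c_i}(\pi) \le l_i, \qquad i = 1, \dots, m.
\]
```

The difference between the two settings lies in the objective: the average-cost criterion takes a limit over an infinite trajectory, while the episodic criterion sums over a fixed number of steps, which is why a separate policy difference lemma is needed for the episodic case.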
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
