Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Yining Li; Peizhong Ju; Ness Shroff

arXiv:2602.22146 · cs.LG · February 26, 2026

TL;DR

This paper introduces an optimistic primal-dual algorithm with provable last-iterate convergence for safe reinforcement learning from human feedback, improving training stability and strengthening the theoretical guarantees for aligning large language models.

Contribution

It proposes a universal primal-dual framework that unifies existing safe alignment algorithms, and builds on it with an optimistic primal-dual method that guarantees last-iterate convergence in safe RLHF.
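For orientation, the constrained problem and the two update rules being contrasted can be written generically as below. This is a sketch under assumed notation (policy parameters θ, reward r, costs c_i with thresholds b_i, multipliers λ, step size η), not the paper's exact formulation. The optimistic rule shown is the common "2 × current gradient minus previous gradient" form; the paper's OPD updates may differ in detail.

```latex
% Constrained alignment objective (assumed generic form)
\max_{\theta}\; J_r(\theta)\quad\text{s.t.}\quad J_{c_i}(\theta)\le b_i,\;\; i=1,\dots,m,
\qquad J_f(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[f(x,y)\big]

% Lagrangian and saddle-point problem \max_\theta \min_{\lambda\ge 0}
\mathcal{L}(\theta,\lambda)=J_r(\theta)-\sum_{i=1}^{m}\lambda_i\big(J_{c_i}(\theta)-b_i\big),
\qquad \lambda\in\mathbb{R}^{m}_{\ge 0}

% Standard primal-dual (simultaneous gradient ascent on \theta, projected descent on \lambda)
\theta_{t+1}=\theta_t+\eta\,\nabla_\theta\mathcal{L}(\theta_t,\lambda_t),
\qquad \lambda_{t+1}=\big[\lambda_t-\eta\,\nabla_\lambda\mathcal{L}(\theta_t,\lambda_t)\big]_+

% Optimistic primal-dual (predictive step using the previous gradient)
\theta_{t+1}=\theta_t+\eta\big(2\nabla_\theta\mathcal{L}(\theta_t,\lambda_t)-\nabla_\theta\mathcal{L}(\theta_{t-1},\lambda_{t-1})\big)
\lambda_{t+1}=\big[\lambda_t-\eta\big(2\nabla_\lambda\mathcal{L}(\theta_t,\lambda_t)-\nabla_\lambda\mathcal{L}(\theta_{t-1},\lambda_{t-1})\big)\big]_+
```

Since $\nabla_{\lambda_i}\mathcal{L}=-(J_{c_i}(\theta)-b_i)$, the standard dual step raises $\lambda_i$ exactly when constraint $i$ is violated.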

Findings

01. Proves last-iterate convergence for the proposed method.

02. Demonstrates stability improvements over standard primal-dual methods.

03. Shows convergence to a neighborhood of the optimal solution.

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to…
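The difference between standard and optimistic primal-dual updates in the last iterate can be seen on a tiny example. The sketch below uses a hypothetical one-dimensional toy problem that is not from the paper: maximize r(x) = x subject to the cost constraint x ≤ b, with Lagrangian L(x, λ) = x − λ(x − b) and saddle point (x*, λ*) = (b, 1). The function names (`lagrangian_grads`, `run`, `dist_to_saddle`) are illustrative only. The optimistic variant is plain optimistic gradient descent-ascent, used here as a stand-in and not as the paper's exact OPD algorithm.

```python
import math


def lagrangian_grads(x, lam, b):
    """Gradients of the hypothetical toy Lagrangian L(x, lam) = x - lam * (x - b)."""
    grad_x = 1.0 - lam       # dL/dx
    grad_lam = -(x - b)      # dL/dlam
    return grad_x, grad_lam


def run(method, steps=3000, eta=0.1, b=1.0, x0=0.5, lam0=0.5):
    """Run primal ascent on x and projected dual descent on lam; return the last iterate."""
    x, lam = x0, lam0
    # Initialize the "previous" gradient to the first one, so step 1 is a plain step.
    prev_gx, prev_gl = lagrangian_grads(x, lam, b)
    for _ in range(steps):
        gx, gl = lagrangian_grads(x, lam, b)
        if method == "optimistic":
            # Optimistic (predictive) direction: 2 * g_t - g_{t-1}.
            dx, dl = 2.0 * gx - prev_gx, 2.0 * gl - prev_gl
        else:
            # Standard simultaneous gradient descent-ascent.
            dx, dl = gx, gl
        x = x + eta * dx                   # primal ascent on L
        lam = max(0.0, lam - eta * dl)     # dual descent on L, projected onto lam >= 0
        prev_gx, prev_gl = gx, gl
    return x, lam


def dist_to_saddle(x, lam, b=1.0):
    """Euclidean distance from (x, lam) to the saddle point (b, 1)."""
    return math.hypot(x - b, lam - 1.0)


if __name__ == "__main__":
    for method in ("standard", "optimistic"):
        x, lam = run(method)
        print(method, dist_to_saddle(x, lam))
```

On this bilinear toy problem, each unprojected standard step multiplies the distance to the saddle by sqrt(1 + η²), and a projected step leaves λ = 0, i.e. at distance at least 1. So the standard last iterate never settles; it keeps orbiting at distance ≥ 1. The optimistic recursion has both characteristic roots strictly inside the unit circle for 0 < η < 1/2, so its last iterate converges linearly to (b, 1).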


Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Natural Language Processing Techniques