Alignment of large language models with constrained learning
Botong Zhang, Shuo Li, Ignacio Hounie, Osbert Bastani, Dongsheng Ding, Alejandro Ribeiro

TL;DR
This paper introduces a dual-based iterative method for aligning large language models with constraints, addressing convergence issues of previous methods and providing theoretical guarantees of near-optimality.
Contribution
The paper develops a novel dual-based alignment algorithm for LLMs that guarantees convergence to near-optimal constrained policies, with theoretical analysis and empirical validation.
Findings
The proposed method converges to near-optimal constrained policies.
Theoretical bounds on the primal-dual gap are established.
Experimental results demonstrate the method's effectiveness on RLHF datasets.
Abstract
We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM…
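The alternation described in the abstract (Lagrangian maximization for the policy, then dual descent for the multiplier) can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a finite set of candidate responses, a KL-regularized objective with strength `beta`, and a single constraint of the form E[g] >= `threshold`. Under those assumptions the Lagrangian maximizer has a closed form in the distribution space, so no LLM parameters are trained. All function names, rewards, utilities, and hyperparameters below are hypothetical.

```python
import math

def expected(pi, values):
    """Expectation of `values` under the distribution `pi`."""
    return sum(p * v for p, v in zip(pi, values))

def lagrangian_maximizer(pi_ref, reward, utility, lam, beta):
    """Maximize E[r + lam * g] - beta * KL(pi || pi_ref) over distributions.

    The maximizer is pi(y) proportional to pi_ref(y) * exp((r(y) + lam * g(y)) / beta).
    """
    logits = [math.log(p) + (r + lam * g) / beta
              for p, r, g in zip(pi_ref, reward, utility)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def dual_alignment(pi_ref, reward, utility, threshold,
                   beta=1.0, step=0.5, iters=500):
    """Alternate policy updates and projected dual descent (toy setting)."""
    lam = 0.0
    for _ in range(iters):
        # Policy update: maximize the Lagrangian for the current multiplier.
        pi = lagrangian_maximizer(pi_ref, reward, utility, lam, beta)
        # Dual update: the dual gradient is the constraint slack E[g] - threshold.
        slack = expected(pi, utility) - threshold
        lam = max(0.0, lam - step * slack)  # keep the multiplier nonnegative
    return lagrangian_maximizer(pi_ref, reward, utility, lam, beta), lam

# Hypothetical toy instance: four responses, helpfulness-style reward r,
# safety-style utility g, and the requirement E[g] >= 0.6.
pi_ref = [0.25, 0.25, 0.25, 0.25]
reward = [1.0, 0.8, 0.2, 0.0]
utility = [0.0, 0.3, 0.9, 1.0]
threshold = 0.6

pi, lam = dual_alignment(pi_ref, reward, utility, threshold)
print("multiplier:", round(lam, 4))
print("E[g]:", round(expected(pi, utility), 4))
print("E[r]:", round(expected(pi, reward), 4))
```

In this toy instance the unconstrained policy (multiplier zero) violates the constraint, so the multiplier grows until the expected utility meets the threshold. In the paper's setting the policy update is carried out in the LLM parameter space instead, which is where the gap between the distribution-space primal value and the parameter-space dual value arises.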
Taxonomy
Topics: Natural Language Processing Techniques
