Information Theoretic Guarantees For Policy Alignment In Large Language Models
Youssef Mroueh

TL;DR
This paper provides information-theoretic bounds on policy alignment in large language models, showing how reward improvements relate to divergence measures under tail assumptions and extending results to various divergences and reward proxies.
Contribution
It establishes new upper bounds on reward improvements for policy alignment using $f$-divergences, including Rényi divergence, under tail assumptions, and connects proxy and true rewards.
Findings
Reward improvement scales with $\sqrt{\mathrm{KL}}$.
Reward bounds hold under sub-gaussian tail assumptions.
Bounds extend to any $f$-divergence via order statistics and data processing inequality.
Abstract
Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathrm{KL}$ divergence. The best of $n$ alignment policy selects a sample from the reference policy that has the maximum reward among $n$ independent samples. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy on the reference one scales like $\sqrt{\mathrm{KL}}$, with an explicit bound in $n$ on the $\mathrm{KL}$ for the best of $n$ policy. We show in this paper that the $\sqrt{\mathrm{KL}}$ information theoretic upper bound holds if the reward under the reference policy has sub-gaussian tails. Moreover, we prove for the best of $n$ policy, that the $\mathrm{KL}$ upper bound can be…
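The best-of-$n$ selection rule and the square-root-of-KL scaling described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's code: the "reference policy" is a toy standard Gaussian over scalar responses, the reward is the response itself (so it is sub-gaussian with $\sigma = 1$), and all function names are hypothetical. The truncated abstract does not spell out the constants, so the sketch assumes the commonly cited forms $\mathrm{KL} \le \log n - (n-1)/n$ for best of $n$ and $\sigma\sqrt{2\,\mathrm{KL}}$ for the reward improvement.

```python
import math
import random


def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n independent samples from the reference policy; keep the highest-reward one."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)


def kl_bound_best_of_n(n):
    """Assumed form of the explicit bound in n on KL(best-of-n || reference)."""
    return math.log(n) - (n - 1) / n


def sqrt_kl_reward_bound(sigma, kl):
    """Assumed sub-gaussian bound on reward improvement: sigma * sqrt(2 * KL)."""
    return sigma * math.sqrt(2.0 * kl)


def empirical_improvement(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[reward under best-of-n] - E[reward under reference]."""
    rng = random.Random(seed)

    def sample_fn(r):
        # Toy reference policy (assumption): a standard Gaussian scalar response.
        return r.gauss(0.0, 1.0)

    def reward_fn(y):
        # Toy reward (assumption): the response value itself.
        return y

    total = sum(
        reward_fn(best_of_n(sample_fn, reward_fn, n, rng)) for _ in range(trials)
    )
    # The reference policy has mean reward 0, so the average is the improvement.
    return total / trials


if __name__ == "__main__":
    for n in (2, 4, 16):
        gain = empirical_improvement(n)
        bound = sqrt_kl_reward_bound(1.0, kl_bound_best_of_n(n))
        print(f"n={n:2d}  empirical gain={gain:.3f}  sqrt-KL bound={bound:.3f}")
```

In this toy setting the estimated improvement should stay below the assumed bound for each $n$, which matches the qualitative claim of the abstract; it says nothing about heavier-tailed rewards, where the sub-gaussian assumption fails.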
Taxonomy
Topics: Natural Language Processing Techniques
