Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment
Yuang Cai, Yuyu Yuan, Jinsheng Shi, Qinhong Lin

TL;DR
This paper introduces a Bayesian inverse reinforcement learning approach with a novel variational method to improve large language model alignment by better utilizing feedback data and modeling intermediate rewards.
Contribution
It formulates LLM alignment as a Bayesian inverse reinforcement learning (BIRL) problem and proposes AVA, a new variational approach for direct reward modeling and intermediate reward estimation.
Findings
AVA outperforms existing methods in reward modeling.
Enhanced utilization of feedback data improves LLM alignment.
Better intermediate reward modeling leads to improved generalization.
Abstract
The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training…
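The two limitations the abstract attributes to existing approaches can be illustrated with a small sketch. It assumes the commonly used Bradley-Terry pairwise preference loss as a stand-in for "modeling the reward difference"; it is not the paper's proposed objective, and the function names and reward values are hypothetical, chosen only for illustration.

```python
import math


def sequence_reward(token_rewards):
    """Sum per-token rewards into a single sentence-level reward."""
    return sum(token_rewards)


def pairwise_preference_loss(chosen_reward, rejected_reward):
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected)).

    Assumed here as a typical reward-difference objective; computed as a
    numerically stable softplus of the negated difference.
    """
    diff = chosen_reward - rejected_reward
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Shifting both rewards by the same constant leaves the loss unchanged,
# so the absolute reward of each demonstration is never pinned down.
loss_small = pairwise_preference_loss(2.0, 1.0)
loss_shifted = pairwise_preference_loss(102.0, 101.0)

# A reward given only at the end of the sentence and a reward spread over
# intermediate tokens (hypothetical values) yield the same sentence-level
# reward, so a sentence-level loss cannot tell them apart.
end_only = [0.0, 0.0, 0.0, 1.5]
intermediate = [0.25, 0.5, 0.25, 0.5]
same_total = sequence_reward(end_only) == sequence_reward(intermediate)
```

Under these assumptions, `loss_small` and `loss_shifted` are identical and `same_total` is true, which is one way to see why a difference-only, end-of-sentence objective leaves part of the training signal unused.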
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Focus · ALIGN
