Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment

Yuang Cai; Yuyu Yuan; Jinsheng Shi; Qinhong Lin

arXiv:2411.09341·cs.LG·November 15, 2024

TL;DR

This paper introduces a Bayesian inverse reinforcement learning approach with a novel variational method to improve large language model alignment by better utilizing feedback data and modeling intermediate rewards.

Contribution

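The paper formulates LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and proposes Approximated Variational Alignment (AVA), a variational approach that models the reward of each demonstration directly and estimates intermediate rewards.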

Findings

01. AVA outperforms existing methods in reward modeling.

02. Enhanced utilization of feedback data improves LLM alignment.

03. Better intermediate reward modeling leads to improved generalization.

Abstract

The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training…
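
To make the abstract's first criticism concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used in preference-based reward modeling (a Bradley-Terry style objective). It is an illustration under stated assumptions, not the paper's AVA objective: the function name `pairwise_preference_loss` and the scalar rewards are hypothetical. The point it shows is that such a loss depends only on the difference between the two rewards, so it cannot pin down the absolute reward of either demonstration.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Hypothetical Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: each demonstration has already been scored with one scalar reward.
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) equals softplus(-diff) = log(1 + exp(-diff)).
    # Written in a numerically stable form that avoids overflow for large |diff|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Shifting both rewards by the same constant leaves the loss unchanged,
# because only the reward difference enters the objective.
loss_a = pairwise_preference_loss(2.0, 0.5)
loss_b = pairwise_preference_loss(102.0, 100.5)

# When the two rewards are equal, the loss is log(2): the model is indifferent.
loss_tie = pairwise_preference_loss(1.0, 1.0)
```

Because `loss_a` and `loss_b` coincide, any reward function that preserves the gaps between chosen and rejected demonstrations fits the data equally well, which is the sense in which the abstract says the true reward of each demonstration is not modeled directly.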

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Focus · ALIGN