Dense Reward for Free in Reinforcement Learning from Human Feedback

Alex J. Chan; Hao Sun; Samuel Holt; Mihaela van der Schaar

arXiv:2402.00782 · cs.LG · February 2, 2024 · 2 citations

TL;DR

This paper introduces a method that densifies the reward signal in reinforcement learning from human feedback (RLHF) by using the reward model's attention maps, improving training stability and efficiency at no extra cost.

Contribution

The method uses the attention weights inside the reward model to redistribute the reward along each completion, improving reinforcement learning from sparse feedback without additional modeling or computation.
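The redistribution idea can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository listed under Code & Models for that); the function name `redistribute_reward`, the choice of attention vector, and the fallback rule are all assumptions made for illustration. It takes one scalar reward for a full sequence and splits it across completion tokens in proportion to per-token attention weights, so that the per-token rewards still sum to the original score.

```python
import numpy as np


def redistribute_reward(scalar_reward, attention, completion_mask):
    """Spread one sequence-level reward over the completion tokens.

    Hypothetical helper, for illustration only.

    attention: one non-negative weight per token. Which attention vector to
        use is an assumption here (for example, the reward model's last-layer
        attention from the final position, averaged over heads).
    completion_mask: True for completion tokens, False for prompt tokens.
    """
    attention = np.asarray(attention, dtype=float)
    mask = np.asarray(completion_mask, dtype=bool)

    # Only completion tokens are "actions", so prompt tokens get no reward.
    weights = np.where(mask, attention, 0.0)
    total = weights.sum()

    if total <= 0.0:
        # Assumed fallback: keep the classic sparse reward on the last
        # completion token when no attention mass lands on the completion.
        dense = np.zeros_like(weights)
        completion_idx = np.flatnonzero(mask)
        if completion_idx.size:
            dense[completion_idx[-1]] = scalar_reward
        return dense

    # Normalise so the dense rewards sum to the original scalar reward.
    return scalar_reward * weights / total


# Two prompt tokens followed by three completion tokens.
dense = redistribute_reward(
    scalar_reward=2.0,
    attention=[0.1, 0.1, 0.2, 0.4, 0.2],
    completion_mask=[False, False, True, True, True],
)
print(dense, dense.sum())
```

Normalising over completion tokens only is a design choice in this sketch: it keeps the total reward per episode unchanged, so the dense signal differs from the sparse one only in where along the completion the credit is placed.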

Findings

01 Densifies the reward signal using the reward model's attention maps.

02 Stabilizes and accelerates RL training.

03 May lead to better local optima in practice.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the LLM has to take many "actions" (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. In this work we leverage the fact that the reward model contains more information than just its scalar output; in particular, it calculates an attention map over tokens as part of the transformer architecture. We use these attention weights to redistribute the reward along the…

Code & Models

Repositories

xanderjc/attention-based-credit (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications