TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

Eunseop Yoon; Hee Suk Yoon; SooHwan Eom; Gunsoo Han; Daniel Wontae Nam; Daejin Jo; Kyoung-Woon On; Mark A. Hasegawa-Johnson; Sungwoong Kim; Chang D. Yoo

arXiv:2407.16574·cs.CL·December 10, 2024

TL;DR

This paper introduces TLCR, a novel method that assigns continuous, context-aware rewards at the token level for reinforcement learning from human feedback, improving language model alignment.

Contribution

The paper proposes TLCR, a discriminator-based approach that provides nuanced, continuous token rewards, addressing limitations of previous discrete reward methods in RLHF.

Findings

1. TLCR outperforms previous methods on open-ended generation benchmarks.

2. Continuous token rewards improve language model alignment.

3. Discriminator confidence effectively guides reward assignment.

Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign…
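The contrast the abstract draws between discrete and continuous token rewards can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical discriminator that emits a (negative, positive) logit pair for each generated token, and it uses the difference of the two class probabilities as the continuous reward. The function names, the two-class setup, and the `margin` threshold used for the discrete baseline are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def continuous_token_rewards(token_logits):
    """Map per-token discriminator logits to rewards in [-1, 1].

    token_logits: list of (neg_logit, pos_logit) pairs, one per token.
    ASSUMPTION: reward = P(positive) - P(negative); the source text is
    truncated before it specifies the exact mapping.
    """
    rewards = []
    for neg_logit, pos_logit in token_logits:
        p_neg, p_pos = softmax([neg_logit, pos_logit])
        rewards.append(p_pos - p_neg)
    return rewards


def discrete_token_rewards(token_logits, margin=0.5):
    """Discrete baseline (+1 / 0 / -1), as in the prior work the abstract
    describes. The `margin` threshold is a hypothetical choice."""
    discrete = []
    for r in continuous_token_rewards(token_logits):
        if r > margin:
            discrete.append(1.0)
        elif r < -margin:
            discrete.append(-1.0)
        else:
            discrete.append(0.0)
    return discrete


# Hypothetical logits for three tokens: confident positive, uncertain,
# confident negative.
logits = [(0.0, 3.0), (0.2, 0.0), (2.0, -1.0)]
cont = continuous_token_rewards(logits)
disc = discrete_token_rewards(logits)
print(cont)
print(disc)
```

Under these assumptions, the discrete scheme collapses the uncertain second token to exactly zero and treats the other two as equally strong signals, while the continuous scheme keeps a graded value for every token that scales with the discriminator's confidence.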

Code & Models

Repositories

esyoon7/rlhf-tlcr (PyTorch, official)

Videos

TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback (Underline)

Taxonomy

Topics: Muscle activation and electromyography studies · Reinforcement Learning in Robotics

Methods: ALIGN