TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

TL;DR
This paper introduces TLCR, a novel method that assigns continuous, context-aware rewards at the token level for reinforcement learning from human feedback, improving language model alignment.
Contribution
The paper proposes TLCR, a discriminator-based approach that provides nuanced, continuous token rewards, addressing limitations of previous discrete reward methods in RLHF.
Findings
TLCR outperforms previous methods on open-ended generation benchmarks.
Continuous token rewards improve language model alignment.
Discriminator confidence effectively guides reward assignment.
Abstract
Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign…
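To make the contrast between discrete and continuous token rewards concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract excerpt does not give the exact mapping from discriminator confidence to reward, so the rescaling `2 * p - 1`, the function names, the `margin` threshold, and the example logits are all assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: turns a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def continuous_token_rewards(logits: list[float]) -> list[float]:
    """Map per-token discriminator logits to rewards in (-1, 1).

    ASSUMPTION: p = sigmoid(logit) is read as the discriminator's confidence
    that a token is "positive", and the reward is the rescaling 2 * p - 1.
    The source does not specify this mapping.
    """
    return [2.0 * sigmoid(z) - 1.0 for z in logits]


def discrete_token_rewards(logits: list[float], margin: float = 1.0) -> list[int]:
    """Discrete baseline for comparison: +1, -1, or 0 per token.

    ASSUMPTION: `margin` is a hypothetical threshold on the logit; tokens
    whose logit lies within [-margin, margin] are treated as neutral.
    """
    return [1 if z > margin else -1 if z < -margin else 0 for z in logits]


# Hypothetical discriminator logits for a three-token response.
example_logits = [2.0, 0.1, -3.0]
continuous = continuous_token_rewards(example_logits)
discrete = discrete_token_rewards(example_logits)
```

In this sketch the discrete scheme collapses the second token to a neutral reward of 0, whereas the continuous scheme still gives it a small positive reward proportional to the discriminator's confidence.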
Taxonomy
Topics: Muscle activation and electromyography studies · Reinforcement Learning in Robotics
Methods: ALIGN
