T-REG: Preference Optimization with Token-Level Reward Regularization

Wenxuan Zhou; Shujian Zhang; Lingxiao Zhao; Tao Meng

arXiv:2412.02685 · cs.CL · December 4, 2024

TL;DR

T-REG introduces a token-level reward regularization method that uses LLM self-refinement to generate token-level rewards for preference optimization, improving alignment and performance on instruction-following benchmarks.

Contribution

The paper proposes T-REG, an approach that uses self-generated token-level rewards to improve preference optimization in LLMs. It targets a weakness of earlier token-level methods, which depend on a trained credit assignment model or on AI annotators.

Findings

01

Outperforms baseline methods by up to 3.8% and 4.4% on instruction-following benchmarks.

02

Utilizes self-refinement of LLMs for token-level reward generation.

03

Enhances alignment performance through reward regularization.

Abstract

Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach relies on a single, sparse reward, which makes it difficult for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference…
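
As a rough illustration of how a sequence-level preference loss might be combined with token-level rewards, the sketch below adds a token-level penalty to a DPO-style objective. It is not the paper's objective, which the abstract here does not spell out. The implicit per-token reward `beta * (log pi - log pi_ref)`, the squared-error regularizer, the weight `lam`, and all function and key names are assumptions made for illustration only.

```python
import math

def token_rewards(policy_logps, ref_logps, beta):
    """Implicit DPO-style per-token reward: beta * (log pi - log pi_ref)."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

def sequence_loss(chosen_rewards, rejected_rewards):
    """Sequence-level term: -log sigmoid(sum(chosen) - sum(rejected))."""
    margin = sum(chosen_rewards) - sum(rejected_rewards)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

def token_regularizer(rewards, targets):
    """Hypothetical penalty: mean squared gap between the implicit token
    rewards and externally supplied (e.g. self-generated) token rewards."""
    return sum((r - t) ** 2 for r, t in zip(rewards, targets)) / len(rewards)

def regularized_loss(chosen, rejected, beta=0.1, lam=0.5):
    """Sequence-level preference loss plus a weighted token-level penalty.

    `chosen` and `rejected` are dicts with per-token lists under the
    hypothetical keys "policy", "ref" (log-probabilities) and "targets"
    (token-level rewards to regularize toward).
    """
    rc = token_rewards(chosen["policy"], chosen["ref"], beta)
    rr = token_rewards(rejected["policy"], rejected["ref"], beta)
    reg = (token_regularizer(rc, chosen["targets"])
           + token_regularizer(rr, rejected["targets"]))
    return sequence_loss(rc, rr) + lam * reg
```

The sequence-level term only sees the summed reward of each response, which is the sparse signal the abstract describes. The added penalty gives each token its own target, so the model receives a per-token training signal as well.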

Taxonomy

Topics: Data Management and Algorithms · Advanced Bandit Algorithms Research · Constraint Satisfaction and Optimization