Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data

Han Xia; Songyang Gao; Qiming Ge; Zhiheng Xi; Qi Zhang; Xuanjing Huang

arXiv:2408.14874 · cs.CL · August 30, 2024

Open Access

TL;DR

This paper introduces Inverse-Q*, a novel token-level reinforcement learning framework that improves large language model alignment without requiring preference data or complex reward models, enhancing efficiency and stability.

Contribution

Inverse-Q* is a new method that estimates optimal policies directly from model responses, reducing reliance on human annotations and external supervision.

Findings

01  Matches or exceeds PPO in convergence speed

02  Effectively aligns responses with human preferences

03  Reduces need for preference data and hyper-parameter tuning

Abstract

Reinforcement Learning from Human Feedback (RLHF) has proven effective in aligning large language models with human intentions, yet it often relies on complex methodologies like Proximal Policy Optimization (PPO) that require extensive hyper-parameter tuning and present challenges in sample efficiency and stability. In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value models. Inverse-Q* leverages direct preference optimization techniques but extends them by estimating the conditionally optimal policy directly from the model's responses, facilitating more granular and flexible policy shaping. Our approach reduces reliance on human annotation and external supervision, making it especially suitable for low-resource settings. We present extensive…
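
The abstract says Inverse-Q* builds on direct preference optimization (DPO) techniques. For orientation, here is a minimal sketch of the standard DPO loss, written so the token-level structure is visible: the sequence-level log-ratio is a sum of per-token log-probability ratios between the policy and a reference model. This is the well-known DPO objective, not the Inverse-Q* objective, which the truncated abstract does not specify. The function names, the `beta` default, and the toy log-probabilities are illustrative assumptions.

```python
import math


def sequence_log_ratio(policy_logps, ref_logps):
    """Sum of per-token log-ratios: log pi(y_t | ...) - log pi_ref(y_t | ...)."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is a list of per-token log-probabilities.
    beta is a hypothetical default, not a value taken from the paper.
    """
    margin = beta * (
        sequence_log_ratio(chosen_policy, chosen_ref)
        - sequence_log_ratio(rejected_policy, rejected_ref)
    )
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Toy usage: the policy favors the chosen response more than the reference does,
# so the loss falls below log(2), its value when the margin is zero.
loss = dpo_loss([-0.5, -0.5], [-1.0, -1.0], [-2.0], [-1.0])
```

Standard DPO needs a labeled chosen/rejected pair for every prompt. According to the abstract, Inverse-Q* reduces that dependence by estimating a conditionally optimal policy directly from the model's own responses; how it does so is not covered by this sketch.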

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Entropy Regularization · Proximal Policy Optimization · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings