KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

Jason R Brown; Lennie Wells; Edward James Young; Sergio Bacallado

arXiv:2508.17000·cs.CL·August 26, 2025

TL;DR

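This paper introduces KL-regularised Q-Learning (KLQ), a new action-value RL method for fine-tuning language models from human feedback. On summarisation and single-turn dialogue, KLQ performs on par with PPO at optimising the LM-RLHF objective and achieves a consistently higher win rate against PPO in LLM-as-a-judge evaluations.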

Contribution

The paper develops KLQ, a new action-value RL algorithm for LM-RLHF, shows that it is equivalent to a version of PPO in a specific sense despite its different motivation, and benchmarks it against PPO on summarisation and single-turn dialogue.

Findings

01 KLQ performs on par with PPO at optimising the LM-RLHF objective.

02 KLQ achieves a consistently higher win rate against PPO in LLM-as-a-judge evaluations.

03 KLQ offers a theoretically motivated alternative to the heuristically motivated PPO for language model fine-tuning.

Abstract

Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks: summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win rate against PPO on LLM-as-a-judge evaluations.
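To make the KL-divergence constraint mentioned in the abstract concrete, the sketch below works through the standard single-state result for a KL-regularised objective: maximising E_pi[Q] - beta * KL(pi || pi_ref) gives a policy proportional to pi_ref(a) * exp(Q(a) / beta), and the maximum equals beta * log sum_a pi_ref(a) * exp(Q(a) / beta). This is a textbook illustration only, not the KLQ algorithm from the paper; the function names, the toy Q-values, and the value of beta are all assumptions chosen for the example.

```python
import numpy as np

def kl_regularised_policy(q_values, ref_logprobs, beta):
    """Policy maximising E_pi[Q] - beta * KL(pi || pi_ref) at a single state."""
    logits = ref_logprobs + q_values / beta
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def soft_value(q_values, ref_logprobs, beta):
    """Maximum of the regularised objective: beta * log sum_a pi_ref(a) exp(Q(a)/beta)."""
    z = ref_logprobs + q_values / beta
    m = z.max()
    return float(beta * (m + np.log(np.exp(z - m).sum())))

def regularised_objective(policy, q_values, ref_logprobs, beta):
    """E_pi[Q] - beta * KL(pi || pi_ref) for a policy with strictly positive entries."""
    kl = np.sum(policy * (np.log(policy) - ref_logprobs))
    return float(np.dot(policy, q_values) - beta * kl)

# Toy example (hypothetical numbers): three candidate tokens, uniform reference policy.
q = np.array([1.0, 0.0, -1.0])
ref_logp = np.log(np.full(3, 1.0 / 3.0))
beta = 0.5

pi_star = kl_regularised_policy(q, ref_logp, beta)
v_star = soft_value(q, ref_logp, beta)
```

Evaluating `regularised_objective(pi_star, q, ref_logp, beta)` should reproduce `v_star`, while the reference policy itself scores lower. A larger `beta` keeps `pi_star` closer to the reference policy; a smaller `beta` concentrates it on the highest-valued token.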
