$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

Jin Peng Zhou; Kaiwen Wang; Jonathan Chang; Zhaolin Gao; Nathan Kallus; Kilian Q. Weinberger; Kianté Brantley; Wen Sun

arXiv:2502.20548 · cs.LG · October 21, 2025

TL;DR

This paper introduces $Q\sharp$, a theoretically grounded, value-based distributional RL algorithm for LLM post-training. It outperforms prior baselines on math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy.

Contribution

The paper presents $Q\sharp$, a novel value-based, distributional RL algorithm with provable optimality for KL-regularized RL, specifically tailored for LLM post-training.

Findings

01

$Q\sharp$ outperforms prior baselines on math reasoning benchmarks.

02

$Q\sharp$ maintains a smaller KL divergence to the reference policy than prior baselines.

03

The analysis provides the first theoretical bounds for deterministic MDPs under realizability.

Abstract

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online…
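
To make "guiding the reference policy using the optimal regularized $Q$ function" concrete, here is a minimal sketch of the standard closed form for KL-regularized RL, in which the optimal policy reweights the reference policy as $\pi^\star(a \mid s) \propto \pi_{\text{ref}}(a \mid s)\exp(Q^\star(s, a)/\eta)$. It is not the authors' implementation: the function names, the symbol `eta` for the regularization strength, and the three-token probabilities and $Q$-values are all hypothetical, and the $Q$-values are taken as given rather than learned with distributional RL.

```python
import math


def guided_policy(ref_probs, q_values, eta):
    """Reweight reference probabilities by exp(Q / eta) and renormalize.

    Assumes every reference probability is strictly positive and that
    q_values already holds the regularized Q estimate for each action.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + q / eta for p, q in zip(ref_probs, q_values)]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distribution and Q-values for one state.
ref_probs = [0.5, 0.3, 0.2]
q_values = [0.1, 0.9, 0.4]

for eta in (10.0, 1.0, 0.1):
    pi = guided_policy(ref_probs, q_values, eta)
    kl = kl_divergence(pi, ref_probs)
    print(f"eta={eta}: policy={[round(x, 3) for x in pi]}, KL to ref={kl:.3f}")
```

In this toy setting, a large `eta` leaves the guided policy close to the reference policy, while a small `eta` concentrates mass on the action with the highest $Q$-value at the cost of a larger KL divergence to the reference.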

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI)