$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun

TL;DR
This paper introduces $Q\sharp$, a theoretically grounded, distributional RL algorithm for post-training LLMs that outperforms existing methods in reasoning tasks while maintaining low divergence from the reference policy.
Contribution
The paper presents $Q\sharp$, a novel value-based, distributional RL algorithm with provable optimality for KL-regularized RL, specifically tailored for LLM post-training.
Findings
$Q\sharp$ outperforms prior baselines in math reasoning benchmarks.
It maintains a smaller KL divergence to the reference policy.
Via a reduction from KL-regularized RL to no-regret online learning, it provides the first theoretical bounds for deterministic MDPs under realizability.
Abstract
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online…
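To make "guiding the reference policy with a regularized $Q$ function" concrete, here is a minimal sketch, not the authors' implementation. It assumes the standard closed form for KL-regularized RL, in which the optimal policy is the reference distribution reweighted by exp(Q / eta), with eta the KL-regularization strength. The function names, the toy numbers, and the symbol `eta` are illustrative assumptions and do not come from the paper; the Q-values are supplied by hand rather than learned with distributional RL.

```python
import math


def guided_next_token_probs(ref_probs, q_values, eta):
    """Reweight reference next-token probabilities by exp(Q / eta).

    Assumption: the standard KL-regularized form pi(y|x) ∝ pi_ref(y|x) * exp(Q(x, y) / eta).
    ref_probs must be strictly positive and sum to 1; eta > 0.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    # Work in log space and subtract the max score for numerical stability.
    scores = [math.log(p) + q / eta for p, q in zip(ref_probs, q_values)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical three-token vocabulary with hand-picked Q-values.
    ref = [0.5, 0.3, 0.2]
    q = [0.0, 1.0, 0.0]
    for eta in (0.5, 1.0, 5.0):
        guided = guided_next_token_probs(ref, q, eta)
        print(eta, guided, kl_divergence(guided, ref))
```

In this sketch a larger `eta` keeps the guided distribution closer to the reference, while a smaller `eta` shifts more probability onto tokens with higher Q-values at the cost of a larger KL divergence.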
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI)
