ShiQ: Bringing back Bellman to LLMs

Pierre Clavier; Nathan Grinsztajn; Raphael Avalos; Yannis Flet-Berliac; Irem Ergun; Omar D. Domingues; Eugene Tarassov; Olivier Pietquin; Pierre H. Richemond; Florian Strub; Matthieu Geist

arXiv:2505.11081·cs.LG·May 19, 2025

TL;DR

ShiQ introduces a theoretically grounded Q-learning approach for fine-tuning large language models, leveraging Bellman equations to improve sample efficiency and offline learning capabilities.

Contribution

The paper develops a novel loss function based on Bellman equations tailored for LLMs, enabling effective Q-learning-based fine-tuning.

Findings

01. ShiQ outperforms traditional RL fine-tuning methods.

02. Effective in both synthetic and real-world benchmarks.

03. Supports off-policy, token-wise learning in LLMs.

Abstract

The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights…
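As background for the abstract's reference to Bellman equations and offline learning, here is a minimal sketch of textbook tabular Q-learning run over a fixed batch of transitions. It is not the ShiQ loss, which the truncated abstract does not specify. The function name, the toy three-state chain, and all hyperparameters are hypothetical and chosen only for illustration.

```python
def q_learning_offline(transitions, n_states, n_actions,
                       gamma=0.9, lr=0.5, epochs=200):
    """Textbook tabular Q-learning over a fixed batch of
    (state, action, reward, next_state, done) tuples.

    No new samples are drawn during training, which is the offline,
    off-policy setting the abstract refers to.
    """
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(epochs):
        for s, a, r, s_next, done in transitions:
            # Bellman optimality target: r + gamma * max_a' Q(s', a').
            # Terminal transitions do not bootstrap from the next state.
            target = r if done else r + gamma * max(q[s_next])
            # Move Q(s, a) a fraction lr of the way toward the target.
            q[s][a] += lr * (target - q[s][a])
    return q


# Hypothetical toy chain: 0 -> 1 -> 2 (terminal).
# Action 1 moves forward, action 0 stays in place.
transitions = [
    (0, 1, 0.0, 1, False),
    (0, 0, 0.0, 0, False),
    (1, 1, 1.0, 2, True),
    (1, 0, 0.0, 1, False),
]

q = q_learning_offline(transitions, n_states=3, n_actions=2)
# Greedy action per state, read off the learned Q-table.
greedy = [row.index(max(row)) for row in q]
```

In this toy setting the greedy policy is recovered from logged data alone. The abstract's point is that carrying this kind of update over to an LLM's logits is not straightforward, which is what motivates the paper's derived loss functions.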

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Q-Learning